[HACKERS] Parallel query fails on standby server
Hi All,

While testing the parallel scan feature on a standby server, I found that a parallel query fails with the error "ERROR: failed to initialize transaction_read_only to 0". The following steps reproduce the issue:

Master:
edb=# create table ert(n int);
edb=# insert into ert values (generate_series(1,500));
edb=# analyze ert;
edb=# vacuum ert;

Slave:
edb=# set max_parallel_degree = 5;
SET
edb=# explain analyze verbose select * from ert where n <= 1000;
ERROR:  failed to initialize transaction_read_only to 0
CONTEXT:  parallel worker, PID 26042

Root cause analysis:

After debugging the worker, I observed that in RestoreGUCState(), if a GUC variable can't be skipped it is initialized with a default value. While the GUC variable "transaction_read_only" is being initialized, its check hook check_transaction_read_only() is called, which eventually fails due to the check below, which says "transaction_read_only" can't be set while recovery is in progress:

    if (RecoveryInProgress())
    {
        GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
        GUC_check_errmsg("cannot set transaction read-write mode during recovery");
        return false;
    }

Solution:

Make use of the global variable "InitializingParallelWorker" to bypass the RecoveryInProgress() check while a parallel worker is being initialized. PFA patch to fix the issue.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 903b3a6..c7173c9 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -482,11 +482,13 @@ show_log_timezone(void)
  * nothing since XactReadOnly will be reset by the next StartTransaction().
  * The IsTransactionState() test protects us against trying to check
  * RecoveryInProgress() in contexts where shared memory is not accessible.
+ * We can also skip the check for RecoveryInProgress while initializing the
+ * parallel workers by making use of the global variable InitializingParallelWorker.
  */
 bool
 check_transaction_read_only(bool *newval, void **extra, GucSource source)
 {
-	if (*newval == false && XactReadOnly && IsTransactionState())
+	if (*newval == false && XactReadOnly && IsTransactionState() && !InitializingParallelWorker)
 	{
 		/* Can't go to r/w mode inside a r/o transaction */
 		if (IsSubTransaction())

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance degradation in commit 6150a1b0
Hi All,

I have been working on this issue for the last few days, trying to investigate the probable reasons for the performance degradation at commit 6150a1b0. After going through Andres's patch for moving the buffer I/O and content locks out of the main tranche, the following two things come to my mind.

1. The content lock is no longer used as a pointer in the BufferDesc structure; instead the LWLock structure is embedded directly. This increases the overall structure size from 64 bytes to 80 bytes. To investigate this, I reverted the content-lock changes from commit 6150a1b0 and took at least 10 readings, and with this change the overall performance is similar to what was observed before commit 6150a1b0.

2. Secondly, the BufferDesc structure padding is 64 bytes, whereas the PG cache line alignment is 128 bytes. After changing the BufferDesc padding size to 128 bytes, along with the changes mentioned in point #1 above, the overall performance is again similar to what was observed before commit 6150a1b0.

Please have a look at the attached test report, which contains the performance test results for all the scenarios discussed above, and let me know your thoughts.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Sat, Feb 27, 2016 at 9:26 AM, Andres Freund wrote:
> On February 26, 2016 7:55:18 PM PST, Amit Kapila wrote:
> >On Sat, Feb 27, 2016 at 12:41 AM, Andres Freund wrote:
> >>
> >> Hi,
> >>
> >> On 2016-02-25 12:56:39 +0530, Amit Kapila wrote:
> >> > For the past few weeks, we were facing some performance degradation in
> >> > the read-only performance benchmarks on high-end machines. My colleague
> >> > Mithun tried reverting commit ac1d794, which seems to degrade the
> >> > performance in HEAD on high-end m/c's as reported previously [1], but
> >> > we were still getting degradation. Then we did some profiling to see
> >> > what caused it, and we found that it's mainly caused by the spin lock
> >> > when called via pin/unpin buffer. We then tried reverting commit
> >> > 6150a1b0, which recently changed the structures in that area, and it
> >> > turns out that after reverting that patch we don't see any degradation
> >> > in performance. The important point to note is that the performance
> >> > degradation doesn't occur every time, but if the tests are repeated
> >> > twice or thrice, it is easily visible.
> >> >
> >> > m/c details
> >> > IBM POWER-8
> >> > 24 cores, 192 hardware threads
> >> > RAM - 492GB
> >> >
> >> > Non-default postgresql.conf settings:
> >> > shared_buffers=16GB
> >> > max_connections=200
> >> > min_wal_size=15GB
> >> > max_wal_size=20GB
> >> > checkpoint_timeout=900
> >> > maintenance_work_mem=1GB
> >> > checkpoint_completion_target=0.9
> >> >
> >> > scale_factor - 300
> >> >
> >> > Performance at commit 43cd468cf01007f39312af05c4c92ceb6de8afd8 is
> >> > 469002 at 64-client count, and then at
> >> > 6150a1b08a9fe7ead2b25240be46dddeae9d98e1 it went down to 200807. These
> >> > performance numbers are the median of 3 15-min pgbench read-only
> >> > tests. Similar data is seen even when we revert the patch on the
> >> > latest commit. We have yet to perform detailed analysis as to why
> >> > commit 6150a1b08a9fe7ead2b25240be46dddeae9d98e1 led to degradation,
> >> > but any ideas are welcome.
> >>
> >> Ugh. Especially the varying performance is odd. Does it vary between
> >> restarts, or is it just happenstance? If it's the former, we might be
> >> dealing with some alignment issues.
> >>
> >
> >It varies between restarts.
> >
> >>
> >> If not, I wonder if the issue is massive buffer header contention. As a
> >> LL/SC architecture, acquiring the content lock might interrupt buffer
> >> spinlock acquisition and vice versa.
> >>
> >> Does applying the patch from
> >> http://archives.postgresql.org/message-id/CAPpHfdu77FUi5eiNb%2BjRPFh5S%2B1U%2B8ax4Zw%3DAUYgt%2BCPsKiyWw%40mail.gmail.com
> >> change the picture?
> >>
> >
> >Not tried, but if this is an alignment issue as you are suspecting above,
> >then does it make sense to try this out?
>
> It's the other theory I had. And it's additionally useful testing
> regardless of this regression...
>
> ---
> Please excuse brevity and formatting - I am writing this on my mobile
> phone.

Performance_Results.xlsx
Description: MS-Excel 2007 spreadsheet
Re: [HACKERS] Performance degradation in commit 6150a1b0
Hi,

I am getting some reject files while trying to apply "pinunpin-cas-5.patch" attached with the thread
http://www.postgresql.org/message-id/capphfdsrot1jmsnrnccqpnzeu9vut7tx6b-n1wyouwwfhd6...@mail.gmail.com

Note: I am applying this patch on top of commit "6150a1b08a9fe7ead2b25240be46dddeae9d98e1".

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Fri, Mar 25, 2016 at 9:29 AM, Amit Kapila wrote:
> On Wed, Mar 23, 2016 at 1:59 PM, Ashutosh Sharma wrote:
> >
> > Hi All,
> >
> > I have been working on this issue for the last few days, trying to
> > investigate the probable reasons for the performance degradation at
> > commit 6150a1b0. After going through Andres's patch for moving buffer I/O
> > and content lock out of the main tranche, the following two things come
> > to my mind.
> >
> > 1. The content lock is no longer used as a pointer in the BufferDesc
> > structure; instead the LWLock structure is embedded directly. This
> > increases the overall structure size from 64 bytes to 80 bytes. To
> > investigate this, I reverted the content-lock changes from commit
> > 6150a1b0 and took at least 10 readings, and with this change the overall
> > performance is similar to what was observed before commit 6150a1b0.
> >
> > 2. Secondly, the BufferDesc structure padding is 64 bytes whereas the PG
> > cache line alignment is 128 bytes. After changing the BufferDesc padding
> > size to 128 bytes, along with the changes mentioned in point #1 above,
> > the overall performance is again similar to what was observed before
> > commit 6150a1b0.
> >
> > Please have a look at the attached test report, which contains the
> > performance test results for all the scenarios discussed above, and let
> > me know your thoughts.
> >
>
> So this indicates that changing back the content lock to LWLock* in
> BufferDesc brings back the performance, which indicates that the increase
> in BufferDesc size to more than 64 bytes on this platform has caused the
> regression. I think it is worth trying the patch [1] as suggested by
> Andres, as that will reduce the size of BufferDesc, which can bring back
> the performance. Can you try the same?
>
> [1] - http://www.postgresql.org/message-id/capphfdsrot1jmsnrnccqpnzeu9vut7tx6b-n1wyouwwfhd6...@mail.gmail.com
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit 6150a1b0
Hi,

As mentioned in my earlier mail, I was not able to apply pinunpin-cas-5.patch on commit 6150a1b0; therefore I thought of applying it on the latest commit, and I was able to do so successfully. I have now taken the performance readings at the latest commit, i.e. 76281aa9, with and without applying pinunpin-cas-5.patch, and my observations are as follows:

1. The current performance still lags by 2-3% behind the expected performance when pinunpin-cas-5.patch is applied on commit 76281aa9.

2. When pinunpin-cas-5.patch is not applied and performance is measured at commit 76281aa9, the overall performance lags by 50-60% behind the expected performance.

Note: Here, the expected performance is the performance observed before commit 6150a1b0 with ac1d794 reverted.

Please refer to the attached performance report sheet for more insights.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Sat, Mar 26, 2016 at 9:31 PM, Ashutosh Sharma wrote:
> Hi,
>
> I am getting some reject files while trying to apply "pinunpin-cas-5.patch"
> attached with the thread
> http://www.postgresql.org/message-id/capphfdsrot1jmsnrnccqpnzeu9vut7tx6b-n1wyouwwfhd6...@mail.gmail.com
>
> Note: I am applying this patch on top of commit
> "6150a1b08a9fe7ead2b25240be46dddeae9d98e1".
>
> With Regards,
> Ashutosh Sharma
> EnterpriseDB: http://www.enterprisedb.com
>
> On Fri, Mar 25, 2016 at 9:29 AM, Amit Kapila wrote:
>
>> On Wed, Mar 23, 2016 at 1:59 PM, Ashutosh Sharma wrote:
>> >
>> > Hi All,
>> >
>> > I have been working on this issue for the last few days, trying to
>> > investigate the probable reasons for the performance degradation at
>> > commit 6150a1b0. After going through Andres's patch for moving buffer
>> > I/O and content lock out of the main tranche, the following two things
>> > come to my mind.
>> >
>> > 1. The content lock is no longer used as a pointer in the BufferDesc
>> > structure; instead the LWLock structure is embedded directly. This
>> > increases the overall structure size from 64 bytes to 80 bytes. To
>> > investigate this, I reverted the content-lock changes from commit
>> > 6150a1b0 and took at least 10 readings, and with this change the
>> > overall performance is similar to what was observed before commit
>> > 6150a1b0.
>> >
>> > 2. Secondly, the BufferDesc structure padding is 64 bytes whereas the
>> > PG cache line alignment is 128 bytes. After changing the BufferDesc
>> > padding size to 128 bytes, along with the changes mentioned in point #1
>> > above, the overall performance is again similar to what was observed
>> > before commit 6150a1b0.
>> >
>> > Please have a look at the attached test report, which contains the
>> > performance test results for all the scenarios discussed above, and let
>> > me know your thoughts.
>> >
>>
>> So this indicates that changing back the content lock to LWLock* in
>> BufferDesc brings back the performance, which indicates that the increase
>> in BufferDesc size to more than 64 bytes on this platform has caused the
>> regression. I think it is worth trying the patch [1] as suggested by
>> Andres, as that will reduce the size of BufferDesc, which can bring back
>> the performance. Can you try the same?
>>
>> [1] - http://www.postgresql.org/message-id/capphfdsrot1jmsnrnccqpnzeu9vut7tx6b-n1wyouwwfhd6...@mail.gmail.com
>>
>> With Regards,
>> Amit Kapila.
>> EnterpriseDB: http://www.enterprisedb.com

Performance_Results_with_pinunpin_cas_changes.xlsx
Description: MS-Excel 2007 spreadsheet
Re: [HACKERS] Performance degradation in commit 6150a1b0
Hi,

I am unable to revert 6150a1b0 on top of a recent commit on the master branch. It seems some commit made recently depends on 6150a1b0.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Sun, Mar 27, 2016 at 5:45 PM, Andres Freund wrote:
> Hi,
>
> On 2016-03-27 02:34:32 +0530, Ashutosh Sharma wrote:
> > As mentioned in my earlier mail I was not able to apply
> > pinunpin-cas-5.patch on commit 6150a1b0,
>
> That's not surprising; that's pretty old.
>
> > therefore I thought of applying it on the latest commit and I was
> > able to do it successfully. I have now taken the performance readings
> > at the latest commit, i.e. 76281aa9, with and without applying
> > pinunpin-cas-5.patch, and my observations are as follows:
> >
> > 1. I can still see that the current performance lags by 2-3% from the
> > expected performance when pinunpin-cas-5.patch is applied on commit
> > 76281aa9.
> > 2. When pinunpin-cas-5.patch is not applied and performance is measured
> > at commit 76281aa9, the overall performance lags by 50-60% from the
> > expected performance.
> >
> > Note: Here, the expected performance is the performance observed before
> > commit 6150a1b0 with ac1d794 reverted.
>
> Thanks for doing these benchmarks. What's the performance if you revert
> 6150a1b0 on top of a recent master? There've been a lot of other patches
> influencing performance since 6150a1b0, so minor performance differences
> aren't necessarily meaningful; especially when that older version then
> had other patches reverted.
>
> Thanks,
>
> Andres
Re: [HACKERS] "Some tests to cover hash_index"
Hi All,

I have re-verified the code coverage for the hash index code using the test file (commit-hash_coverage_test) attached to this mailing list, and have found that some of the code in the _hash_squeezebucket() function flow is not being covered. For this, I have added a small test case on top of the 'commit-hash_coverage_test' patch. I have done this mainly to test Amit's WAL-for-hash-index patch [1].

I have also removed the warning message we used to get for hash indexes, 'WARNING: hash indexes are not WAL-logged and their use is discouraged', as this message is no longer emitted for hash indexes after the WAL patch. Please have a look and let me know your thoughts.

[1] - https://www.postgresql.org/message-id/CAA4eK1JOBX%3DYU33631Qh-XivYXtPSALh514%2BjR8XeD7v%2BK3r_Q%40mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Sat, Aug 6, 2016 at 9:41 AM, Amit Kapila wrote:
> On Thu, Aug 4, 2016 at 7:24 PM, Mithun Cy wrote:
>
>> I am attaching the patch to improve some coverage of hash index code [1].
>> I have added some basic tests, which mainly cover overflow pages. It
>> took 2 sec extra time on my machine in the parallel schedule.
>>
>>                                            Hit    Total   Coverage
>> old tests        Line coverage              780    1478    52.7 %
>>                  Function coverage           63      85    74.1 %
>> after new tests  Line coverage             1181    1478    79.9 %
>>                  Function coverage           78      85    91.8 %
>>
>
> I think the code coverage improvement for hash index with these tests
> seems to be quite good; however, the time for the tests seems to be
> slightly on the higher side. Does anybody have a better suggestion for
> these tests?
>
> diff --git a/src/test/regress/sql/concurrent_hash_index.sql
> b/src/test/regress/sql/concurrent_hash_index.sql
> I wonder why you have included a new file for these tests; why can't
> these be added to the existing hash_index.sql?
>
> +--
> +-- Cause some overflow insert and splits.
> +--
> +CREATE TABLE con_hash_index_table (keycol INT);
> +CREATE INDEX con_hash_index on con_hash_index_table USING HASH (keycol);
>
> The relation name con_hash_index* chosen in the above tests doesn't seem
> appropriate; how about hash_split_heap* or something like that?
>
> Register your patch in the latest CF (https://commitfest.postgresql.org/10/)
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] "Some tests to cover hash_index"
Hi,

I missed attaching the patch in my previous mail. Here is the patch.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Tue, Aug 23, 2016 at 11:47 AM, Ashutosh Sharma wrote:
> Hi All,
>
> I have re-verified the code coverage for the hash index code using the
> test file (commit-hash_coverage_test) attached to this mailing list, and
> have found that some of the code in the _hash_squeezebucket() function
> flow is not being covered. For this, I have added a small test case on top
> of the 'commit-hash_coverage_test' patch. I have done this mainly to test
> Amit's WAL-for-hash-index patch [1].
>
> I have also removed the warning message we used to get for hash indexes,
> 'WARNING: hash indexes are not WAL-logged and their use is discouraged',
> as this message is no longer emitted for hash indexes after the WAL
> patch. Please have a look and let me know your thoughts.
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1JOBX%3DYU33631Qh-XivYXtPSALh514%2BjR8XeD7v%2BK3r_Q%40mail.gmail.com
>
> With Regards,
> Ashutosh Sharma
> EnterpriseDB: http://www.enterprisedb.com
>
> On Sat, Aug 6, 2016 at 9:41 AM, Amit Kapila wrote:
>
>> On Thu, Aug 4, 2016 at 7:24 PM, Mithun Cy wrote:
>>
>>> I am attaching the patch to improve some coverage of hash index code [1].
>>> I have added some basic tests, which mainly cover overflow pages. It
>>> took 2 sec extra time on my machine in the parallel schedule.
>>>
>>>                                            Hit    Total   Coverage
>>> old tests        Line coverage              780    1478    52.7 %
>>>                  Function coverage           63      85    74.1 %
>>> after new tests  Line coverage             1181    1478    79.9 %
>>>                  Function coverage           78      85    91.8 %
>>>
>>
>> I think the code coverage improvement for hash index with these tests
>> seems to be quite good; however, the time for the tests seems to be
>> slightly on the higher side. Does anybody have a better suggestion for
>> these tests?
>>
>> diff --git a/src/test/regress/sql/concurrent_hash_index.sql
>> b/src/test/regress/sql/concurrent_hash_index.sql
>> I wonder why you have included a new file for these tests; why can't
>> these be added to the existing hash_index.sql?
>>
>> +--
>> +-- Cause some overflow insert and splits.
>> +--
>> +CREATE TABLE con_hash_index_table (keycol INT);
>> +CREATE INDEX con_hash_index on con_hash_index_table USING HASH (keycol);
>>
>> The relation name con_hash_index* chosen in the above tests doesn't seem
>> appropriate; how about hash_split_heap* or something like that?
>>
>> Register your patch in the latest CF (https://commitfest.postgresql.org/10/)
>>
>> --
>> With Regards,
>> Amit Kapila.
>> EnterpriseDB: http://www.enterprisedb.com

diff --git a/src/test/regress/expected/concurrent_hash_index.out b/src/test/regress/expected/concurrent_hash_index.out
index c3b8036..60191c0 100644
--- a/src/test/regress/expected/concurrent_hash_index.out
+++ b/src/test/regress/expected/concurrent_hash_index.out
@@ -3,7 +3,6 @@
 --
 CREATE TABLE con_hash_index_table (keycol INT);
 CREATE INDEX con_hash_index on con_hash_index_table USING HASH (keycol);
-WARNING: hash indexes are not WAL-logged and their use is discouraged
 INSERT INTO con_hash_index_table VALUES (1);
 INSERT INTO con_hash_index_table SELECT * from con_hash_index_table;
 INSERT INTO con_hash_index_table SELECT * from con_hash_index_table;
@@ -75,5 +74,15 @@ DROP TABLE hash_ovfl_temp_heap CASCADE;
 CREATE TABLE hash_ovfl_heap_float4 (x float4, y int);
 INSERT INTO hash_ovfl_heap_float4 VALUES (1.1,1);
 CREATE INDEX hash_idx ON hash_ovfl_heap_float4 USING hash (x);
-WARNING: hash indexes are not WAL-logged and their use is discouraged
 DROP TABLE hash_ovfl_heap_float4 CASCADE;
+--
+-- Test hash index insertion with squeeze bucket (XLOG_HASH_MOVE_PAGE_CONTENTS
+-- WAL record type).
+--
+CREATE TABLE hash_split_buckets (seqno int4, random int4);
+CREATE INDEX hash_idx ON hash_split_buckets USING hash (random int4_ops)
+with (fillfactor = 10);
+INSERT INTO hash_split_buckets (seqno, random) SELECT a, a*5 FROM
+GENERATE_SERIES(1, 10) a;
+REINDEX INDEX hash_idx;
+DROP TABLE hash_split_buckets;
diff --git a/src/test/regress/sql/concurrent_hash_index.sql b/src/test/regress/sql/concurrent_hash_index.sql
index 8f930d5..d3b09d0 100644
--- a/src/test/regress/sql/concurrent_hash_index.sql
+++ b/src/test/regress/sql/concurrent_hash_index.sql
@@ -78,3 +78,15 @@
 CREATE TABLE hash_ovfl_heap_float4 (x float4, y int);
 INSERT INTO hash_ovfl_heap_float4 VALUES (1.1,1);
 CREATE INDEX hash_idx ON hash_ovfl_heap_float4 USING hash (x);
 DROP TABLE hash_ovfl_heap_float4
Re: [HACKERS] Write Ahead Logging for Hash Indexes
Hi All,

Following are the steps I followed to verify the WAL logging of hash indexes:

1. I used Mithun's patch to improve coverage of the hash index code [1] to verify the WAL logging of hash indexes. Firstly, I confirmed whether all the XLOG records associated with hash indexes are covered by this patch. Where an XLOG record for a hash index operation was not covered, I added a test case for it. I found that one XLOG record, 'XLOG_HASH_MOVE_PAGE_CONTENTS', was not being covered, and added a small test case for it. The patch for this is available at [2].

2. I executed the regression test suite and found all the hash indexes created as part of the regression test suite using the query below:

SELECT t.relname index_name, t.oid
FROM pg_class t
JOIN pg_am idx ON idx.oid = t.relam
WHERE idx.amname = 'hash';

3. Thirdly, I calculated the number of pages associated with each hash index and compared every page of the hash index on the master and standby servers using the pg_filedump tool. For example, if the number of pages associated with 'con_hash_index' is 10, here is what I did:

On master:
----------
select pg_relation_filepath('con_hash_index');
 pg_relation_filepath
----------------------
 base/16408/16433
(1 row)

./pg_filedump -if -R 0 9 /home/edb/git-clone-postgresql/postgresql/TMP/postgres/master/base/16408/16433 > /tmp/file1

On slave:
---------
select pg_relation_filepath('con_hash_index');
 pg_relation_filepath
----------------------
 base/16408/16433
(1 row)

./pg_filedump -if -R 0 9 /home/edb/git-clone-postgresql/postgresql/TMP/postgres/standby/base/16408/16433 > /tmp/file2

I then compared file1 and file2 using a diff tool.

Following is the list of hash indexes created inside the regression database when the regression test suite was executed on the master server:

hash_i4_index
hash_name_index
hash_txt_index
hash_f8_index
con_hash_index
hash_idx

In short, this is all I did, and I found no issues during testing. Please let me know if you need any further details.

I would like to thank Amit for his support and guidance during the testing phase.

[1] - https://www.postgresql.org/message-id/CAA4eK1JOBX%3DYU33631Qh-XivYXtPSALh514%2BjR8XeD7v%2BK3r_Q%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAE9k0PkNjryhSiG53mjnKFhi%2BMipJMjSa%3DYkH-UeW3bfr1HPJQ%40mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] "Some tests to cover hash_index"
Hi,

> Well, that change should be part of Amit's patch to add WAL logging,
> not this patch, whose mission is just to improve test coverage.

I have just removed the warning message from the expected output file, as I performed the testing on Amit's latest patch, which removes this warning message from the hash index code.
Re: [HACKERS] Write Ahead Logging for Hash Indexes
> Thanks to Ashutosh Sharma for doing the testing of the patch and
> helping me in analyzing some of the above issues.

Hi All,

I would like to summarize the test cases that I have executed to validate the WAL logging in hash index feature.

1) I mainly ran the pgbench test with a read-write workload at scale factor 1000 and various client counts (16, 64, and 128) for durations of 30 minutes, 1 hour, and 24 hours. I executed this test on a highly configured power2 machine with 128 cores and 512GB of RAM. I ran the test case both with and without a replication setup. Please note that I changed the schema of the pgbench tables created during the initialization phase. The new schema of the pgbench tables looks as shown below on both master and standby:

postgres=# \d pgbench_accounts
     Table "public.pgbench_accounts"
  Column  |     Type      | Modifiers
----------+---------------+-----------
 aid      | integer       | not null
 bid      | integer       |
 abalance | integer       |
 filler   | character(84) |
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
    "pgbench_accounts_bid" hash (bid)

postgres=# \d pgbench_history
           Table "public.pgbench_history"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 tid    | integer                     |
 bid    | integer                     |
 aid    | integer                     |
 delta  | integer                     |
 mtime  | timestamp without time zone |
 filler | character(22)               |
Indexes:
    "pgbench_history_bid" hash (bid)

2) I also made use of the following tools for analyzing the hash index tables created on master and standby.

a) pgstattuple: Using this tool, I could verify the tuple-level statistics, which include the index's physical length, dead tuple percentage, and other information, on both master and standby. However, I saw the error message below when using this tool on one of the hash index tables, on both the master and standby server:

postgres=# SELECT * FROM pgstattuple('pgbench_history_bid');
ERROR:  index "pgbench_history_bid" contains unexpected zero page at block 2482297

To investigate this, I made use of the pageinspect contrib module and found that the above block contains an uninitialized page. But this is quite possible in a hash index: bucket splits happen in powers of two, so we may find some uninitialized bucket pages.

b) pg_filedump: Using this tool, I could take a dump of all the hash index tables on master and standby and compare the dump files to verify whether there is any difference between the hash index pages on the two servers.

In short, this is all I did to verify the patch for WAL logging in hash indexes. I would like to thank Amit and Robert for their valuable inputs during the hash index testing phase. I am also planning to verify WAL logging in hash indexes using Kuntal's patch for the WAL consistency check once it is stabilized.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Write Ahead Logging for Hash Indexes
Hi All,

I am getting the following error when running the test script shared by Jeff [1]. The error is observed after executing the test script for around 3-4 hours:

57869 INSERT XX000 2016-09-14 07:58:01.211 IST:ERROR:  lock buffer_content 1 is not held
57869 INSERT XX000 2016-09-14 07:58:01.211 IST:STATEMENT:  insert into foo (index) select $1 from generate_series(1,1)

124388 INSERT XX000 2016-09-14 11:24:13.593 IST:ERROR:  lock buffer_content 10 is not held
124388 INSERT XX000 2016-09-14 11:24:13.593 IST:STATEMENT:  insert into foo (index) select $1 from generate_series(1,1)

124381 INSERT XX000 2016-09-14 11:24:13.594 IST:ERROR:  lock buffer_content 10 is not held
124381 INSERT XX000 2016-09-14 11:24:13.594 IST:STATEMENT:  insert into foo (index) select $1 from generate_series(1,1)

[1] - https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com

Please note that I am performing the test on the latest patches for concurrent hash index and WAL logging in hash index, shared by Amit yesterday.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, Sep 14, 2016 at 12:04 AM, Jesper Pedersen wrote:
> On 09/13/2016 07:41 AM, Amit Kapila wrote:
>>> README:
>>> +in_complete split flag. The reader algorithm works correctly, as it
>>> will scan
>>>
>>> What flag?
>>>
>>
>> The in-complete-split flag, which indicates that the split has to be
>> finished for that particular bucket. The values of these flags are
>> LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for the new and
>> old bucket respectively. It is set in hasho_flag in the special area of
>> the page. I have slightly expanded the definition in the README, but am
>> not sure if it is a good idea to mention flags defined in hash.h. Let me
>> know if it is still unclear or you want something additional added to
>> the README.
>>
>
> I think it is better now.
>
>>> hashxlog.c:
>>>
>>> hash_xlog_move_page_contents
>>> hash_xlog_squeeze_page
>>>
>>> Both have "bukcetbuf" (-> "bucketbuf"), and
>>>
>>> + if (BufferIsValid(bukcetbuf));
>>>
>>> ->
>>>
>>> + if (BufferIsValid(bucketbuf))
>>>
>>> and indent following line:
>>>
>>> LockBufferForCleanup(bukcetbuf);
>>>
>>> hash_xlog_delete
>>>
>>> has the "if" issue too.
>>>
>>
>> Fixed all the above cosmetic issues.
>>
>
> I meant there is an extra ';' on the "if" statements:
>
> + if (BufferIsValid(bukcetbuf)); <--
>
> in hash_xlog_move_page_contents, hash_xlog_squeeze_page and
> hash_xlog_delete.
>
> Best regards,
> Jesper
Re: [HACKERS] Write Ahead Logging for Hash Indexes
Hi All,

Below is the backtrace for the issue reported in my earlier mail [1]. From the call stack, it looks like we are trying to release the lock on a meta page twice in _hash_expandtable().

(gdb) bt
#0  0x007b01cf in LWLockRelease (lock=0x7f55f59d0570) at lwlock.c:1799
#1  0x0078990c in LockBuffer (buffer=2, mode=0) at bufmgr.c:3540
#2  0x004a5d3c in _hash_chgbufaccess (rel=0x7f5605b608c0, buf=2, from_access=1, to_access=-1) at hashpage.c:331
#3  0x004a722d in _hash_expandtable (rel=0x7f5605b608c0, metabuf=2) at hashpage.c:995
#4  0x004a316a in _hash_doinsert (rel=0x7f5605b608c0, itup=0x2ba5030) at hashinsert.c:313
#5  0x004a0d85 in hashinsert (rel=0x7f5605b608c0, values=0x7ffdf5409c70, isnull=0x7ffdf5409c50 "", ht_ctid=0x2c2e4f4, heapRel=0x7f5605b58250, checkUnique=UNIQUE_CHECK_NO) at hash.c:248
#6  0x004c5a16 in index_insert (indexRelation=0x7f5605b608c0, values=0x7ffdf5409c70, isnull=0x7ffdf5409c50 "", heap_t_ctid=0x2c2e4f4, heapRelation=0x7f5605b58250, checkUnique=UNIQUE_CHECK_NO) at indexam.c:204
#7  0x0063f2d5 in ExecInsertIndexTuples (slot=0x2b9a2c0, tupleid=0x2c2e4f4, estate=0x2b99c00, noDupErr=0 '\000', specConflict=0x0, arbiterIndexes=0x0) at execIndexing.c:388
#8  0x0066a1fc in ExecInsert (mtstate=0x2b99e50, slot=0x2b9a2c0, planSlot=0x2b9a2c0, arbiterIndexes=0x0, onconflict=ONCONFLICT_NONE, estate=0x2b99c00, canSetTag=1 '\001') at nodeModifyTable.c:481
#9  0x0066b841 in ExecModifyTable (node=0x2b99e50) at nodeModifyTable.c:1496
#10 0x00645e51 in ExecProcNode (node=0x2b99e50) at execProcnode.c:396
#11 0x006424d9 in ExecutePlan (estate=0x2b99c00, planstate=0x2b99e50, use_parallel_mode=0 '\000', operation=CMD_INSERT, sendTuples=0 '\000', numberTuples=0, direction=ForwardScanDirection, dest=0xd73c20) at execMain.c:1567
#12 0x0064087a in standard_ExecutorRun (queryDesc=0x2b89c70, direction=ForwardScanDirection, count=0) at execMain.c:338
#13 0x7f55fe590605 in pgss_ExecutorRun (queryDesc=0x2b89c70, direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877
#14 0x00640751 in ExecutorRun (queryDesc=0x2b89c70, direction=ForwardScanDirection, count=0) at execMain.c:284
#15 0x007c331e in ProcessQuery (plan=0x2b873f0, sourceText=0x2b89bd0 "insert into foo (index) select $1 from generate_series(1,1)", params=0x2b89c20, dest=0xd73c20, completionTag=0x7ffdf540a3f0 "") at pquery.c:187
#16 0x007c4a0c in PortalRunMulti (portal=0x2ae7930, isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0xd73c20, altdest=0xd73c20, completionTag=0x7ffdf540a3f0 "") at pquery.c:1303
#17 0x007c4055 in PortalRun (portal=0x2ae7930, count=9223372036854775807, isTopLevel=1 '\001', dest=0x2b5dc90, altdest=0x2b5dc90, completionTag=0x7ffdf540a3f0 "") at pquery.c:815
#18 0x007bfb45 in exec_execute_message (portal_name=0x2b5d880 "", max_rows=9223372036854775807) at postgres.c:1977
#19 0x007c25c7 in PostgresMain (argc=1, argv=0x2b05df8, dbname=0x2aca3c0 "postgres", username=0x2b05de0 "edb") at postgres.c:4133
#20 0x00744d8f in BackendRun (port=0x2aefa60) at postmaster.c:4260
#21 0x00744523 in BackendStartup (port=0x2aefa60) at postmaster.c:3934
#22 0x00740f9c in ServerLoop () at postmaster.c:1691
#23 0x00740623 in PostmasterMain (argc=4, argv=0x2ac81f0) at postmaster.c:1299
#24 0x006940fe in main (argc=4, argv=0x2ac81f0) at main.c:228

Please let me know if you need any further inputs.

[1] - https://www.postgresql.org/message-id/CAE9k0Pmxh-4NAr4GjzDDFHdBKDrKy2FV-Z%2B2Tp8vb2Kmxu%3D6zg%40mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, Sep 14, 2016 at 2:45 PM, Ashutosh Sharma wrote:
> Hi All,
>
> I am getting the following error when running the test script shared by
> Jeff [1]. The error is observed upon executing the test script for
> around 3-4 hrs.
>
> 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:ERROR: lock
> buffer_content 1 is not held
> 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:STATEMENT: insert into
> foo (index) select $1 from generate_series(1,1)
>
> 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:ERROR: lock
> buffer_content 10 is not held
> 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:STATEMENT: insert
> into foo (index) select $1 from generate_series(1,1)
>
> 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:ERROR: lock
> buffer_content 10 is not held
> 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:STATEMENT: insert
> into foo (index) select $1 from generate_series(1,1)
>
> [1]-
> https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gma
Re: [HACKERS] pageinspect: Hash index support
Hi,

I am getting a fatal error along with some warnings when trying to apply the v3 patch using the "git apply" command.

[ashu@localhost postgresql]$ git apply 0001-pageinspect-Hash-index-support_v3.patch
0001-pageinspect-Hash-index-support_v3.patch:29: trailing whitespace.
brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES)
0001-pageinspect-Hash-index-support_v3.patch:36: trailing whitespace.
DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \
0001-pageinspect-Hash-index-support_v3.patch:37: trailing whitespace.
pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
0001-pageinspect-Hash-index-support_v3.patch:38: trailing whitespace.
pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \
0001-pageinspect-Hash-index-support_v3.patch:39: trailing whitespace.
pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql
fatal: git apply: bad git-diff - expected /dev/null on line 46

Also, I think the USAGE for hash_metap() is referring to hash_metap_bytea(). Please correct it. I have just started reviewing the patch and will keep posting my comments upon further review. Thanks.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Tue, Sep 20, 2016 at 10:15 PM, Jeff Janes wrote:
> On Tue, Sep 20, 2016 at 5:40 AM, Jesper Pedersen wrote:
>>
>> On 09/20/2016 03:19 AM, Michael Paquier wrote:
>>>
>>> You did not get right the comments from Alvaro upthread.
>>> The following
>>> functions are added with this patch:
>>> function hash_metap(text)
>>> function hash_metap_bytea(bytea)
>>> function hash_page_items(text,integer)
>>> function hash_page_items_bytea(bytea)
>>> function hash_page_stats(text,integer)
>>> function hash_page_stats_bytea(bytea,integer)
>>>
>>> Now the following set of functions would be sufficient:
>>> function hash_metapage_info(bytea)
>>> function hash_page_items(bytea)
>>> function hash_page_stats(bytea)
>>> The last time pageinspect has been updated, when BRIN functions have
>>> been added, it has been discussed to just use (bytea) as an argument
>>> interface and just rely on get_raw_page() to get the pages wanted, so
>>> I think that we had better stick with that and keep things simple.
>>>
>>
>> Yes, I know, Alvaro and you voted for the bytea methods, and Jeff asked
>> for both.
>>
>> Attached is v3 with only the bytea based methods.
>>
>> Alvaro, Michael and Jeff - Thanks for the review!
>
> Is the 2nd "1" in this call needed?
>
> SELECT * FROM hash_page_stats(get_raw_page('mytab_index', 1), 1)
>
> As far as I can tell, that argument is only used to stuff into the output
> field "blkno"; it is not used to instruct the interpretation of the raw page
> itself. It doesn't seem worthwhile to have a parameter that only echoes
> back to the user what the user already put in (twice). The only existing
> functions which take the blkno argument are those that don't use the
> get_raw_page style.
>
> Also, should we document what the single-letter values mean in the
> hash_page_stats.type column? It is not obvious that 'i' means bitmap, for
> example.
>
> Cheers,
>
> Jeff

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
bucket page? I guess we can't have such a type of hash page, and in case we do find such an undefined page type, I think we should error out instead of reading the page, because it is quite possible that the page is corrupted. Please let me know your thoughts on this.

5). I think we have added enough functions to show the page-level statistics, but not the index-level statistics such as the total number of overflow pages, bucket pages, free overflow pages, bitmap pages etc. in the hash index. How about adding a function that would report the index-level statistics?

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
Hi,

> git apply w/ v4 works here, so you will have to investigate what happens on
> your side.

Thanks, it works with the v4 patch.

> As these functions are marked as superuser only it is expected that people
> provide the correct input, especially since the raw page structure is used
> as the input.

Well, page_stats / page_items do accept a bitmap page as input, but these functions are not defined to read bitmap pages, since a bitmap page does not have a standard page layout. In fact, if we pass a bitmap page as input to page_stats / page_items, the output is always the same. I think we need a separate function to read bitmap pages. And I am not sure why the server should crash if I pass a meta page to hash_page_stats or hash_page_items.

> The "if" statement will need updating once the CHI patch goes in, as it
> defines new constants for page types.

I am not sure the CHI patch adds a new type of hash page, other than holding information about a split in progress. Basically, my point was: can we have hash pages of types other than meta page, bucket page, overflow page and bitmap page? If the pageinspect tool finds a page that does not fall under any of these categories, shouldn't it error out?

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Write Ahead Logging for Hash Indexes
Hi All,

> I forgot to mention that Ashutosh has tested this patch for a day
> using Jeff's tool and he didn't find any problem. Also, he has found
> a way to easily reproduce the problem. Ashutosh, can you share your
> changes to the script using which you have reproduced the problem?

I made slight changes in Jeff Janes's patch (crash_REL10.patch) to reproduce the issue reported by him earlier [1]. I changed crash_REL10.patch such that a torn page write happens only for a bitmap page in a hash index. As described by Amit in [2] and [3], we were not logging a full page image for the bitmap page, and that is why we were not able to restore a bitmap page in case it is torn, which would eventually result in the error below.

38422 01000 2016-09-19 16:25:50.061 PDT:WARNING: page verification failed, calculated checksum 65067 but expected 21260
38422 01000 2016-09-19 16:25:50.061 PDT:CONTEXT: xlog redo at 3F/22053B50 for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
38422 XX001 2016-09-19 16:25:50.071 PDT:FATAL: invalid page in block 9 of relation base/16384/17334

Jeff Janes's original patch had the following condition for a torn page write:

    if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress())
        nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);

I added some more conditions to the above "if" to ensure that the issue associated with the bitmap page gets reproduced easily:

    if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress() &&
        bucket_opaque->hasho_page_id == HASHO_PAGE_ID &&
        bucket_opaque->hasho_flag & LH_BITMAP_PAGE)
        nbytes = FileWrite(v->mdfd_vfd, buffer, 24);

Also, please note that I am allowing only 24 bytes of data, i.e. just a page header, to be written into the disk file, so that the issue will be observed as soon as a single overflow page is added to the hash index.

Note: You can find Jeff Janes's patch for torn page writes at [4].
[1] - https://www.postgresql.org/message-id/CAMkU%3D1zGcfNTWWxnud8j_p0vb1ENGcngkwqZgPUEnwZmAN%2BXQw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1LmQZGnYhSHXDDCOsSb_0U-gsxReEmSDRgCZr%3DAdKbTEg%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAA4eK1%2Bk9tGPw7vOACRo%2B4h5n%3DutHnSuGgMoJrLANybAjNBW9w%40mail.gmail.com
[4] - https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_basebackup creates a corrupt file for pg_stat_tmp and pg_replslot on a backup location
Hi Peter,

> I just wanted to update you, I have taken this commit fest entry patch
> to review because I think it will be addressed as part of "Exclude
> additional directories in pg_basebackup", which I'm also reviewing.
> Therefore, I'm not actually planning on discussing this patch further.
> Please correct me if this assessment does not match your expectations.

Thanks for the update. I am absolutely OK with it. I feel it would be a good idea to review "Exclude additional directories in pg_basebackup", which also addresses the issue reported by me.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] dead or outdated URLs found in win32.h
Hi All,

The following URLs, present in src/include/port/win32.h to share information on dllimport/dllexport (used on Windows), seem to be either dead or outdated, and I think they need to be updated.

http://support.microsoft.com/kb/132044 -- dead
http://msdn.microsoft.com/en-us/library/8fskxacy(v=vs.80).aspx -- outdated
http://msdn.microsoft.com/en-us/library/a90k134d(v=vs.80).aspx -- outdated
https://msdn.microsoft.com/en-gb/library/aa489609.aspx -- outdated

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Microvacuum support for Hash Index
Hi All,

I have added microvacuum support for the hash index access method, and attached is the v1 patch for the same. The patch basically takes care of the following things:

1. Firstly, it changes the marking of dead tuples from 'tuple-at-a-time' to 'page-at-a-time' during a hash index scan. For this, we accumulate the heap TIDs and offsets of all the hash index tuples flagged by kill_prior_tuple during the scan, and then mark all accumulated TIDs as LP_DEAD either while stepping from one page to another (covering scans in both the forward and backward direction), at the end of the hash index scan, or during a rescan.

2. Secondly, when inserting a tuple into the hash index, if not enough space is found on the current page, it ensures that we first clean any dead tuples found on the current hash index page before moving to the next page in the bucket chain or going for a bucket split. This basically increases page reusability and reduces the number of page splits, thereby reducing the overall size of the hash index.

I have compared the hash index size with and without my patch (microvacuum_hash_index_v1.patch, attached with this mail) on a high-end machine at various scale factors, and the results are shown below. For this test, I created a hash index (pgbench_accounts_aid) on the aid column of the 'pgbench_accounts' table instead of the primary key, and the results shown in the table below are for the same. The patch (pgbench.patch) with these changes is also attached with this mail. Moreover, I am using my own script file (file_hash_kill_prior_tuple) for updating the index column with the pgbench read-write command. The script file 'file_hash_kill_prior_tuple' is also attached with this mail.
Here are some initial test results showing the benefit of this patch:

postgresql.conf and pgbench settings:
autovacuum = off
client counts = 64
run time duration = 15 mins
./pgbench -c $threads -j $threads -T 900 postgres -f ~/file_hash_kill_prior_tuple

Scale Factor | hash index size @ start | HEAD    | HEAD + Patch
10           | 32 MB                   | 579 MB  | 158 MB
50           | 128 MB                  | 630 MB  | 350 MB
100          | 256 MB                  | 1255 MB | 635 MB
300          | 1024 MB                 | 2233 MB | 1093 MB

As shown in the above results, at scale factor 10 the hash index size is reduced by almost 4 times, whereas at scale factors 50 and 300 it is reduced by half with my patch. This basically shows that we can reduce the hash index size to a good extent with this patch.

System specifications:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel

Note: The patch (microvacuum_hash_index_v1.patch) is prepared on top of concurrent_hash_index_v8.patch [1] and wal_hash_index_v5.1.patch [2] for hash index.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BX%3D8sUd1UCZDZnE3D9CGi9kw%2Bkjxp2Tnw7SX5w8pLBNw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1KE%3D%2BkkowyYD0vmch%3Dph4ND3H1tViAB%2B0cWTHqjZDDfqg%40mail.gmail.com

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index db73f05..a0720ef 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -325,14 +325,21 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 	if (scan->kill_prior_tuple)
 	{
 		/*
-		 * Yes, so mark it by setting the LP_DEAD state in the item flags.
+		 * Yes, so remember it for later. (We'll deal with all such
+		 * tuples at once right after leaving the index page or at
+		 * end of scan.)
 		 */
-		ItemIdMarkDead(PageGetItemId(page, offnum));
+		if (so->killedItems == NULL)
+			so->killedItems = palloc(MaxIndexTuplesPerPage *
+									 sizeof(HashScanPosItem));
 
-		/*
-		 * Since this can be redone later if needed, mark as a hint.
-		 */
-		MarkBufferDirtyHint(buf, true);
+		if (so->numKilled < MaxIndexTuplesPerPage)
+		{
+			so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
+			so->killedItems[so->numKilled].indexOffset =
+				ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+			so->numKilled++;
+		}
 	}
 
 	/*
@@ -439,6 +446,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 
 	so->hashso_skip_moved_tuples = false;
 
+	so->killedItems = NULL;
+	so->numKilled = 0;
+
 	scan->opaque = so;
 
 	return scan;
@@ -454,6 +464,10 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
+	/* Before leaving current page, deal with any killed items */
+	if (so->numKilled > 0)
+		hashkillitems(scan);
+
 	_hash_dropscanbuf(rel, so);
 
 	/* set p
Re: [HACKERS] Microvacuum support for Hash Index
Hi Amit,

Thanks for showing your interest and reviewing my patch. I have started looking into your review comments. I will share the updated patch in a day or two.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Fri, Oct 28, 2016 at 4:42 PM, Amit Kapila wrote:
> On Mon, Oct 24, 2016 at 2:21 PM, Ashutosh Sharma wrote:
>> Hi All,
>>
>> I have added a microvacuum support for hash index access method and
>> attached is the v1 patch for the same.
>>
>
> This is an important functionality for hash index as we already have
> the same functionality for other types of indexes like btree.
>
>> The patch basically takes care
>> of the following things:
>>
>> 1. Firstly, it changes the marking of dead tuples from
>> 'tuple-at-a-time' to 'page-at-a-time' during hash index scan. For this
>> we accumulate the heap tids and offset of all the hash index tuples if
>> it is pointed by kill_prior_tuple during scan and then mark all
>> accumulated tids as LP_DEAD either while stepping from one page to
>> another (assuming the scan in both forward and backward direction) or
>> during end of the hash index scan or during rescan.
>>
>> 2. Secondly, when inserting tuple into hash index table, if not enough
>> space is found on a current page then it ensures that we first clean
>> the dead tuples if found in the current hash index page before moving
>> to the next page in a bucket chain or going for a bucket split. This
>> basically increases the page reusability and reduces the number of
>> page splits, thereby reducing the overall size of hash index table.
>>
>
> Few comments on patch:
>
> 1.
> +static void
> +hash_xlog_vacuum_one_page(XLogReaderState *record)
> +{
> +	XLogRecPtr	lsn = record->EndRecPtr;
> +	xl_hash_vacuum *xldata = (xl_hash_vacuum *) XLogRecGetData(record);
> +	Buffer		bucketbuf = InvalidBuffer;
> +	Buffer		buffer;
> +	Buffer		metabuf;
> +	Page		page;
> +	XLogRedoAction action;
>
> While replaying the delete/vacuum record on standby, it can conflict
> with some already running queries. Basically the replay can remove
> some row which can be visible on standby. You need to resolve
> conflicts similar to what we do in btree delete records (refer
> btree_xlog_delete).
>
> 2.
> +	/*
> +	 * Write-lock the meta page so that we can decrement
> +	 * tuple count.
> +	 */
> +	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
> +
> +	_hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
> +						  (buf == bucket_buf) ? true : false);
> +
> +	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
>
> It seems here the meta page lock is acquired for longer than required,
> and it is also not required when there are no deletable items on the
> page. You can take the meta page lock just before decrementing the count.
>
> 3.
> 	Assert(offnum <= maxoff);
> +
>
> Spurious space. There are some other similar spurious whitespace
> changes in the patch; remove them as well.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
Hi,

> While replaying the delete/vacuum record on standby, it can conflict
> with some already running queries. Basically the replay can remove
> some row which can be visible on standby. You need to resolve
> conflicts similar to what we do in btree delete records (refer
> btree_xlog_delete).

Agreed. Thanks for pointing this out. I have taken care of it in the attached v2 patch.

> +	/*
> +	 * Write-lock the meta page so that we can decrement
> +	 * tuple count.
> +	 */
> +	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
> +
> +	_hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
> +						  (buf == bucket_buf) ? true : false);
> +
> +	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
>
> It seems here meta page lock is acquired for duration more than
> required and also it is not required when there are no deletable items
> on page. You can take the metapage lock before decrementing the count.

OK, corrected. Please refer to the attached v2 patch.

> Spurious space. There are some other similar spurious white space
> changes in patch, remove them as well.

Corrected. Please refer to the attached v2 patch.
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index db73f05..4a4d614 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -157,7 +157,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	if (buildstate.spool)
 	{
 		/* sort the tuples and insert them into the index */
-		_h_indexbuild(buildstate.spool);
+		_h_indexbuild(buildstate.spool, heap->rd_node);
 		_h_spooldestroy(buildstate.spool);
 	}
@@ -196,6 +196,8 @@ hashbuildCallback(Relation index,
 	Datum		index_values[1];
 	bool		index_isnull[1];
 	IndexTuple	itup;
+	Relation	rel;
+	RelFileNode rnode;
 
 	/* convert data to a hash key; on failure, do not insert anything */
 	if (!_hash_convert_tuple(index,
@@ -212,8 +214,12 @@ hashbuildCallback(Relation index,
 	/* form an index tuple and point it at the heap tuple */
 	itup = index_form_tuple(RelationGetDescr(index), index_values, index_isnull);
+	/* Get RelFileNode from relation OID */
+	rel = relation_open(htup->t_tableOid, NoLock);
+	rnode = rel->rd_node;
+	relation_close(rel, NoLock);
 	itup->t_tid = htup->t_self;
-	_hash_doinsert(index, itup);
+	_hash_doinsert(index, itup, rnode);
 	pfree(itup);
 }
@@ -245,7 +251,7 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
 	itup = index_form_tuple(RelationGetDescr(rel), index_values, index_isnull);
 	itup->t_tid = *ht_ctid;
-	_hash_doinsert(rel, itup);
+	_hash_doinsert(rel, itup, heapRel->rd_node);
 
 	pfree(itup);
@@ -325,14 +331,21 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 	if (scan->kill_prior_tuple)
 	{
 		/*
-		 * Yes, so mark it by setting the LP_DEAD state in the item flags.
+		 * Yes, so remember it for later. (We'll deal with all such
+		 * tuples at once right after leaving the index page or at
+		 * end of scan.)
 		 */
-		ItemIdMarkDead(PageGetItemId(page, offnum));
+		if (so->killedItems == NULL)
+			so->killedItems = palloc(MaxIndexTuplesPerPage *
+									 sizeof(HashScanPosItem));
 
-		/*
-		 * Since this can be redone later if needed, mark as a hint.
-		 */
-		MarkBufferDirtyHint(buf, true);
+		if (so->numKilled < MaxIndexTuplesPerPage)
+		{
+			so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
+			so->killedItems[so->numKilled].indexOffset =
+				ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+			so->numKilled++;
+		}
 	}
 
 	/*
@@ -439,6 +452,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 
 	so->hashso_skip_moved_tuples = false;
 
+	so->killedItems = NULL;
+	so->numKilled = 0;
+
 	scan->opaque = so;
 
 	return scan;
@@ -454,6 +470,10 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
+	/* Before leaving current page, deal with any killed items */
+	if (so->numKilled > 0)
+		hashkillitems(scan);
+
 	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
@@ -480,6 +500,10 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
+	/* Before leaving current page, deal with any killed items */
+	if (so->numKilled > 0)
+		hashkillitems(scan);
+
 	_hash_dropscanbuf(rel, so);
 
 	pfree(so);
@@ -809,6 +833,15 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
 
+			/*
+			 * Let us mark the page as clean if vacuum removes the DEAD tuples
+			 * from an index page. We do this by clearing LH_PAGE_HAS_DEAD_TUPLES
+			 * flag.
+			 */
+			if (tuples_removed && *tuples_removed > 0 &&
+				opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
+				opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
diff --git a/src/backend/acce
Re: [HACKERS] Re: [COMMITTERS] pgsql: Change the way that LWLocks for extensions are allocated.
Hi,

I have started the review of this patch and would like to share some initial review comments that require the author's attention.

1) I am getting some trailing whitespace errors when trying to apply this patch using the git apply command. Following are the error messages I am getting:

[edb@localhost postgresql]$ git apply test_pg_stat_statements-v1.patch
test_pg_stat_statements-v1.patch:28: trailing whitespace.
	$(MAKE) -C $(top_builddir)/src/test/regress all
test_pg_stat_statements-v1.patch:41: space before tab in indent.
	$(REGRESSCHECKS)
warning: 2 lines add whitespace errors.

2) The test case you have added is failing, because the queryid is going to be different every time you execute the test case. I think it would be better to remove the queryid column from "SELECT queryid, query, calls, rows from pg_stat_statements ORDER BY rows". Below is the output I got upon test case execution:

make regresscheck
== running regression test queries ==
test pg_stat_statements ... FAILED
== shutting down postmaster ==

3) OK. As you mentioned upthread, I do agree that we can't add pg_stat_statements tests for installcheck, as this module requires shared_preload_libraries to be set. But, if there is an already existing installation with the pg_stat_statements module loaded, we should allow the test to run on that installation if requested explicitly by the user. I think we can add a target like 'installcheck-force' in the Makefile that would help end users run the test on their own installation. Thoughts?

Also, I have changed the status in the commitfest from "Needs review" to "Waiting on Author".

On Mon, Sep 19, 2016 at 6:26 PM, Amit Kapila wrote:
> On Sat, Sep 10, 2016 at 5:02 PM, Amit Kapila wrote:
>> On Tue, Aug 30, 2016 at 8:01 AM, Andres Freund wrote:
>>> On 2016-08-30 07:57:19 +0530, Amit Kapila wrote:
>>>> I will write such a test case either in this week or early next week.
>>>
>>> Great.
>>>
>>
>> Okay, attached patch adds some simple tests for pg_stat_statements.
>> One thing to note is that we can't add pg_stat_statements tests for
>> installcheck as this module requires shared_preload_libraries to be
>> set. Hope this suffices the need.
>>
>
> Registered this patch for the next CF.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Applying XLR_INFO_MASK correctly when looking at WAL record information
Hi, > What about the patch attached to make things more consistent? I have reviewed this patch. It does serve the purpose and looks sane to me. I am marking it as ready for committer. With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Change the way that LWLocks for extensions are allocated.
Hi, > I had also thought about it while preparing patch, but I couldn't find > any clear use case. I think it could be useful during development of > a module, but not sure if it is worth to add a target. I think there > is no harm in adding such a target, but lets take an opinion from > committer or others as well before adding this target. What do you > say? Ok. We can do that. I have verified the updated v2 patch. The patch looks good to me. I am marking it as ready for committer. Thanks. With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
Hi Jesper,

> Some initial comments.
>
> _hash_vacuum_one_page:
>
> +	END_CRIT_SECTION();
> +	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
>
> The _hash_chgbufaccess() needs a comment.
>
> You also need a place where you call pfree for so->killedItems - maybe in
> hashkillitems().

Thanks for reviewing this patch. I would like to let you know that this patch depends on the patches for concurrent hash index and WAL logging in hash index. So, until those two hash index patches are stable, I won't be able to share the next version of the patch for supporting microvacuum in hash index.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Parallel worker crash on seqscan
Hi,

> the following query appears to reliably crash parallel workers on master
> as of 0832f2d.
>
> --8<---cut here---start->8---
> set max_parallel_workers_per_gather to 2;
> set force_parallel_mode to 1;
>
> select subq.context from pg_settings,
> lateral (select context from pg_opclass limit 1) as subq
> limit 1;
> --8<---cut here---end--->8---

As suggested, I have tried to reproduce this issue on commit 0832f2d but could not reproduce it. I also tried it on the latest commit in the master branch and could not reproduce it there either. Amit (included in this email thread) has also tried it once, and he was also not able to reproduce it. Could you please let me know if there is something more that needs to be done to reproduce it, other than what you have shared above? Thanks.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Should we cacheline align PGXACT?
Hi,

>> On Thu, Feb 16, 2017 at 5:07 AM, Alexander Korotkov wrote:
>>> On Wed, Feb 15, 2017 at 8:49 PM, Alvaro Herrera wrote:
>>>> Alexander Korotkov wrote:
>>>>
>>>>> Difference between master, pgxact-align-2 and pgxact-align-3 doesn't
>>>>> exceed per run variation.
>>>>
>>>> FWIW this would be more visible if you added error bars to each data
>>>> point. Should be simple enough in gnuplot ...
>>>
>>> Good point.
>>> Please find graph of mean and errors in attachment.
>>
>> So ... no difference?
>
> Yeah, nothing surprising. It's just another graph based on the same data.
> I wonder how pgxact-align-3 would work on machine of Ashutosh Sharma,
> because I observed regression there in write-heavy benchmark of
> pgxact-align-2.

I am yet to benchmark the pgxact-align-3 patch on a read-write workload. I could not do it as our benchmarking machine was already reserved for some other test work. But I am planning to do it this weekend and will try to post my results by Monday evening. Thank you, and sorry for the delayed response.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should we cacheline align PGXACT?
Hi,

> >> On Thu, Feb 16, 2017 at 5:07 AM, Alexander Korotkov wrote:
> >> > On Wed, Feb 15, 2017 at 8:49 PM, Alvaro Herrera wrote:
> >> >> Alexander Korotkov wrote:
> >> >>
> >> >> > Difference between master, pgxact-align-2 and pgxact-align-3 doesn't
> >> >> > exceed per run variation.
> >> >>
> >> >> FWIW this would be more visible if you added error bars to each data
> >> >> point. Should be simple enough in gnuplot ...
> >> >
> >> > Good point.
> >> > Please find graph of mean and errors in attachment.
> >>
> >> So ... no difference?
> >
> > Yeah, nothing surprising. It's just another graph based on the same data.
> > I wonder how pgxact-align-3 would work on machine of Ashutosh Sharma,
> > because I observed regression there in write-heavy benchmark of
> > pgxact-align-2.
>
> I am yet to benchmark pgxact-align-3 patch on a read-write workload. I
> could not do it as our benchmarking machine was already reserved for
> some other test work. But, I am planning to do it on this weekend.
> Will try to post my results by Monday evening. Thank you and sorry for
> the delayed response.

Following are the pgbench results for the read-write workload I got with the pgxact-align-3 patch. The results are for scale factor 300 with 8GB of shared buffers, i.e. when the data fits into the shared buffers. For scale factor 1000 with 8GB of shared buffers the test is still running; once it is completed I will share those results as well.

pgbench settings:
pgbench -i -s 300 postgres
pgbench -M prepared -c $thread -j $thread -T $time_for_reading postgres

where, time_for_reading = 30mins

non-default GUC params:
shared_buffers=8GB
max_connections=300

pg_xlog is located on SSD.
Machine details:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7-8830 @ 2.13GHz

CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
4              4283         4220          -1.47093159
8              7291         7261          -0.4114661912
16             11909        12149          2.015282559
32             20789        20745         -0.211650392
64             28412        27831         -2.044910601
128            29369        28765         -2.056590282
156            27949        27189         -2.719238613
180            27876        27171         -2.529057254
196            28849        27872         -3.386599189
256            30321        28188         -7.034728406

Also, an Excel sheet (results-read-write-300-SF) containing the results for all 3 runs is attached.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

results-read-write-300-SF.xlsx
Description: MS-Excel 2007 spreadsheet
Re: [HACKERS] pageinspect: Hash index support
Thanks for reporting it. This is because of incorrect data typecasting. Attached is the patch that fixes this issue.

On Tue, Feb 21, 2017 at 2:58 PM, Mithun Cy wrote:
> On Fri, Feb 10, 2017 at 1:06 AM, Robert Haas wrote:
>
> > Alright, committed with a little further hacking.
>
> I did pull the latest code, and tried
> Test:
>
> create table t1(t int);
> create index i1 on t1 using hash(t);
> insert into t1 select generate_series(1, 1000);
>
> postgres=# SELECT spares FROM hash_metapage_info(get_raw_page('i1', 0));
>   spares
> {0,0,0,1,9,17,33,65,-127,1,1,0,1,-1,-1,-4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
>
> spares are showing negative numbers; I think the wrong type has been
> chosen, seems it is rounding at 127, spares are defined as following
> uint32 hashm_spares[HASH_MAX_SPLITPOINTS]; /* spare pages before each
> splitpoint */
>
> it should be always positive.
>
> --
> Thanks and Regards
> Mithun C Y
> EnterpriseDB: http://www.enterprisedb.com

typecast_fix.patch
Description: invalid/octet-stream
Re: [HACKERS] Should we cacheline align PGXACT?
On Tue, Feb 21, 2017 at 4:21 PM, Alexander Korotkov wrote:
>
> Hi, Ashutosh!
>
> On Mon, Feb 20, 2017 at 1:20 PM, Ashutosh Sharma wrote:
>>
>> Following are the pgbench results for read-write workload, I got with
>> pgxact-align-3 patch. The results are for 300 scale factor with 8GB of
>> shared buffer i.e. when data fits into the shared buffer. For 1000 scale
>> factor with 8GB shared buffer the test is still running, once it is
>> completed I will share the results for that as well.
>>
>> pgbench settings:
>> pgbench -i -s 300 postgres
>> pgbench -M prepared -c $thread -j $thread -T $time_for_reading postgres
>>
>> where, time_for_reading = 30mins
>>
>> non default GUC param:
>> shared_buffers=8GB
>> max_connections=300
>>
>> pg_xlog is located in SSD.
>
> Thank you for testing.
> It seems that there is still regression. While padding was reduced from 116
> bytes to 4 bytes, it makes me think that probably there is something wrong
> in testing methodology.
> Are you doing re-initdb and pgbench -i before each run? I would ask you to
> do the both.
> Also, if regression would still exist, let's change the order of versions.
> Thus, if you run master before patched version, I would ask you to run
> patched version before master.

Yes, there is still some regression; however, it has come down by a small margin. I am not doing initdb for each run; instead I am doing dropdb --> createdb --> pgbench -i. Is dropping the old database and creating a new one for every run not okay? Do I have to do initdb every time? Okay, I can change the order of reading and let you know the results.
Re: [HACKERS] Should we cacheline align PGXACT?
On Tue, Feb 21, 2017 at 5:52 PM, Alexander Korotkov wrote:
> On Tue, Feb 21, 2017 at 2:37 PM, Andres Freund wrote:
>>
>> On 2017-02-21 16:57:36 +0530, Ashutosh Sharma wrote:
>> > Yes, there is still some regression however it has come down by a
>> > small margin. I am not doing initdb for each run instead I am doing,
>> > dropdb-->createdb-->pgbench -i. Is dropping old database and creating
>> > a new one for every run not okay, Do I have to do initdb every time.
>> > Okay, I can change the order of reading and let you know the results.
>>
>> That does make a difference. Primarily because WAL writes in a new
>> cluster are more expensive than in an old one, because of segment
>> recycling.
>
> +1

Okay, understood. I will take care of it when I start the next round of testing. Thanks.
Re: [HACKERS] Should we cacheline align PGXACT?
Okay. As suggested by Alexander, I have changed the order of reading and am doing initdb for each pgbench run. With these changes, I got the following results at scale factor 300 with 8GB of shared buffers.

pgbench settings:
pgbench -i -s 300 postgres
pgbench -M prepared -c $thread -j $thread -T $time_for_reading postgres

where, time_for_reading = 30mins

non-default GUC params:
shared_buffers=8GB
max_connections=300

pg_xlog is located on SSD.

CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
4              2803         2843           1.427042455
8              5315         5225          -1.69332079
32             19755        19669         -0.4353328271
64             28679        27980         -2.437323477
128            28887        28008         -3.042891266
156            27465        26728         -2.683415256
180            27425        26697         -2.654512306
196            28826        27396         -4.960799278
256            29787        28107         -5.640044315

The test machine details are as follows:

Machine details:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7-8830 @ 2.13GHz

Also, an Excel sheet (results-readwrite-300-SF) containing the results for all 3 runs is attached.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Thu, Feb 23, 2017 at 2:44 AM, Alvaro Herrera wrote:
> I wonder if this "perf c2c" tool with Linux 4.10 might be useful in
> studying this problem.
> https://joemario.github.io/blog/2016/09/01/c2c-blog/
>
> --
> Álvaro Herrera    https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

results-readwrite-300-SF.xlsx
Description: MS-Excel 2007 spreadsheet
Re: [HACKERS] Should we cacheline align PGXACT?
On Fri, Feb 24, 2017 at 10:41 AM, Simon Riggs wrote:
> On 24 February 2017 at 04:41, Ashutosh Sharma wrote:
>>
>> Okay. As suggested by Alexander, I have changed the order of reading and
>> doing initdb for each pgbench run. With these changes, I got following
>> results at 300 scale factor with 8GB of shared buffer.
>
> Would you be able to test my patch also please?

Sure, I can do that and share the results by early next week. Thanks.
Re: [HACKERS] Should we cacheline align PGXACT?
Hi,

On Fri, Feb 24, 2017 at 12:22 PM, Ashutosh Sharma wrote:
> On Fri, Feb 24, 2017 at 10:41 AM, Simon Riggs wrote:
>> On 24 February 2017 at 04:41, Ashutosh Sharma wrote:
>>>
>>> Okay. As suggested by Alexander, I have changed the order of reading and
>>> doing initdb for each pgbench run. With these changes, I got following
>>> results at 300 scale factor with 8GB of shared buffer.
>>
>> Would you be able to test my patch also please?
>
> Sure, I can do that and share the results by early next week. Thanks.

So, here are the pgbench results I got with 'reduce_pgxact_access_AtEOXact.v2.patch' on a read-write workload.

pgbench settings:
pgbench -i -s 300 postgres
pgbench -M prepared -c $thread -j $thread -T $time_for_reading postgres

where, time_for_reading = 30mins

non-default GUC params:
shared_buffers=8GB
max_connections=300

pg_wal is located on SSD.

CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
4              2588         2601          0.5023183926
8              5094         5098          0.0785237534
16             10294        10307         0.1262871576
32             19779        19815         0.182011224
64             27908        28346         1.569442454
72             27823        28416         2.131330194
128            28455        28618         0.5728342998
180            26739        26879         0.5235797898
196            27820        27963         0.5140186916
256            28763        28969         0.7161978931

Also, an Excel sheet (results-readwrite-300-SF.xlsx) containing the results for all 3 runs is attached.

results-readwrite-300-SF_EOXACT.xlsx
Description: MS-Excel 2007 spreadsheet
Re: [HACKERS] Should we cacheline align PGXACT?
On Tue, Feb 28, 2017 at 11:44 PM, Simon Riggs wrote:
> On 28 February 2017 at 11:34, Ashutosh Sharma wrote:
>
>> So, Here are the pgbench results I got with
>> 'reduce_pgxact_access_AtEOXact.v2.patch' on a read-write workload.
>
> Thanks for performing a test.
>
> I see a low yet noticeable performance gain across the board on that
> workload.
>
> That is quite surprising to see a gain on that workload. The main workload
> we have been discussing was the full read-only test (-S). For that case the
> effect should be much more noticeable based upon Andres' earlier comments.
>
> Would it be possible to re-run the test using only the -S workload? Thanks
> very much.

Okay, I already had the results for the read-only workload but just forgot to share them along with the results for the read-write test. Here are the results for the read-only test:

CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
4              26322        31259         18.75617354
8              63499        69472          9.406447346
16             181155       186534         2.96928045
32             333504       337533         1.208081462
64             350341       353747         0.9721956608
72             366339       373898         2.063389374
128            443381       478562         7.934710779
180            299875       334118        11.41909129
196            269194       275525         2.351835479
256            220027       235995         7.257291151

The pgbench settings and non-default params are:

pgbench -i -s 300 postgres
pgbench -M prepared -c $thread -j $thread -T $time_for_reading -S postgres

where, time_for_reading = 10mins

non-default params:
shared_buffers=8GB
max_connections=300

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

>> pgbench settings:
>> pgbench -i -s 300 postgres
>> pgbench -M prepared -c $thread -j $thread -T $time_for_reading postgres
>>
>> where, time_for_reading = 30mins
>>
>> non default GUC param
>> shared_buffers=8GB
>> max_connections=300
>>
>> pg_wal is located in SSD.
>>
>> CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
>> 4              2588         2601          0.5023183926
>> 8              5094         5098          0.0785237534
>> 16             10294        10307         0.1262871576
>> 32             19779        19815         0.182011224
>> 64             27908        28346         1.569442454
>> 72             27823        28416         2.131330194
>> 128            28455        28618         0.5728342998
>> 180            26739        26879         0.5235797898
>> 196            27820        27963         0.5140186916
>> 256            28763        28969         0.7161978931
>>
>> Also, Excel sheet (results-readwrite-300-SF.xlsx) containing the results
>> for all the 3 runs is attached.

> --
> Simon Riggs    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Should we cacheline align PGXACT?
On Wed, Mar 1, 2017 at 5:29 PM, Simon Riggs wrote:
>
> On 1 March 2017 at 04:50, Ashutosh Sharma wrote:
>>
>> On Tue, Feb 28, 2017 at 11:44 PM, Simon Riggs wrote:
>>>
>>> On 28 February 2017 at 11:34, Ashutosh Sharma wrote:
>>>>
>>>> So, Here are the pgbench results I got with
>>>> 'reduce_pgxact_access_AtEOXact.v2.patch' on a read-write workload.
>>>
>>> Thanks for performing a test.
>>>
>>> I see a low yet noticeable performance gain across the board on that
>>> workload.
>>>
>>> That is quite surprising to see a gain on that workload. The main
>>> workload we have been discussing was the full read-only test (-S). For
>>> that case the effect should be much more noticeable based upon Andres'
>>> earlier comments.
>>>
>>> Would it be possible to re-run the test using only the -S workload?
>>> Thanks very much.
>>
>> Okay, I already had the results for read-only workload but just forgot to
>> share it along with the results for read-write test. Here are the results
>> for read-only test,
>
> Thanks for conducting those comparisons.
>
> So we have variable and sometimes significant gains, with no regressions.
>
> By the shape of the results this helps in different places to the alignment
> patch. Do we have evidence to commit both?

Well, we have seen some regression in the read-write test with the pgxact alignment patch [1]. I may need to run the test with both patches applied to see their combined effect on performance.

[1] - https://www.postgresql.org/message-id/CAE9k0Pk%2BrCuNY%2B7O5XwVXHPuki9t8%3DM7jr4kevxw-hdkpFhS2A%40mail.gmail.com
[HACKERS] Parallel seq. plan is not coming against inheritance or partition table
Hi All,

From the following git commit onwards, a parallel seq scan is never getting selected for inheritance or partitioned tables.

commit 51ee6f3160d2e1515ed6197594bda67eb99dc2cc
Author: Robert Haas
Date: Wed Feb 15 13:37:24 2017 -0500

    Replace min_parallel_relation_size with two new GUCs.

Steps to reproduce:
==================
create table t1 (a integer);
create table t1_1 (check (a >= 1 and a < 100)) inherits (t1);
create table t1_2 (check (a >= 100 and a < 200)) inherits (t1);
insert into t1_1 select generate_series(1, 90);
insert into t1_2 select generate_series(100, 190);
analyze t1;
analyze t1_1;
analyze t1_2;
explain analyze select * from t1 where a < 5 OR a > 195;

EXPLAIN ANALYZE output:

1) Prior to the "Replace min_parallel_relation_size with two new GUCs" commit:

postgres=# explain analyze select * from t1 where a < 5 OR a > 195;
                                  QUERY PLAN
---------------------------------------------------------------------------
 Gather  (cost=1000.00..25094.71 rows=48787 width=4) (actual time=0.431..184.264 rows=4 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Append  (cost=0.00..19216.01 rows=20328 width=4) (actual time=0.025..167.465 rows=1 loops=3)
         ->  Parallel Seq Scan on t1  (cost=0.00..0.00 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=3)
               Filter: ((a < 5) OR (a > 195))
         ->  Parallel Seq Scan on t1_1  (cost=0.00..9608.00 rows=20252 width=4) (actual time=0.023..76.644 rows=1 loops=3)
               Filter: ((a < 5) OR (a > 195))
               Rows Removed by Filter: 283334
         ->  Parallel Seq Scan on t1_2  (cost=0.00..9608.01 rows=75 width=4) (actual time=89.505..89.505 rows=0 loops=3)
               Filter: ((a < 5) OR (a > 195))
               Rows Removed by Filter: 30
 Planning time: 0.343 ms
 Execution time: 188.624 ms
(14 rows)

2) From the "Replace min_parallel_relation_size with two new GUCs" commit onwards:

postgres=# explain analyze select * from t1 where a < 5 OR a > 195;
                                  QUERY PLAN
---------------------------------------------------------------------------
 Append  (cost=0.00..34966.01 rows=50546 width=4) (actual time=0.021..375.747 rows=4 loops=1)
   ->  Seq Scan on t1  (cost=0.00..0.00 rows=1 width=4) (actual time=0.004..0.004 rows=0 loops=1)
         Filter: ((a < 5) OR (a > 195))
   ->  Seq Scan on t1_1  (cost=0.00..17483.00 rows=50365 width=4) (actual time=0.016..198.393 rows=4 loops=1)
         Filter: ((a < 5) OR (a > 195))
         Rows Removed by Filter: 850001
   ->  Seq Scan on t1_2  (cost=0.00..17483.01 rows=180 width=4) (actual time=173.310..173.310 rows=0 loops=1)
         Filter: ((a < 5) OR (a > 195))
         Rows Removed by Filter: 91
 Planning time: 0.812 ms
 Execution time: 377.831 ms
(11 rows)

RCA:
====
From the "Replace min_parallel_relation_size with two new GUCs" commit onwards, we are not assigning parallel workers to a child rel with zero heap pages. This means we won't be able to create a partial append path, as that requires all the child rels within an Append node to have a partial path. Please check the following code snippet from set_append_rel_pathlist():

    /* Same idea, but for a partial plan. */
    if (childrel->partial_pathlist != NIL)
        partial_subpaths = accumulate_append_subpath(partial_subpaths,
                                  linitial(childrel->partial_pathlist));
    else
        partial_subpaths_valid = false;
    ...

    /*
     * Consider an append of partial unordered, unparameterized partial
     * paths.
     */
    if (partial_subpaths_valid)
    {
        ...
        /* Generate a partial append path. */
        appendpath = create_append_path(rel, partial_subpaths, NULL,
                                        parallel_workers);
        add_partial_path(rel, (Path *) appendpath);
    }

In short, no Gather path is generated if a baserel having an Append node contains any child rel without a partial path. This means that just because one child rel with zero heap pages didn't get parallel workers, the other child rels that were good candidates for a Parallel Seq Scan had to fall back to a normal seq scan, which could be costlier.

Fix:
====
Attached is the patch that fixes this issue.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

assign_par_workers_for_empty_childRel_v1.patch
Description: invalid/octet-stream
Re: [HACKERS] Parallel seq. plan is not coming against inheritance or partition table
> Right, but OTOH, if we assign parallel workers by default, then it is
> quite possible that it would result in much worse plans. Consider a
> case where partition hierarchy has 1000 partitions and only one of
> them is big enough to allow parallel workers. Now in this case, with
> your proposed fix it will try to scan all the partitions in parallel
> workers which I think can easily result in bad performance.

Right. But there can also be a case where 999 partitions are large and eligible for a parallel seq scan (PSS). In such a case as well, PSS won't be selected.

> I think the right way to make such plans parallel is by using Parallel
> Append node (https://commitfest.postgresql.org/13/987/). Alternatively, if
> you want to force parallelism in cases like the one you have shown in
> example, you can use Alter Table .. Set (parallel_workers = 1).

Okay, I was not aware of Parallel Append. Thanks.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel seq. plan is not coming against inheritance or partition table
> I also think that commit didn't intend to change the behavior,
> however, the point is how sensible is it to keep such behavior after
> Parallel Append. I am not sure if this is the right time to consider
> it or shall we wait till Parallel Append is committed.
>
>> I think the problem here is that compute_parallel_worker() thinks it
>> can use 0 as a sentinel value that means "ignore this", but it can't
>> really, because a heap can actually contain 0 pages. Here's a patch
>> which does the following:
>
> - if (heap_pages < (BlockNumber) min_parallel_table_scan_size &&
> - index_pages < (BlockNumber) min_parallel_index_scan_size &&
> - rel->reloptkind == RELOPT_BASEREL)
> + if (rel->reloptkind == RELOPT_BASEREL &&
> + ((heap_pages >= 0 && heap_pages < min_parallel_table_scan_size) ||
> + (index_pages >= 0 && index_pages < min_parallel_index_scan_size)))
> return 0;
>
> The purpose of considering both heap and index pages for
> min_parallel_* is that even if one of them is bigger than the threshold
> then we should try parallelism.

Yes, but this is only true for normal tables, not for partitioned or inheritance tables. I think for a partition table, even if only one heap page exists, we go for parallelism.

> This could be helpful for parallel
> index scans where we consider parallel workers based on both heap and
> index pages. Is there a reason for you to change that condition as
> that is not even related to the problem being discussed?

I think he has changed it to allow parallelism for inheritance or partition tables. For normal tables, the behavior won't change until the below if-condition is satisfied:

if (heap_pages < (BlockNumber) min_parallel_table_scan_size &&
    index_pages < (BlockNumber) min_parallel_index_scan_size &&
    rel->reloptkind == RELOPT_BASEREL)
    return 0;

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] increasing the default WAL segment size
Hi,

I took a look at this patch. Overall, the patch looks good to me. However, there are some review comments that I would like to share:

1. I think the macro 'PATH_MAX' used in the pg_waldump.c file is platform-specific and not available everywhere (e.g. on Windows). It needs to be changed to some constant value, or maybe MAXPGPATH, in order to make it platform-independent.

2. As already mentioned by Jim and Kuntal upthread, you are trying to detect the configured WAL segment size in the pg_waldump.c and pg_standby.c files based on the size of a random WAL file, which doesn't look like a good idea. But then I think the only option we have is to pass the location of the pg_control file to the pg_waldump module along with the start and end WAL segments.

3. When trying to compile '02-initdb-walsegsize-v2.patch' on Windows, I got this warning message:

Warning 1: warning C4005: 'DEFAULT_XLOG_SEG_SIZE': macro redefinition
c:\users\ashu\postgresql\src\include\pg_config_manual.h(20)

Apart from these, I am not having any comments as of now. I am still validating the patch on Windows. If I find any issues, I will update you.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Tue, Feb 28, 2017 at 10:36 AM, Beena Emerson wrote:
> On Tue, Feb 28, 2017 at 9:45 AM, Jim Nasby wrote:
>> On 2/24/17 6:30 AM, Kuntal Ghosh wrote:
>>> * You're considering any WAL file with a power of 2 as valid. Suppose,
>>> the correct WAL seg size is 64mb. For some reason, the server
>>> generated a 16mb invalid WAL file (maybe it crashed while creating the
>>> WAL file). Your code seems to treat this as a valid file which I think
>>> is incorrect. Do you agree with that?
>>
>> Detecting correct WAL size based on the size of a random WAL file seems
>> like a really bad idea to me.
>>
>> I also don't see the reason for #2... or is that how initdb writes out
>> the correct control file?
>
> The initdb module reads the size from the option provided and sets the
> environment variable.
> This variable is read in src/backend/access/transam/xlog.c and the
> ControlFile written.
> Unlike pg_resetwal and pg_rewind, pg_basebackup cannot access the Control
> file. It only accesses the wal log folder. So we get the XLogSegSize from
> the SHOW command using replication connection.
> As Kuntal pointed out, I might need to set it from pg_receivewal.c as well.
>
> Thank you,
>
> Beena Emerson
>
> EnterpriseDB: https://www.enterprisedb.com/
> The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel seq. plan is not coming against inheritance or partition table
On Tue, Mar 7, 2017 at 11:18 AM, Amit Kapila wrote:
> On Tue, Mar 7, 2017 at 10:22 AM, Ashutosh Sharma wrote:
>>> I also think that commit didn't intend to change the behavior,
>>> however, the point is how sensible is it to keep such behavior after
>>> Parallel Append. I am not sure if this is the right time to consider
>>> it or shall we wait till Parallel Append is committed.
>>>
>>>> I think the problem here is that compute_parallel_worker() thinks it
>>>> can use 0 as a sentinel value that means "ignore this", but it can't
>>>> really, because a heap can actually contain 0 pages. Here's a patch
>>>> which does the following:
>>>
>>> - if (heap_pages < (BlockNumber) min_parallel_table_scan_size &&
>>> - index_pages < (BlockNumber) min_parallel_index_scan_size &&
>>> - rel->reloptkind == RELOPT_BASEREL)
>>> + if (rel->reloptkind == RELOPT_BASEREL &&
>>> + ((heap_pages >= 0 && heap_pages < min_parallel_table_scan_size) ||
>>> + (index_pages >= 0 && index_pages < min_parallel_index_scan_size)))
>>> return 0;
>>>
>>> The purpose of considering both heap and index pages for
>>> min_parallel_* is that even if one of them is bigger than threshold
>>> then we should try parallelism.
>>
>> Yes, But, this is only true for normal tables not for partitioned or
>> inheritance tables. I think for partition table, even if one heap page
>> exist, we go for parallelism.
>>
>>> This could be helpful for parallel
>>> index scans where we consider parallel workers based on both heap and
>>> index pages. Is there a reason for you to change that condition as
>>> that is not even related to the problem being discussed?
>>
>> I think he has changed to allow parallelism for inheritance or partition
>> tables. For normal tables, it won't be touched until the below
>> if-condition is satisfied.
>>
>> if (heap_pages < (BlockNumber) min_parallel_table_scan_size &&
>>     index_pages < (BlockNumber) min_parallel_index_scan_size &&
>>     rel->reloptkind == RELOPT_BASEREL)
>>     return 0;
>
> AFAICS, the patch has changed the if-condition you are quoting.

Yes, it has slightly changed the if-condition, or I would say it has corrected it. With the new condition in the patch, we first check whether reloptkind is of the BASEREL type before checking whether there are enough heap or index pages to make parallelism worthwhile. For a partition table, 'rel->reloptkind == RELOPT_BASEREL' fails, hence the further conditions won't be checked. I think the most important thing the patch fixes is passing index_pages as '-1' to compute_parallel_worker(), since 'min_parallel_index_scan_size' can be set to zero by the user.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Supporting huge pages on Windows
Hi,

I tried to test the v8 version of the patch. Firstly, I was able to start the postgresql server process with 'huge_pages' set to on. I had to follow the instructions given in MSDN [1] to enable the 'lock pages in memory' option and also had to start the postgresql server process as an admin user.

test=# show huge_pages;
 huge_pages
------------
 on
(1 row)

To start with, I ran the regression test suite and didn't find any failures. However, I am not sure whether huge pages are actually getting used; the huge_pages setting simply shows 'on'. I am assuming that if huge pages could not be used due to a shortage of large pages, the server would have fallen back to non-huge pages.

I also ran pgbench tests on a read-only workload, and here are the results I got:

pgbench -c 4 -j 4 -T 600 bench

huge_pages=on,  TPS = 21120.768085
huge_pages=off, TPS = 20606.288995

[1] - https://msdn.microsoft.com/en-IN/library/ms190730.aspx

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Thu, Feb 23, 2017 at 12:59 PM, Tsunakawa, Takayuki wrote:
> From: Amit Kapila [mailto:amit.kapil...@gmail.com]
>> > Hmm, the large-page requires contiguous memory for each page, so this
>> > error could occur on a long-running system where the memory is heavily
>> > fragmented. For example, please see the following page and check the
>> > memory with RAMMap program referred there.
>>
>> I don't have RAMMap and it might take some time to investigate what is
>> going on, but I think in such a case even if it works we should keep the
>> default value of huge_pages as off on Windows. I request somebody else
>> having access to Windows m/c to test this patch and if it works then we
>> can move forward.
>
> You are right. I modified the patch so that the code falls back to the
> non-huge page when CreateFileMapping() fails due to the shortage of large
> pages. That's what the Linux version does.
> The other change is to parameterize the Win32 function names in the
> messages in EnableLockPagePrivileges(). This is to avoid adding almost
> identical messages unnecessarily. I followed Alvaro's comment. I didn't
> touch the two existing sites that embed Win32 function names. I'd like to
> leave it up to the committer to decide whether to change them as well,
> because changing them might make it a bit harder to apply some bug fixes
> to earlier releases.
>
> FYI, I could reproduce the same error as Amit on 32-bit Win7, where the
> total RAM is 3.5 GB and available RAM is 2 GB. I used the attached
> largepage.c. Immediately after the system boot, I could only allocate 8
> large pages. When I first tried to allocate 32 large pages, the test
> program produced:
>
> large page size = 2097152
> allocating 32 large pages...
> CreateFileMapping failed: error code = 1450
>
> You can build the test program as follows:
>
> cl largepage.c advapi32.lib
>
> Regards
> Takayuki Tsunakawa
Re: [HACKERS] Should we cacheline align PGXACT?
Hi,

> I agree. Probably Simon's patch for reducing pgxact access could negate
> regression in pgxact alignment patch.
> Ashutosh, could you please run read-write and read-only tests with both
> these patches applied?

I already had the results with both the patches applied. But, as I was not quite able to understand how Simon's patch for reducing pgxact access could negate the regression on the read-write workload that we saw with the pgxact-align-3 patch earlier, I was slightly hesitant to share the results. Anyway, here are the results with the combined patches on read-only and read-write workloads.

1) Results for the read-only workload:
======================================
pgbench -i -s 300 postgres
pgbench -M prepared -c $thread -j $thread -T $time_for_reading -S postgres

where, time_for_reading = 10mins

non-default params:
shared_buffers=8GB
max_connections=300

CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
4              36333        36835           1.381664052
8              70179        72496           3.301557446
16             169303       175743          3.803831001
32             328837       341093          3.727074508
64             363352       399847         10.04397939
72             372011       413437         11.13569222
128            443979       578355         30.26629638
180            321420       552373         71.85396055
196            276780       558896        101.927885
256            234563       568951        142.5578629

2) Results for the read-write workload:
=======================================
pgbench -i -s 300 postgres
pgbench -M prepared -c $thread -j $thread -T $time_for_reading postgres

where, time_for_reading = 30mins

non-default params:
shared_buffers=8GB
max_connections=300

CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
4              2683         2690          0.2609019754
8              5321         5332          0.2067280586
16             10348        10387         0.3768844221
32             19446        19754         1.58387329
64             28178        28198         0.0709773582
72             28296        28639         1.212185468
128            28577        28600         0.0804843056
180            26665        27525         3.225201575
196            27628        28511         3.19603301
256            28467        28529         0.2177960445

HEAD here refers to the following git commit in the master branch:

commit 5dbdb2f799232cb1b6df7d7a85d59ade3234d30c
Author: Robert Haas
Date: Fri Feb 24 12:21:46 2017 +0530

    Make tablesample work with partitioned tables.
This was an oversight in the original partitioning commit. Amit Langote, reviewed by David Fetter -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: [HACKERS] Page Scan Mode in Hash Index
Hi, > > I've assigned to review this patch. > At first, I'd like to notice that I like idea and general design. > Secondly, patch set don't apply cleanly to master. Please, rebase it. Thanks for showing your interest in this patch. Please note that this patch depends on the patches for 'Write Ahead Logging in hash index' [1] and 'Microvacuum support in hash index' [2]. Hence, until the above two patches become stable, I may have to keep rebasing this patch. However, I will try to share the updated patch with you as soon as possible. > > > On Tue, Feb 14, 2017 at 8:27 AM, Ashutosh Sharma > wrote: >> >> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this >> patch rewrites the hash index scan module to work in page-at-a-time >> mode. It basically introduces two new functions-- _hash_readpage() and >> _hash_saveitem(). The former is used to load all the qualifying tuples >> from a target bucket or overflow page into an items array. The latter >> one is used by _hash_readpage to save all the qualifying tuples found >> in a page into an items array. Apart from that, this patch basically >> cleans _hash_first(), _hash_next() and hashgettuple(). > > > I see that forward and backward scan cases of function _hash_readpage contain > a lot of code duplication > Could you please refactor this function to have less code duplication? Sure, I will try to avoid the code duplication as much as possible. > > Also, I wonder if you have a special idea behind inserting data in test.sql > by 1002 separate SQL statements. > INSERT INTO con_hash_index_table (keycol) SELECT a FROM GENERATE_SERIES(1, > 1000) a; > > You can achieve the same result by execution of single SQL statement. > INSERT INTO con_hash_index_table (keycol) SELECT (a - 1) % 1000 + 1 FROM > GENERATE_SERIES(1, 1002000) a; > Unless you have some special idea of doing this in 1002 separate transactions. There is no reason for having so many INSERT statements in the test.sql file.
I think it would be better to replace it with a single SQL statement. Thanks. [1]- https://www.postgresql.org/message-id/CAA4eK1KibVzgVETVay0%2BsiVEgzaXnP5R21BdWiK9kg9wx2E40Q%40mail.gmail.com [2]- https://www.postgresql.org/message-id/CAE9k0PkRSyzx8dOnokEpUi2A-RFZK72WN0h9DEMv_ut9q6bPRw%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: [HACKERS] PATCH: pageinspect / add page_checksum and bt_page_items(bytea)
Hi, I had a look into this patch and would like to share some of my review comments that require the author's attention. 1) The comment for page_checksum() needs to be corrected. It seems to have been copied from page_header() and not edited further. /* * page_header * * Allows inspection of page header fields of a raw page */ PG_FUNCTION_INFO_V1(page_checksum); Datum page_checksum(PG_FUNCTION_ARGS) 2) It seems like you have chosen the wrong datatype for page_checksum. I am getting a negative checksum value when trying to run the query below. I think the return type of the SQL function page_checksum should be 'integer' instead of 'smallint'. postgres=# SELECT page_checksum(get_raw_page('pg_class', 0), 100); page_checksum --- -19532 (1 row) The above problem also exists in the case of page_header; I am facing a similar problem with page_header as well. postgres=# SELECT page_header(get_raw_page('pg_class', 0)); page_header - (0/2891538,-28949,1,220,7216,8192,8192,4,0) (1 row) 3) Any reason for not marking bt_page_items or page_checksum as PARALLEL_SAFE? 4) Error messages in the new bt_page_items are not consistent with the old bt_page_items. For example, if I pass a meta page block number as input to bt_page_items, the error messages thrown by the new and old bt_page_items are different. postgres=# SELECT * FROM bt_page_items('btree_index', 0) LIMIT 1; ERROR: block 0 is a meta page postgres=# SELECT * FROM bt_page_items(get_raw_page('btree_index', 0)) LIMIT 1; ERROR: block is a meta page postgres=# SELECT * FROM bt_page_items(get_raw_page('btree_index', 1024)) LIMIT 1; ERROR: block number 1024 is out of range for relation "btree_index" postgres=# SELECT * FROM bt_page_items('btree_index', 1024) LIMIT 1; ERROR: block number out of range 5) Code duplication between bt_page_items() and bt_page_items_bytea() needs to be handled.
-- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Tue, Mar 7, 2017 at 9:39 PM, Peter Eisentraut wrote: > On 3/6/17 16:33, Tomas Vondra wrote: >>> I think it would be better not to maintain so much duplicate code >>> between bt_page_items(text, int) and bt_page_items(bytea). How about >>> just redefining bt_page_items(text, int) as an SQL-language function >>> calling bt_page_items(get_raw_page($1, $2))? >>> >> >> Maybe. We can also probably share the code at the C level, so I'll look >> into that. > > I think SQL would be easier, but either way some reduction in > duplication would be good. > >>> For page_checksum(), the checksums are internally unsigned, so maybe >>> return type int on the SQL level, so that there is no confusion about >>> the sign? >>> >> >> That ship already sailed, I'm afraid. We already have checksum in the >> page_header() output, and it's defined as smallint there. So using int >> here would be inconsistent. > > OK, no worries then. > > -- > Peter Eisentraut http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WAL Consistency checking for hash indexes
Couple of review comments,, You may also need to update the documentation as now we are also going to support wal consistency check for hash index. The current documentation does not include hash index. +only records originating from those resource managers. Currently, +the supported resource managers are heap, +heap2, btree, gin, +gist, sequence, spgist, +brin, and generic. Only Following comment in hash_mask() may require changes if patch for 'Microvacuum support for Hash Index - [1]' gets committed. + /* +* In hash bucket and overflow pages, it is possible to modify the +* LP_FLAGS without emitting any WAL record. Hence, mask the line +* pointer flags. +* See hashgettuple() for details. +*/ [1] - https://www.postgresql.org/message-id/CAE9k0PmXyQpHX8%3DL_hFV7HfPV8qrit19xoUB86waQ87rKYzmYQ%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Wed, Mar 8, 2017 at 2:32 PM, Kuntal Ghosh wrote: > On Fri, Mar 3, 2017 at 9:44 AM, Amit Kapila wrote: >> On Tue, Feb 28, 2017 at 11:06 AM, Kuntal Ghosh >> wrote: >>> Hello everyone, >>> >>> I've attached a patch which implements WAL consistency checking for >>> hash indexes. This feature is going to be useful for developing and >>> testing of WAL logging for hash index. >>> >> >> 2. >> + else if ((opaque->hasho_flag & LH_BUCKET_PAGE) || >> + (opaque->hasho_flag & LH_OVERFLOW_PAGE)) >> + { >> + /* >> + * In btree bucket and overflow pages, it is possible to modify the >> + * LP_FLAGS without emitting any WAL record. Hence, mask the line >> + * pointer flags. >> + * See hashgettuple() for details. >> + */ >> + mask_lp_flags(page); >> + } >> >> Again, this mechanism is also modified by patch "Microvacuum support >> for hash index", so above changes needs to be adjusted accordingly. >> Comment referring to btree is wrong, you need to refer hash. > I've corrected the text in the comment and re-based the patch on the > latest hash index patch for WAL logging[1]. 
As discussed in the > thread, Microvacuum patch can be re-based on top of this patch. > > > [1] > https://www.postgresql.org/message-id/CAA4eK1%2BmvCucroWQwX3S7aBR%3D0yBJGF%2BjQz4x4Cx9QVsMFTZUw%40mail.gmail.com > -- > Thanks & Regards, > Kuntal Ghosh > EnterpriseDB: http://www.enterprisedb.com > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: pageinspect / add page_checksum and bt_page_items(bytea)
Hi, > >> 2) It seems like you have chosen the wrong datatype for page_checksum. I >> am getting a negative checksum value when trying to run the query below. I >> think the return type for the SQL function page_checksum should be >> 'integer' instead of 'smallint'. >> >> postgres=# SELECT page_checksum(get_raw_page('pg_class', 0), 100); >> page_checksum >> --- >> -19532 >> (1 row) >> >> The above problem also exists in the case of page_header. I am facing a similar >> problem with page_header as well. >> >> postgres=# SELECT page_header(get_raw_page('pg_class', 0)); >> page_header >> - >> (0/2891538,-28949,1,220,7216,8192,8192,4,0) >> (1 row) >> > > No. This is consistent with page_header() which is also using smallint for > the checksum value. > Yes. But, as I said earlier, I am getting a negative checksum value for page_header as well. Isn't that wrong? For example, when I debug the following query, I can see the pd_checksum value as '40074' in gdb, whereas page_header shows it as '-25462'. SELECT page_header(get_raw_page('pg_class', 0)); (gdb) p page->pd_checksum $2 = 40074 postgres=# SELECT page_header(get_raw_page('pg_class', 0)); page_header - (0/304EDE0,-25462,1,220,7432,8192,8192,4,0) (1 row) I think pd_checksum in PageHeaderData is defined as uint16 (0 to 65,535) whereas in the SQL function for page_header it is defined as smallint (-32768 to +32767). -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: [HACKERS] Microvacuum support for Hash Index
Hi, Attached is the v6 patch for microvacuum in hash index rebased on top of 'v10 patch for WAL in hash index - [1]' and 'v1 patch for WAL consistency check for hash index - [2]'. [1] - https://www.postgresql.org/message-id/CAA4eK1%2Bk5wR4-kAjPqLoKemuHayQd6RkQQT9gheTfpn%2B72o1UA%40mail.gmail.com [2] - https://www.postgresql.org/message-id/flat/cagz5qcjlerun_zoo0edv6_y_d0o4tntmper7ivtlbg4rurj...@mail.gmail.com#cagz5qcjlerun_zoo0edv6_y_d0o4tntmper7ivtlbg4rurj...@mail.gmail.com Also, the patch (mask_hint_bit_LH_PAGE_HAS_DEAD_TUPLES.patch) to mask 'LH_PAGE_HAS_DEAD_TUPLES' flag which got added as a part of Microvacuum patch is attached with this mail. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Wed, Feb 1, 2017 at 10:30 AM, Michael Paquier wrote: > On Sat, Jan 28, 2017 at 8:02 PM, Amit Kapila wrote: >> On Fri, Jan 27, 2017 at 5:15 PM, Ashutosh Sharma >> wrote: >>>> >>>> Don't you think we should try to identify the reason of the deadlock >>>> error reported by you up thread [1]? I know that you and Ashutosh are >>>> not able to reproduce it, but still I feel some investigation is >>>> required to find the reason. It is quite possible that the test case >>>> is such that the deadlock is expected in rare cases, if that is the >>>> case then it is okay. I have not spent enough time on that to comment >>>> whether it is a test or code issue. >>> >>> I am finally able to reproduce the issue using the attached script >>> file (deadlock_report). Basically, once I was able to reproduce the >>> issue with hash index I also thought of checking it with a non unique >>> B-Tree index and was able to reproduce it with B-Tree index as well. >>> This certainly tells us that there is nothing wrong at the code level >>> rather there is something wrong with the test script which is causing >>> this deadlock issue. Well, I have ran pgbench with two different >>> configurations and my observations are as follows: >>> >>> 1) With Primary keys i.e. 
unique values: I have never encountered >>> a deadlock issue with this configuration. >>> >>> 2) With non-unique indexes (be it hash or B-Tree): I have seen >>> deadlock many times with this configuration. Basically when we have >>> non-unique values associated with a column then there is a high >>> probability that multiple records will get updated with a single >>> 'UPDATE' statement and when there are a huge number of backends trying >>> to do so there is a high chance of getting a deadlock, which I assume is >>> expected behaviour in a database. >>> >> >> I agree with your analysis, surely trying to update multiple rows with >> same values from different backends can lead to deadlock. > > Moved that to CF 2017-03. > -- > Michael

Attachments: microvacuum_hash_index_v6.patch, mask_hint_bit_LH_PAGE_HAS_DEAD_TUPLES.patch
Re: [HACKERS] Page Scan Mode in Hash Index
Hi, >> I've assigned to review this patch. >> At first, I'd like to notice that I like idea and general design. >> Secondly, patch set don't apply cleanly to master. Please, rebase it. > > > Thanks for showing your interest towards this patch. I would like to > inform that this patch has got dependency on patch for 'Write Ahead > Logging in hash index - [1]' and 'Microvacuum support in hash index > [1]'. Hence, until above two patches becomes stable I may have to keep > on rebasing this patch. However, I will try to share you the updated > patch asap. > >> >> >> On Tue, Feb 14, 2017 at 8:27 AM, Ashutosh Sharma >> wrote: >>> >>> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this >>> patch rewrites the hash index scan module to work in page-at-a-time >>> mode. It basically introduces two new functions-- _hash_readpage() and >>> _hash_saveitem(). The former is used to load all the qualifying tuples >>> from a target bucket or overflow page into an items array. The latter >>> one is used by _hash_readpage to save all the qualifying tuples found >>> in a page into an items array. Apart from that, this patch bascially >>> cleans _hash_first(), _hash_next and hashgettuple(). >> >> >> I see that forward and backward scan cases of function _hash_readpage >> contain a lot of code duplication >> Could you please refactor this function to have less code duplication? > > Sure, I will try to avoid the code duplication as much as possible. I had close look into hash_readpage() function and could see that there are only few if-else conditions which are similar for both forward and backward scan cases and can't be optimised further. However, If you have a cursory look into this function definition it looks like the code for forward and backward scan are exactly the same but that's not the case. Attached is the diff report (hash_readpage.html) for forward and backward scan code used in hash_readpage(). 
This shows which lines in hash_readpage() are the same and which differ. Please note that before applying the patches for page scan mode in hash index, you may need to first apply the following patches on HEAD: 1) v10 patch for WAL in hash index - [1] 2) v1 patch for WAL consistency check for hash index - [2] 3) v6 patch for microvacuum support in hash index - [3] [1] - https://www.postgresql.org/message-id/CAA4eK1%2Bk5wR4-kAjPqLoKemuHayQd6RkQQT9gheTfpn%2B72o1UA%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAGz5QCKPU2qX75B1bB_LuEC88xWZa5L55J0TLvYMVD8noSH3pA%40mail.gmail.com [3] - https://www.postgresql.org/message-id/CAE9k0PkYpAPDJBfgia08o7XhO8nypH9WoO9M8%3DdqLrwwObXKcw%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com

[Attachment: hash_readpage.html — a side-by-side comparison of the forward and backward scan code in hash_readpage()]
Re: [HACKERS] PATCH: pageinspect / add page_checksum and bt_page_items(bytea)
On Mar 14, 2017 5:37 PM, "Alvaro Herrera" wrote: Ashutosh Sharma wrote: > Yes. But, as I said earlier, I am getting a negative checksum value for > page_header as well. Isn't that wrong? For example, when I debug the > following query, I can see the pd_checksum value as '40074' in gdb whereas > page_header shows it as '-25462'. Yes; the point is that this is a pre-existing problem, not one being introduced with this patch. In SQL we don't have unsigned types, and we have historically chosen a type with the same width instead of one with enough width to represent all possible unsigned values. IMO this patch is doing the right thing; if in the future we want to change that decision, that's fine, but it's a separate patch. Okay, understood. Thank you. With Regards, Ashutosh Sharma
Re: [HACKERS] Microvacuum support for Hash Index
> Generally, this patch looks like a pretty straightforward adaptation > of the similar btree mechanism to hash indexes, so if it works for > btree it ought to work here, too. But I noticed a few things while > reading through it. > > +/* Get RelfileNode from relation OID */ > +rel = relation_open(htup->t_tableOid, NoLock); > +rnode = rel->rd_node; > +relation_close(rel, NoLock); > itup->t_tid = htup->t_self; > -_hash_doinsert(index, itup); > +_hash_doinsert(index, itup, rnode); > > This is an awfully low-level place to be doing something like this. > I'm not sure exactly where this should be happening, but not in the > per-tuple callback. Okay, Now I have done this inside _hash_doinsert() instead of callback function. Please have a look into the attached v7 patch. > > +/* > + * If there's nothing running on the standby we don't need to derive a > + * full latestRemovedXid value, so use a fast path out of here. This > + * returns InvalidTransactionId, and so will conflict with all HS > + * transactions; but since we just worked out that that's zero people, > + * it's OK. > + */ > +if (CountDBBackends(InvalidOid) == 0) > +return latestRemovedXid; > > I see that this comment (and most of what surrounds it) was just > copied from the existing btree example, but isn't there a discrepancy > between the comment and the code? It says it returns > InvalidTransactionId, but it doesn't. Also, you dropped the XXX from > the btree original, and the following reachedConsistency check. It does return InvalidTransactionId if there are no backends running across any database in the standby. As shown below 'latestRemovedXid' is initialised with InvalidTransactionId, TransactionId latestRemovedXid = InvalidTransactionId; So, If there are no backend processes running across any databases in standby latestRemovedXid will be returned as it is. I have also added the note in XXX in above comment. Please check v7 patch attached with this mail. 
> > + * Hash Index delete records can conflict with standby queries.You might > + * think that vacuum records would conflict as well, but we've handled > > But they're not called delete records in a hash index. The function > is called hash_xlog_vacuum_one_page. The corresponding btree function > is btree_xlog_delete. So this comment needs a little more updating. Okay, I have tried to rephrase it to avoid the confusion. > > +if (IsBufferCleanupOK(bucket_buf)) > +{ > +_hash_vacuum_one_page(rel, metabuf, buf, bucket_buf, > + (buf == bucket_buf) ? true : false, > + hnode); > +if (bucket_buf != buf) > +LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK); > + > +if (PageGetFreeSpace(page) >= itemsz) > +break; /* OK, now we have enough space */ > +} > > I might be missing something, but I don't quite see why this needs a > cleanup lock on the primary bucket page. I would think a cleanup lock > on the page we're actually modifying would suffice, and I think if > that's correct it would be a better way to go. If that's not correct, > then I think the comments needs some work. Thanks for that suggestion. I spent a lot of time thinking about this and also had a small discussion with Amit, but could not find any issue with taking the cleanup lock on the modified page instead of the primary bucket page. I had to make some decent code changes for this. The attached v7 patch has the changes. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com

Attachment: microvacuum_hash_index_v7.patch
Re: [HACKERS] Microvacuum support for Hash Index
> > I think one possibility is to get it using > indexrel->rd_index->indrelid in _hash_doinsert(). > Thanks. I have tried the same in the v7 patch shared upthread. > >> >> But they're not called delete records in a hash index. The function >> is called hash_xlog_vacuum_one_page. The corresponding btree function >> is btree_xlog_delete. So this comment needs a little more updating. >> >> +if (IsBufferCleanupOK(bucket_buf)) >> +{ >> +_hash_vacuum_one_page(rel, metabuf, buf, bucket_buf, >> + (buf == bucket_buf) ? true : false, >> + hnode); >> +if (bucket_buf != buf) >> +LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK); >> + >> +if (PageGetFreeSpace(page) >= itemsz) >> +break; /* OK, now we have enough space */ >> +} >> >> I might be missing something, but I don't quite see why this needs a >> cleanup lock on the primary bucket page. I would think a cleanup lock >> on the page we're actually modifying would suffice, and I think if >> that's correct it would be a better way to go. >> > > Offhand, I also don't see any problem with it. I too found no problem with that... > > Few other comments: > 1. > + if (ndeletable > 0) > + { > + /* No ereport(ERROR) until changes are logged */ > + START_CRIT_SECTION(); > + > + PageIndexMultiDelete(page, deletable, ndeletable); > + > + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); > + pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES; > > You clearing this flag while logging the action, but same is not taken > care during replay. Any reasons? That's because we conditionally WAL Log this flag status and when we do so, we take a it's FPI. > > 2. > + /* No ereport(ERROR) until changes are logged */ > + START_CRIT_SECTION > (); > + > + PageIndexMultiDelete(page, deletable, ndeletable); > + > + pageopaque = > (HashPageOpaque) PageGetSpecialPointer(page); > + pageopaque->hasho_flag &= > ~LH_PAGE_HAS_DEAD_TUPLES; > + > + /* > + * Write-lock the meta page so that we can > decrement > + * tuple count. 
> + */ > + LockBuffer(metabuf, > BUFFER_LOCK_EXCLUSIVE); > > The lock on buffer should be acquired before critical section. Point taken. I have taken care of it in the v7 patch. > > 3. > - > /* > - * Since this can be redone later if needed, mark as a > hint. > - */ > - MarkBufferDirtyHint(buf, true); > + > if (so->numKilled < MaxIndexTuplesPerPage) > + { > + so- >>killedItems[so->numKilled].heapTid = so->hashso_heappos; > + so- >>killedItems[so->numKilled].indexOffset = > + > ItemPointerGetOffsetNumber(&(so->hashso_curpos)); > + so->numKilled++; > + > } > > By looking at above code, the first thing that comes to mind is when > numKilled can become greater than MaxIndexTuplesPerPage and why we are > ignoring the marking of dead tuples when it becomes greater than > MaxIndexTuplesPerPage. After looking at similar btree code, I realize > that it could > happen if user reverses the scan direction. I think you should > mention in comments that see btgettuple to know the reason of > numKilled overun test or something like that. Added comment. Please refer to v7 patch. > > 4. > + * We match items by heap TID before assuming they are the right ones to > + * delete. If an item has > moved off the current page due to a split, we'll > + * fail to find it and do nothing (this is not an > error case --- we assume > + * the item will eventually get marked in a future indexscan). > + */ > +void > +_hash_kill_items(IndexScanDesc scan) > > I think this comment doesn't make much sense for hash index because > item won't move off from the current page due to split, only later > cleanup can remove it. Yes. The reason being, no cleanup will happen when scan in progress. Corrected it . > > 5. > > + > /* > * Maximum size of a hash index item (it's okay to have only one per page) > > Spurious white space change. Fixed. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
On Wed, Mar 15, 2017 at 9:28 PM, Robert Haas wrote: > On Wed, Mar 15, 2017 at 11:37 AM, Ashutosh Sharma > wrote: >>> +/* Get RelfileNode from relation OID */ >>> +rel = relation_open(htup->t_tableOid, NoLock); >>> +rnode = rel->rd_node; >>> +relation_close(rel, NoLock); >>> itup->t_tid = htup->t_self; >>> -_hash_doinsert(index, itup); >>> +_hash_doinsert(index, itup, rnode); >>> >>> This is an awfully low-level place to be doing something like this. >>> I'm not sure exactly where this should be happening, but not in the >>> per-tuple callback. >> >> Okay, Now I have done this inside _hash_doinsert() instead of callback >> function. Please have a look into the attached v7 patch. > > In the btree case, the heap relation isn't re-opened from anywhere in > the btree code. I think we should try to do the same thing here. If > we add an argument for the heap relation to _hash_doinsert(), > hashinsert() can easily it down; it's already got that value > available. There are two other calls to _hash_doinsert: > > 1. _h_indexbuild() calls _hash_doinsert(). It's called only from > hashbuild(), which has the heap relation available. So we can just > add that as an extra argument to _h_indexbuild() and then from there > pass it to _hash_doinsert. > > 2. hashbuildCallback calls _hash_doinsert(). It's sixth argument is a > HashBuildState which is set up by hashbuild(), which has the heap > relation available. So we can just add an extra member to the > HashBuildState and have hashbuild() set it before calling > IndexBuildHeapScan. hashbuildCallback can then fish it out of the > HashBuildState and pass it to _hash_doinsert(). Okay, I have done the changes as suggested by you. Please refer to the attached v8 patch. Thanks. 
-- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com microvacuum_hash_index_v8.patch Description: application/download -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
> +action = XLogReadBufferForRedo(record, 0, &buffer); > + > +if (!IsBufferCleanupOK(buffer)) > +elog(PANIC, "hash_xlog_vacuum_one_page: failed to acquire > cleanup lock"); > > That could fail, I think, because of a pin from a Hot Standby backend. > You want to call XLogReadBufferForRedoExtended() with a third argument > of true. Yes, there is a possibility that a new backend may start in the standby after we kill the conflicting backends. If the new backend has a pin on the buffer which the startup process is trying to read, then 'IsBufferCleanupOK' will fail, thereby causing the startup process to PANIC. > Come to think of it, shouldn't hash_xlog_split_allocate_page be changed > the same way? No, the reason being we are trying to allocate a new bucket page on the standby, so there can't be any backend which could have a pin on a page that is yet to be initialized. > > +/* > + * Let us mark the page as clean if vacuum removes the DEAD > tuples > + * from an index page. We do this by clearing > LH_PAGE_HAS_DEAD_TUPLES > + * flag. > + */ > > Maybe add: Clearing this flag is just a hint; replay won't redo this. Added. Please check the attached v9 patch. > > + * Hash Index records that are marked as LP_DEAD and being removed during > + * hash index tuple insertion can conflict with standby queries.You might > > The word Index shouldn't be capitalized here. There should be a space > before "You". Corrected. > > The formatting of this comment is oddly narrow: > > + * _hash_vacuum_one_page - vacuum just one index page. > + * Try to remove LP_DEAD items from the given page. We > + * must acquire cleanup lock on the page being modified > + * before calling this function. > > I'd add a blank line after the first line and reflow the rest to be > something more like 75 characters. pgindent evidently doesn't think > this needs reformatting, but it's oddly narrow. Corrected.
> > I suggest changing the header comment of > hash_xlog_vacuum_get_latestRemovedXid like this: > > + * Get the latestRemovedXid from the heap pages pointed at by the index > + * tuples being deleted. See also btree_xlog_delete_get_latestRemovedXid, > + * on which this function is based. > > This is looking good. Changed as per suggestions. Attached v9 patch. Thanks. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com microvacuum_hash_index_v9.patch Description: application/download -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
On Mar 16, 2017 7:49 AM, "Robert Haas" wrote: On Wed, Mar 15, 2017 at 4:31 PM, Robert Haas wrote: > On Wed, Mar 15, 2017 at 3:54 PM, Ashutosh Sharma wrote: >> Changed as per suggestions. Attached v9 patch. Thanks. > > Wow, when do you sleep? Will have a look. Committed with a few corrections. Thanks Robert for the commit. Thank you Amit and Jesper for reviewing this patch. With Regards, Ashutosh Sharma
Re: [HACKERS] Microvacuum support for Hash Index
On Thu, Mar 16, 2017 at 11:11 AM, Amit Kapila wrote: > On Wed, Mar 15, 2017 at 9:23 PM, Ashutosh Sharma > wrote: >> >>> >>> Few other comments: >>> 1. >>> + if (ndeletable > 0) >>> + { >>> + /* No ereport(ERROR) until changes are logged */ >>> + START_CRIT_SECTION(); >>> + >>> + PageIndexMultiDelete(page, deletable, ndeletable); >>> + >>> + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); >>> + pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES; >>> >>> You clearing this flag while logging the action, but same is not taken >>> care during replay. Any reasons? >> >> That's because we conditionally WAL-log this flag status and when we >> do so, we take its FPI. >> > > Sure, but we are not clearing it conditionally. I am not sure how, > after recovery, it will be cleared if it gets set during normal operation. > Moreover, btree already clears the similar flag during replay (refer > btree_xlog_delete). You were right. If data checksums are enabled or wal_log_hints is set to true, 'LH_PAGE_HAS_DEAD_TUPLES' will get WAL-logged and therefore needs to be cleared on the standby as well. Attached is the patch that clears this flag on the standby during replay. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com

Attachment: 0001-Reset-LH_PAGE_HAS_DEAD_TUPLES-flag-on-standby-when.patch
Re: [HACKERS] WAL Consistency checking for hash indexes
Hi, Attached is the patch that allows the WAL consistency tool to mask the 'LH_PAGE_HAS_DEAD_TUPLES' flag in hash indexes. The flag was added as part of the 'Microvacuum support for Hash index' patch. I have already tested it using Kuntal's WAL consistency tool and it works fine. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com On Wed, Mar 15, 2017 at 11:27 AM, Kuntal Ghosh wrote: > On Wed, Mar 15, 2017 at 12:32 AM, Robert Haas wrote: >> On Mon, Mar 13, 2017 at 10:36 AM, Ashutosh Sharma >> wrote: >>> A couple of review comments: >>> >>> You may also need to update the documentation as now we are also going >>> to support WAL consistency checking for hash indexes. The current >>> documentation does not include hash index. >>> >>> +only records originating from those resource managers. Currently, >>> +the supported resource managers are heap, >>> +heap2, btree, gin, >>> +gist, sequence, spgist, >>> +brin, and generic. Only >> >> Did that, committed this. Also ran pgindent over hash_mask() and >> fixed an instance of dubious capitalization. > Thanks Robert for the commit. > > > -- > Thanks & Regards, > Kuntal Ghosh > EnterpriseDB: http://www.enterprisedb.com 0001-Allow-WAL-consistency-tool-to-mask-LH_PAGE_HAS_DEAD_.patch
Re: [HACKERS] Microvacuum support for Hash Index
>>> Sure, but we are not clearing it conditionally. I am not sure how it >>> will be cleared after recovery if it gets set during normal operation. >>> Moreover, btree already clears a similar flag during replay (refer >>> btree_xlog_delete). >> >> You were right. In case data checksums are enabled or wal_log_hints is set >> to true, 'LH_PAGE_HAS_DEAD_TUPLES' will get WAL-logged and therefore >> needs to be cleared on the standby as well. >> > > I was thinking what bad can happen if we don't clear this flag during > replay; the main thing that comes to mind is that after crash > recovery, if the flag is set, the inserts on that page might need to > traverse all the tuples on that page once the page is full even if > there are no dead tuples in that page. It can be cleared later when > there are dead tuples in that page and we actually delete them, but I > don't think it is worth the price to pay for not clearing the flag > during replay. Yes, you are absolutely correct. If we do not clear this flag during replay, there is a possibility of _hash_doinsert() unnecessarily scanning a page with no free space on the assumption that the page contains some dead tuples, when in fact it does not. > >> Attached is the patch that >> clears this flag on standby during replay. >> > > Don't you think we should also clear it during the replay of > XLOG_HASH_DELETE? We might want to log the clearing of the flag along with > the WAL record for XLOG_HASH_DELETE. > Yes, it should be cleared. I completely missed this part in a hurry. Thanks for pointing it out. I have taken care of it in the attached v2 patch. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com 0001-Reset-LH_PAGE_HAS_DEAD_TUPLES-flag-on-standby-during.patch
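The bit manipulation under discussion can be shown with a small stand-alone C sketch. The flag name matches the one in the thread, but its numeric value and the struct here are illustrative stand-ins, not the actual PostgreSQL definitions from hash.h:

```c
#include <stdint.h>

/* Illustrative stand-in for the hash page flag bit; the real value is
 * defined in PostgreSQL's hash.h. */
#define LH_PAGE_HAS_DEAD_TUPLES ((uint16_t) (1 << 7))

/* Simplified stand-in for the page's special-area opaque data. */
typedef struct HashPageOpaqueSketch
{
    uint16_t    hasho_flag;     /* page type and status flags */
} HashPageOpaqueSketch;

/* What replay must do: clear the hint bit so that, after recovery, an
 * insert does not scan a full page looking for dead tuples that are
 * not there. */
static void
redo_clear_dead_tuples_flag(HashPageOpaqueSketch *opaque)
{
    opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
}

static int
page_has_dead_tuples(const HashPageOpaqueSketch *opaque)
{
    return (opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES) != 0;
}
```

Clearing with `&= ~flag` leaves all other page-type bits intact, which is why the redo routine can clear this one hint without disturbing the rest of the page's state.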
Re: [HACKERS] Microvacuum support for Hash Index
On Fri, Mar 17, 2017 at 8:20 AM, Amit Kapila wrote: > On Thu, Mar 16, 2017 at 9:39 PM, Ashutosh Sharma > wrote: >>>> >>> >>> Don't you think we should also clear it during the replay of >>> XLOG_HASH_DELETE? We might want to log the clearing of the flag along with >>> the WAL record for XLOG_HASH_DELETE. >>> >> >> Yes, it should be cleared. I completely missed this part in a hurry. >> Thanks for pointing it out. I have taken care of it in the attached v2 >> patch. >> > > + /* > + * Mark the page as not containing any LP_DEAD items. See comments > + * in hashbucketcleanup() for details. > + */ > + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); > + pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES; > > Your comment here says refer to hashbucketcleanup, and in that function > the comment says "Clearing this flag is just a hint; replay won't redo > this.". Both seem contradictory. You need to change the comment in > hashbucketcleanup. Done. Please check the attached v3 patch. > As I said in my previous e-mail, I think you need > to record clearing of this flag in the WAL record XLOG_HASH_DELETE, as you > are not doing this unconditionally, and then during replay clear it > only when the WAL record indicates the same. Thank you for raising that point. I too think that we should record the flag status in the WAL record and clear it only when required during replay. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com 0001-Reset-LH_PAGE_HAS_DEAD_TUPLES-flag-during-replay-v3.patch
Re: [HACKERS] WAL Consistency checking for hash indexes
On Fri, Mar 17, 2017 at 9:03 AM, Amit Kapila wrote: > On Thu, Mar 16, 2017 at 1:15 PM, Ashutosh Sharma > wrote: >> Hi, >> >> Attached is the patch that allows the WAL consistency tool to mask the >> 'LH_PAGE_HAS_DEAD_TUPLES' flag in hash indexes. The flag was added as >> part of the 'Microvacuum support for Hash index' patch. I have already >> tested it using Kuntal's WAL consistency tool and it works fine. >> > > + * unlogged. So, mask it. See _hash_kill_items(), MarkBufferDirtyHint() > + * for details. > + */ > > I think in the above comment, a reference to _hash_kill_items is > sufficient. Other than that, the patch looks okay. Okay, I have removed the reference to MarkBufferDirtyHint() from the above comment. Attached is the v2 version of the patch. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com 0001-Allow-WAL-consistency-tool-to-mask-LH_PAGE_HAS_DEAD_.patch
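To make the masking idea concrete, here is a simplified stand-alone model (not the real hash_mask() code): before the consistency checker compares the primary and standby page images, it clears any bit that may legitimately differ between the two, such as an unlogged hint bit. The flag value below is illustrative:

```c
#include <stdint.h>

#define LH_PAGE_HAS_DEAD_TUPLES ((uint16_t) (1 << 7))   /* illustrative value */

/* Clear the unlogged hint bit before comparison; flag_word stands in
 * for the hasho_flag field in a page's special area. */
static void
mask_hint_bit(uint16_t *flag_word)
{
    *flag_word &= ~LH_PAGE_HAS_DEAD_TUPLES;
}

/* Returns 1 when the two images agree once masked differences are
 * ignored; a difference in any other bit is still reported. */
static int
flags_consistent(uint16_t primary_flags, uint16_t standby_flags)
{
    mask_hint_bit(&primary_flags);
    mask_hint_bit(&standby_flags);
    return primary_flags == standby_flags;
}
```

The point of masking is exactly this asymmetry: the hint bit may be set on one side and not the other without indicating corruption, so it must not participate in the comparison.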
Re: [HACKERS] Microvacuum support for Hash Index
On Fri, Mar 17, 2017 at 6:13 PM, Amit Kapila wrote: > On Fri, Mar 17, 2017 at 12:27 PM, Ashutosh Sharma > wrote: >> On Fri, Mar 17, 2017 at 8:20 AM, Amit Kapila wrote: >> >> As I said in my previous e-mail, I think you need >>> to record clearing of this flag in WAL record XLOG_HASH_DELETE as you >>> are not doing this unconditionally and then during replay clear it >>> only when the WAL record indicates the same. >> >> Thank you so much for putting that point. I too think that we should >> record the flag status in the WAL record and clear it only when >> required during replay. >> > > I think hashdesc.c needs an update (refer case XLOG_HASH_DELETE:). Done. Thanks! > > - * flag. Clearing this flag is just a hint; replay won't redo this. > + * flag. Clearing this flag is just a hint; replay will check the > + * status of clear_dead_marking flag before redo it. > */ > if (tuples_removed && *tuples_removed > 0 && > opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES) > + { > opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES; > + clear_dead_marking = true; > + } > > > I feel the above comment is not required as you are logging this > action explicitly. That's right. I have removed it in the attached v4 patch. > > + bool clear_dead_marking; /* TRUE if VACUUM clears > > No need to write VACUUM explicitly, you can simply say "TRUE if this > operation clears ...". Corrected. Please find the attached v4 patch. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com 0001-Reset-LH_PAGE_HAS_DEAD_TUPLES-flag-during-replay-v4.patch Description: binary/octet-stream -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
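The design settled on above — record in the XLOG_HASH_DELETE record whether the flag was actually cleared, and replay the clear only when the record says so — can be sketched stand-alone like this. The clear_dead_marking field mirrors the one added by the patch; everything else is a simplified stand-in for the real WAL machinery:

```c
#include <stdbool.h>
#include <stdint.h>

#define LH_PAGE_HAS_DEAD_TUPLES ((uint16_t) (1 << 7))   /* illustrative value */

/* Stand-in for the xl_hash_delete WAL record payload. */
typedef struct XlHashDeleteSketch
{
    bool        clear_dead_marking;     /* TRUE if this operation clears
                                         * LH_PAGE_HAS_DEAD_TUPLES */
} XlHashDeleteSketch;

/* Primary side: clear the flag only when it was actually set, and
 * remember that fact in the WAL record. */
static void
do_delete(uint16_t *hasho_flag, XlHashDeleteSketch *xlrec)
{
    xlrec->clear_dead_marking = false;
    if (*hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
    {
        *hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
        xlrec->clear_dead_marking = true;
    }
}

/* Standby side: clear the flag only if the record says the primary did,
 * keeping both copies of the page in step. */
static void
redo_delete(uint16_t *hasho_flag, const XlHashDeleteSketch *xlrec)
{
    if (xlrec->clear_dead_marking)
        *hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
}
```

Because the clear is conditional on the primary, logging the decision (rather than clearing unconditionally during replay) is what keeps the primary and standby page images byte-identical for the consistency checker.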
Re: [HACKERS] pageinspect and hash indexes
On Fri, Mar 17, 2017 at 10:54 PM, Jeff Janes wrote: > While trying to figure out some bloating in the newly logged hash indexes, > I'm looking into the type of each page in the index. But I get an error: > > psql -p 9876 -c "select hash_page_type(get_raw_page('foo_index_idx',x)) from > generate_series(1650,1650) f(x)" > > ERROR: page is not a hash page > DETAIL: Expected ff80, got . > > The contents of the page are: > > \xa400d8f203bf65c91800f01ff01f0420... > > (where the elided characters at the end are all zero) > > What kind of page is that actually? It is basically either a newly allocated bucket page or a freed overflow page. You are seeing this error because of the following change made by the WAL patch for hash indexes. When we started working on WAL for hash indexes, we found that the WAL routine 'XLogReadBufferExtended' does not expect a page to be completely zero, else it returns an invalid buffer. To fix this, we started initializing freed overflow pages and new bucket pages using _hash_pageinit(), which initialises the page header portion but not its special area, where the page type information is present. That's why you are seeing an ERROR saying 'page is not a hash page'. Actually, the pageinspect module needs to handle this type of page. Currently it handles zero pages but not empty pages. I will submit a patch for this. > And isn't it unhelpful to have the > pageinspect module throw errors, rather than returning a dummy value to > indicate there was an error? Well, this is something not specific to hash indexes. So, I have no answer :) -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
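The distinction pageinspect has to make can be sketched stand-alone: a never-written page is all zero bytes, while a page initialized by _hash_pageinit() has a valid header but no hash data in its special area. The header values used here (pd_lower = 24, pd_upper = pd_special) follow the page dump quoted later in this thread; the struct is a simplified stand-in for the real PageHeaderData, so treat this as an illustration only:

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192
#define SKETCH_HEADER_SIZE 24   /* offset where line pointers would start,
                                 * per the pd_lower = 24 seen in the dump */

/* Minimal stand-in for a page header; the real PageHeaderData is in
 * PostgreSQL's bufpage.h. */
typedef struct PageHeaderSketch
{
    uint16_t    pd_lower;
    uint16_t    pd_upper;
    uint16_t    pd_special;
} PageHeaderSketch;

/* A page that is all zero bytes was never initialized at all. */
static int
page_is_zero(const unsigned char *page)
{
    for (size_t i = 0; i < SKETCH_BLCKSZ; i++)
        if (page[i] != 0)
            return 0;
    return 1;
}

/* A header-initialized page with no line pointers and nothing stored
 * in the would-be special area: the "empty page" case pageinspect
 * would also need to recognize instead of erroring out. */
static int
page_is_empty(const PageHeaderSketch *hdr)
{
    return hdr->pd_lower == SKETCH_HEADER_SIZE &&
           hdr->pd_special == hdr->pd_upper;
}
```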
Re: [HACKERS] pageinspect and hash indexes
On Sat, Mar 18, 2017 at 1:34 PM, Amit Kapila wrote: > On Sat, Mar 18, 2017 at 12:12 AM, Ashutosh Sharma > wrote: >> On Fri, Mar 17, 2017 at 10:54 PM, Jeff Janes wrote: >>> While trying to figure out some bloating in the newly logged hash indexes, >>> I'm looking into the type of each page in the index. But I get an error: >>> >>> psql -p 9876 -c "select hash_page_type(get_raw_page('foo_index_idx',x)) from >>> generate_series(1650,1650) f(x)" >>> >>> ERROR: page is not a hash page >>> DETAIL: Expected ff80, got . >>> >>> The contents of the page are: >>> >>> \xa400d8f203bf65c91800f01ff01f0420... >>> >>> (where the elided characters at the end are all zero) >>> >>> What kind of page is that actually? >> >> it is basically either a newly allocated bucket page or a freed overflow >> page. >> > > What makes you think that it can be a newly allocated page? > Basically, we always initialize the special space of a newly allocated > page, so I am not sure what makes you deduce that it can be a newly > allocated page. I came to know this from the following experiment. I created a hash index and kept on inserting data into it till the split happened. When the split happened, I could see the following values for firstblock and lastblock in _hash_alloc_buckets(): Breakpoint 1, _hash_alloc_buckets (rel=0x7f6ac951ee30, firstblock=34, nblocks=32) at hashpage.c:993 (gdb) n (gdb) p firstblock $15 = 34 (gdb) p nblocks $16 = 32 (gdb) n (gdb) p lastblock $17 = 65 AFAIU, this bucket split resulted in the creation of new bucket pages from block numbers 34 to 65.
The contents of metap are as follows: (gdb) p *metap $18 = {hashm_magic = 105121344, hashm_version = 2, hashm_ntuples = 2593, hashm_ffactor = 81, hashm_bsize = 8152, hashm_bmsize = 4096, hashm_bmshift = 15, hashm_maxbucket = 32, hashm_highmask = 63, hashm_lowmask = 31, hashm_ovflpoint = 6, hashm_firstfree = 0, hashm_nmaps = 1, hashm_procid = 450, hashm_spares = {0, 0, 0, 0, 0, 1, 1, 0 <repeats 25 times>}, hashm_mapp = {33, 0 ...}} Now, if I try to check the page type for block number 65, this is what I see: test=# select * from hash_page_type(get_raw_page('con_hash_index', 65)); ERROR: page is not a hash page DETAIL: Expected ff80, got . test=# -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] ANALYZE command progress checker
Hi Vinayak, Here are a couple of review comments that may need your attention. 1) Firstly, I am seeing some trailing whitespace errors when trying to apply your v3 patch using the git apply command. [ashu@localhost postgresql]$ git apply pg_stat_progress_analyze_v3.patch pg_stat_progress_analyze_v3.patch:155: trailing whitespace. CREATE VIEW pg_stat_progress_analyze AS pg_stat_progress_analyze_v3.patch:156: trailing whitespace. SELECT pg_stat_progress_analyze_v3.patch:157: trailing whitespace. S.pid AS pid, S.datid AS datid, D.datname AS datname, pg_stat_progress_analyze_v3.patch:158: trailing whitespace. S.relid AS relid, pg_stat_progress_analyze_v3.patch:159: trailing whitespace. CASE S.param1 WHEN 0 THEN 'initializing' error: patch failed: doc/src/sgml/monitoring.sgml:521 2) The test-case 'rules' is failing. I think it's failing because 'ORDER BY viewname' is used in rules.sql, which will put 'pg_stat_progress_analyze' ahead of 'pg_stat_progress_vacuum' in the list of catalog views. You may need to correct the rules.out file accordingly. This is what I could see in rules.sql: SELECT viewname, definition FROM pg_views WHERE schemaname <> 'information_schema' ORDER BY viewname; I am still reviewing your patch and doing some testing. Will update if I find any issues. Thanks. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com On Fri, Mar 17, 2017 at 3:16 PM, vinayak wrote: > > On 2017/03/17 10:38, Robert Haas wrote: >> >> On Fri, Mar 10, 2017 at 2:46 AM, vinayak >> wrote: >>> >>> Thank you for reviewing the patch. >>> >>> The attached patch incorporated Michael and Amit comments also. >> >> I reviewed this tonight. > > Thank you for reviewing the patch. >> >> +/* Report compute index stats phase */ >> +pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE, >> + >> PROGRESS_ANALYZE_PHASE_COMPUTE_INDEX_STATS); >> >> Hmm, is there really any point to this? And is it even accurate?
It >> doesn't look to me like we are computing any index statistics here; >> we're only allocating a few in-memory data structures here, I think, >> which is not a statistics computation and probably doesn't take any >> significant time. >> >> +/* Report compute heap stats phase */ >> +pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE, >> + >> PROGRESS_ANALYZE_PHASE_COMPUTE_HEAP_STATS); >> >> The phase you label as computing heap statistics also includes the >> computation of index statistics. I wouldn't bother separating the >> computation of heap and index statistics; I'd just have a "compute >> statistics" phase that begins right after the comment that starts >> "Compute the statistics." > > Understood. Fixed in the attached patch. >> >> >> +/* Report that we are now doing index cleanup */ >> +pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE, >> +PROGRESS_ANALYZE_PHASE_INDEX_CLEANUP); >> >> OK, this doesn't make any sense either. We are not cleaning up the >> indexes here. We are just closing them and releasing resources. I >> don't see why you need this phase at all; it can't last more than some >> tiny fraction of a second. It seems like you're trying to copy the >> exact same phases that exist for vacuum instead of thinking about what >> makes sense for ANALYZE. > > Understood. I have removed this phase. >> >> >> +/* Report number of heap blocks scaned so far*/ >> +pgstat_progress_update_param(PROGRESS_ANALYZE_HEAP_BLKS_SCANNED, >> targblock); >> >> While I don't think this is necessarily a bad thing to report, I find >> it very surprising that you're not reporting what seems to me like the >> most useful thing here - namely, the number of blocks that will be >> sampled in total and the number of those that you've selected. >> Instead, you're just reporting the length of the relation and the >> last-sampled block, which is a less-accurate guide to total progress. >> There are enough counters to report both things, so maybe we should. 
>> >> +/* Report total number of sample rows*/ >> +pgstat_progress_update_param(PROGRESS_ANALYZE_TOTAL_SAMPLE_ROWS, >> numrows); >> >> As an alternative to the previous paragraph, yet another thing you >> could report is number of rows sampled out of total rows to be >> sampled. But this isn't the way to do it: you are reporting the >> number of rows you sampled
Re: [HACKERS] ANALYZE command progress checker
Hi, I didn't find any major issues with the patch. It works as expected. However, I have a few minor comments that I would like to share. + + Total number of sample rows. The sample it reads is taken randomly. + Its size depends on the default_statistics_target parameter value. + 1) I think it would be better if you could add a reference link to the GUC variable 'default_statistics_target'. If you go through monitoring.sgml, you will find that there are reference links to all the GUC variables being used. 2) I feel that the 'computing_statistics' phase could have been split into 'computing_column_stats' and 'computing_index_stats'. Please let me know your thoughts on this. + certain commands during command execution. Currently, the command + which supports progress reporting is VACUUM and ANALYZE. This may be expanded in the future. 3) I think the above needs to be rephrased. Something like: "Currently, the supported progress reporting commands are VACUUM and ANALYZE." Moreover, I also ran your patch on the Windows platform and didn't find any issues. Thanks. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com On Sat, Mar 18, 2017 at 5:30 PM, Ashutosh Sharma wrote: > Hi Vinayak, > > Here are a couple of review comments that may need your attention. > > 1) Firstly, I am seeing some trailing whitespace errors when trying to > apply your v3 patch using the git apply command. > > [ashu@localhost postgresql]$ git apply pg_stat_progress_analyze_v3.patch > pg_stat_progress_analyze_v3.patch:155: trailing whitespace. > CREATE VIEW pg_stat_progress_analyze AS > pg_stat_progress_analyze_v3.patch:156: trailing whitespace. > SELECT > pg_stat_progress_analyze_v3.patch:157: trailing whitespace. > S.pid AS pid, S.datid AS datid, D.datname AS datname, > pg_stat_progress_analyze_v3.patch:158: trailing whitespace. > S.relid AS relid, > pg_stat_progress_analyze_v3.patch:159: trailing whitespace.
> CASE S.param1 WHEN 0 THEN 'initializing' > error: patch failed: doc/src/sgml/monitoring.sgml:521 > > 2) The test-case 'rules' is failing. I think it's failing because in > rules.sql 'ORDERBY viewname' is used which will put > 'pg_stat_progress_analyze' ahead of 'pg_stat_progress_vacuum' in the > list of catalog views. You may need to correct rules.out file > accordingly. This is what i could see in rules.sql, > > SELECT viewname, definition FROM pg_views WHERE schemaname <> > 'information_schema' ORDER BY viewname; > > I am still reviewing your patch and doing some testing. Will update if > i find any issues. Thanks. > > -- > With Regards, > Ashutosh Sharma > EnterpriseDB:http://www.enterprisedb.com > > On Fri, Mar 17, 2017 at 3:16 PM, vinayak > wrote: >> >> On 2017/03/17 10:38, Robert Haas wrote: >>> >>> On Fri, Mar 10, 2017 at 2:46 AM, vinayak >>> wrote: >>>> >>>> Thank you for reviewing the patch. >>>> >>>> The attached patch incorporated Michael and Amit comments also. >>> >>> I reviewed this tonight. >> >> Thank you for reviewing the patch. >>> >>> +/* Report compute index stats phase */ >>> +pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE, >>> + >>> PROGRESS_ANALYZE_PHASE_COMPUTE_INDEX_STATS); >>> >>> Hmm, is there really any point to this? And is it even accurate? It >>> doesn't look to me like we are computing any index statistics here; >>> we're only allocating a few in-memory data structures here, I think, >>> which is not a statistics computation and probably doesn't take any >>> significant time. >>> >>> +/* Report compute heap stats phase */ >>> +pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE, >>> + >>> PROGRESS_ANALYZE_PHASE_COMPUTE_HEAP_STATS); >>> >>> The phase you label as computing heap statistics also includes the >>> computation of index statistics. 
I wouldn't bother separating the >>> computation of heap and index statistics; I'd just have a "compute >>> statistics" phase that begins right after the comment that starts >>> "Compute the statistics." >> >> Understood. Fixed in the attached patch. >>> >>> >>> +/* Report that we are now doing index cleanup */ >>> +pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE, >>> +PROGRESS_ANALYZE_PHASE_INDEX_CLEANUP); >
Re: [HACKERS] pageinspect and hash indexes
On Mon, Mar 20, 2017 at 9:31 AM, Amit Kapila wrote: > On Sat, Mar 18, 2017 at 5:13 PM, Ashutosh Sharma > wrote: >> On Sat, Mar 18, 2017 at 1:34 PM, Amit Kapila wrote: >>> On Sat, Mar 18, 2017 at 12:12 AM, Ashutosh Sharma >>> wrote: >>>> On Fri, Mar 17, 2017 at 10:54 PM, Jeff Janes wrote: >>>>> While trying to figure out some bloating in the newly logged hash indexes, >>>>> I'm looking into the type of each page in the index. But I get an error: >>>>> >>>>> psql -p 9876 -c "select hash_page_type(get_raw_page('foo_index_idx',x)) >>>>> from >>>>> generate_series(1650,1650) f(x)" >>>>> >>>>> ERROR: page is not a hash page >>>>> DETAIL: Expected ff80, got . >>>>> >>>>> The contents of the page are: >>>>> >>>>> \xa400d8f203bf65c91800f01ff01f0420... >>>>> >>>>> (where the elided characters at the end are all zero) >>>>> >>>>> What kind of page is that actually? >>>> >>>> it is basically either a newly allocated bucket page or a freed overflow >>>> page. >>>> >>> >>> What makes you think that it can be a newly allocated page? >>> Basically, we always initialize the special space of newly allocated >>> page, so not sure what makes you deduce that it can be newly allocated >>> page. >> >> I came to know this from the following experiment. >> >> I created a hash index and kept on inserting data in it till the split >> happens. >> >> When split happened, I could see following values for firstblock and >> lastblock in _hash_alloc_buckets() >> >> Breakpoint 1, _hash_alloc_buckets (rel=0x7f6ac951ee30, firstblock=34, >> nblocks=32) at hashpage.c:993 >> (gdb) n >> (gdb) pfirstblock >> $15 = 34 >> (gdb) pnblocks >> $16 = 32 >> (gdb) n >> (gdb) plastblock >> $17 = 65 >> >> AFAIU, this bucket split resulted in creation of new bucket pages from >> block number 34 to 65. 
>> >> The contents of metap are as follows: >> >> (gdb) p *metap >> $18 = {hashm_magic = 105121344, hashm_version = 2, hashm_ntuples = >> 2593, hashm_ffactor = 81, hashm_bsize = 8152, hashm_bmsize = 4096, >> hashm_bmshift = 15, >> hashm_maxbucket = 32, hashm_highmask = 63, hashm_lowmask = 31, >> hashm_ovflpoint = 6, hashm_firstfree = 0, hashm_nmaps = 1, >> hashm_procid = 450, >> hashm_spares = {0, 0, 0, 0, 0, 1, 1, 0 <repeats 25 times>}, >> hashm_mapp = {33, 0 ...}} >> >> Now, if I try to check the page type for block number 65, this is what I see: >> >> test=# select * from hash_page_type(get_raw_page('con_hash_index', 65)); >> ERROR: page is not a hash page >> DETAIL: Expected ff80, got . >> test=# >> > > The contents of such a page should be zero, and Jeff has reported some > valid-looking contents of the page. If you see this page's contents as > zero, then we can conclude that what Jeff is seeing was a freed overflow > page. As shown in the mail thread [1], the contents of the metapage are: (gdb) p *metap $18 = {hashm_magic = 105121344, hashm_version = 2, hashm_ntuples = 2593, hashm_ffactor = 81, hashm_bsize = 8152, hashm_bmsize = 4096, hashm_bmshift = 15, hashm_maxbucket = 32, hashm_highmask = 63, hashm_lowmask = 31, hashm_ovflpoint = 6, hashm_firstfree = 0, hashm_nmaps = 1, hashm_procid = 450, hashm_spares = {0, 0, 0, 0, 0, 1, 1, 0 <repeats 25 times>}, hashm_mapp = {33, 0 ...}} postgres=# select spares from hash_metapage_info(get_raw_page('con_hash_index', 0)); spares --- {0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0} (1 row) Here, if you see, the spare page count is just 1, which corresponds to the bitmap page. So that means there is no overflow page in the hash index, and nor have I run any DELETE statements in my experiment that would result in freeing an overflow page.
Also, the page header content for such a page is: $9 = {pd_lsn = {xlogid = 0, xrecoff = 23638120}, pd_checksum = 0, pd_flags = 0, pd_lower = 24, pd_upper = 8176, pd_special = 8176, pd_pagesize_version = 8196, pd_prune_xid = 0, pd_linp = 0x1f3aa88} From the values of pd_lower and pd_upper it is clear that it is an empty page. The content of this page is: \xb0308a011800f01ff01f042000. I think it is not just happening for freed overflow pages but also for newly allocated
[HACKERS] comments in hash_alloc_buckets
Hi, While working on [1], I realised that the following comment in _hash_alloc_buckets() needs to be corrected: /* * Initialize the freed overflow page. Just zeroing the page won't work, * See _hash_freeovflpage for similar usage. */ _hash_pageinit(page, BLCKSZ); Attached is the patch that corrects the above comment. Thanks. [1] - https://www.postgresql.org/message-id/CAMkU%3D1y6NjKmqbJ8wLMhr%3DF74WzcMALYWcVFhEpm7i%3DmV%3DXsOg%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com corrected_comments_hash_alloc_buckets.patch
Re: [HACKERS] pageinspect and hash indexes
On Mon, Mar 20, 2017 at 6:53 PM, Ashutosh Sharma wrote: > On Mon, Mar 20, 2017 at 9:31 AM, Amit Kapila wrote: >> On Sat, Mar 18, 2017 at 5:13 PM, Ashutosh Sharma >> wrote: >>> On Sat, Mar 18, 2017 at 1:34 PM, Amit Kapila >>> wrote: >>>> On Sat, Mar 18, 2017 at 12:12 AM, Ashutosh Sharma >>>> wrote: >>>>> On Fri, Mar 17, 2017 at 10:54 PM, Jeff Janes wrote: >>>>>> While trying to figure out some bloating in the newly logged hash >>>>>> indexes, >>>>>> I'm looking into the type of each page in the index. But I get an error: >>>>>> >>>>>> psql -p 9876 -c "select hash_page_type(get_raw_page('foo_index_idx',x)) >>>>>> from >>>>>> generate_series(1650,1650) f(x)" >>>>>> >>>>>> ERROR: page is not a hash page >>>>>> DETAIL: Expected ff80, got . >>>>>> >>>>>> The contents of the page are: >>>>>> >>>>>> \xa400d8f203bf65c91800f01ff01f0420... >>>>>> >>>>>> (where the elided characters at the end are all zero) >>>>>> >>>>>> What kind of page is that actually? >>>>> >>>>> it is basically either a newly allocated bucket page or a freed overflow >>>>> page. >>>>> >>>> >>>> What makes you think that it can be a newly allocated page? >>>> Basically, we always initialize the special space of newly allocated >>>> page, so not sure what makes you deduce that it can be newly allocated >>>> page. >>> >>> I came to know this from the following experiment. >>> >>> I created a hash index and kept on inserting data in it till the split >>> happens. >>> >>> When split happened, I could see following values for firstblock and >>> lastblock in _hash_alloc_buckets() >>> >>> Breakpoint 1, _hash_alloc_buckets (rel=0x7f6ac951ee30, firstblock=34, >>> nblocks=32) at hashpage.c:993 >>> (gdb) n >>> (gdb) pfirstblock >>> $15 = 34 >>> (gdb) pnblocks >>> $16 = 32 >>> (gdb) n >>> (gdb) plastblock >>> $17 = 65 >>> >>> AFAIU, this bucket split resulted in creation of new bucket pages from >>> block number 34 to 65. 
>>> >>> The contents of metap are as follows: >>> >>> (gdb) p *metap >>> $18 = {hashm_magic = 105121344, hashm_version = 2, hashm_ntuples = >>> 2593, hashm_ffactor = 81, hashm_bsize = 8152, hashm_bmsize = 4096, >>> hashm_bmshift = 15, >>> hashm_maxbucket = 32, hashm_highmask = 63, hashm_lowmask = 31, >>> hashm_ovflpoint = 6, hashm_firstfree = 0, hashm_nmaps = 1, >>> hashm_procid = 450, >>> hashm_spares = {0, 0, 0, 0, 0, 1, 1, 0 <repeats 25 times>}, >>> hashm_mapp = {33, 0 ...}} >>> >>> Now, if I try to check the page type for block number 65, this is what I >>> see: >>> >>> test=# select * from hash_page_type(get_raw_page('con_hash_index', 65)); >>> ERROR: page is not a hash page >>> DETAIL: Expected ff80, got . >>> test=# >>> >> >> The contents of such a page should be zero, and Jeff has reported some >> valid-looking contents of the page. If you see this page's contents as >> zero, then we can conclude that what Jeff is seeing was a freed overflow >> page. > > As shown in the mail thread [1], the contents of the metapage are: > > (gdb) p *metap > $18 = {hashm_magic = 105121344, hashm_version = 2, hashm_ntuples > = 2593, hashm_ffactor = 81, hashm_bsize = 8152, hashm_bmsize = 4096, > hashm_bmshift = 15, hashm_maxbucket = 32, hashm_highmask = 63, > hashm_lowmask = 31, hashm_ovflpoint = 6, hashm_firstfree = 0, > hashm_nmaps = 1, > hashm_procid = 450, hashm_spares = {0, 0, 0, 0, 0, 1, 1, 0 <repeats > 25 times>}, hashm_mapp = {33, 0 ...}} > > postgres=# select spares from > hash_metapage_info(get_raw_page('con_hash_index', 0)); > spares > --- > {0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0} > (1 row) > > Here, if you see, the spare page count is just 1, which corresponds to > the bitmap page. So that means there is no overflow page in the hash > index, and nor have I run any DELETE statements in my experiment > that would result in freeing an overflow
Re: [HACKERS] segfault in hot standby for hash indexes
Hi Jeff, On Tue, Mar 21, 2017 at 1:54 PM, Amit Kapila wrote: > On Tue, Mar 21, 2017 at 1:28 PM, Jeff Janes wrote: >> Against an unmodified HEAD (17fa3e8), I got a segfault in the hot standby. >> > > I think I see the problem in hash_xlog_vacuum_get_latestRemovedXid(). > It seems to me that we are using one block_id for registering > the deleted items in the XLOG_HASH_VACUUM_ONE_PAGE record and a > different block_id for fetching those items in > hash_xlog_vacuum_get_latestRemovedXid(). So probably matching those > will fix this issue (instead of fetching block number and items from > block_id 1, we should use block_id 0). > Thanks for reporting this issue. As Amit said, it is happening due to a block_id mismatch. Attached is the patch that fixes the same. I apologise for such a silly mistake. Please note that I was not able to reproduce the issue on my local machine using the test script you shared. Could you please check whether you are still seeing the issue with the attached patch? Thanks in advance. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com corrected_block_id_reference_in_hash_vacuum_get_latestRemovedXid.patch
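The bug class is easy to model with a toy registry (this is not the real XLogRegisterBufData()/XLogRecGetBlockData() API, just an illustration of why the block_id used at registration time must match the one used at replay):

```c
#include <stddef.h>

#define MAX_BLOCK_IDS 4

/* Toy model of per-block data attached to a single WAL record;
 * file-scope statics are zero-initialized, so unused slots are NULL. */
static const void *reg_data[MAX_BLOCK_IDS];

static void
register_buf_data(int block_id, const void *data)
{
    if (block_id >= 0 && block_id < MAX_BLOCK_IDS)
        reg_data[block_id] = data;
}

/* Fetching with the wrong block_id yields NULL. In the real bug, the
 * deleted-items array was registered under block_id 0 but read back
 * under block_id 1, and dereferencing the bogus result crashed the
 * standby. */
static const void *
get_block_data(int block_id)
{
    if (block_id < 0 || block_id >= MAX_BLOCK_IDS)
        return NULL;
    return reg_data[block_id];
}
```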
Re: [HACKERS] segfault in hot standby for hash indexes
Hi Jeff, > > I can confirm that that fixes the seg faults for me. Thanks for the confirmation. > > Did you mean you couldn't reproduce the problem in the first place, or that > you could reproduce it and now the patch fixes it? If the first of those, I > forgot to say you do have to wait for hot standby to reach consistency and > open for connections, and then connect to the standby ("psql -p 9874"), > before the seg fault will be triggered. I meant that I was not able to reproduce the issue on HEAD. > > But, there are places where hash_xlog_vacuum_get_latestRemovedXid diverges > from btree_xlog_delete_get_latestRemovedXid, which I don't understand the > reason for the divergence. Is there a reason we dropped the PANIC if we > have not reached consistency? Well, I'm not quite sure how the standby would allow any backend to connect to it until it has reached a consistent state. If you see the definition of btree_xlog_delete_get_latestRemovedXid(), just before the consistency check there is an if-condition 'if (CountDBBackends(InvalidOid) == 0)', which means we are checking for a consistent state only after knowing that there are some backends connected to the standby. So, is there a possibility of having some backend connected to the standby server without it being in a consistent state? > That is a case which should never happen, but it seems worth preserving > the PANIC. And why does this code get 'unused' from > XLogRecGetBlockData(record, 0, &len), while the btree code gets it from > xlrec? Is that because the record being replayed is structured differently > between btree and hash, or is there some other reason? > Yes, that's right. In the case of a btree index, we have used XLogRegisterData() to add data to the WAL record, and to fetch it we use XLogRecGetData(). In the case of a hash index, we have associated the same data with a registered buffer using XLogRegisterBufData(0, ...), and to fetch it we use XLogRecGetBlockData(0, ...).
Now, if you see XLogRecordAssemble, it first adds all the data assciated with registered buffers into the WAL record followed by the main data (). Hence, the WAL record in btree and hash are organised differently. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] segfault in hot standby for hash indexes
Hi, On Wed, Mar 22, 2017 at 8:41 AM, Amit Kapila wrote: > On Tue, Mar 21, 2017 at 11:49 PM, Ashutosh Sharma > wrote: >>> >>> I can confirm that that fixes the seg faults for me. >> >> Thanks for confirmation. >> >>> >>> Did you mean you couldn't reproduce the problem in the first place, or that >>> you could reproduce it and now the patch fixes it? If the first of those, I >>> forget to say you do have to wait for hot standby to reach a consistency and >>> open for connections, and then connect to the standby ("psql -p 9874"), >>> before the seg fault will be triggered. >> >> I meant that I was not able to reproduce the issue on HEAD. >> >>> >>> But, there are places where hash_xlog_vacuum_get_latestRemovedXid diverges >>> from btree_xlog_delete_get_latestRemovedXid, which I don't understand the >>> reason for the divergence. Is there a reason we dropped the PANIC if we >>> have not reached consistency? >> >> Well, I'm not quite sure how would standby allow any backend to >> connect to it until it has reached to a consistent state. If you see >> the definition of btree_xlog_delete_get_latestRemovedXid(), just >> before consistency check there is a if-condition 'if >> (CountDBBackends(InvalidOid) == 0)' which means >> we are checking for consistent state only after knowing that there are >> some backends connected to the standby. So, Is there a possibility of >> having some backend connected to standby server without having it in >> consistent state. >> > > I don't think so, but I think we should have reachedConsistency check > and elog(PANIC,..) similar to btree. If you see other conditions > where we PANIC in btree or hash xlog code, you will notice that those > are also theoretically not possible cases. It seems this is to save > database from getting corrupt or behaving insanely if due to some > reason (like a coding error or others) the check fails. okay, agreed. I have included it in the attached patch. 
However, I am still able to see the crash reported by Jeff upthread [1]. I think there are a couple of things that need to be corrected inside hash_xlog_vacuum_get_latestRemovedXid():

1) As Amit mentioned in his earlier mail [2], the block id used for registering deleted items in XLOG_HASH_VACUUM_ONE_PAGE is different from the block id used for fetching those items. I had fixed this and shared the patch earlier. With that patch I still see the crash Jeff reported yesterday [1].

2) When a full page image of the registered block is taken, the modified data associated with that block is not included in the WAL record. In such a case, we won't be able to fetch the items array that we tried to include during xlog record (XLOG_HASH_VACUUM_ONE_PAGE) creation using the following call:

XLogRegisterBufData(0, (char *) deletable, ndeletable * sizeof(OffsetNumber));

If the above is not included in the WAL record, then fetching it using 'XLogRecGetBlockData(record, 0, &len);' will return a NULL pointer:

ptr = XLogRecGetBlockData(record, 0, &len);
unused = (OffsetNumber *) ptr;
iitemid = PageGetItemId(ipage, unused[i]);

Here, if ptr is NULL, dereferencing unused will cause a crash. To fix this, I think we should pass 'REGBUF_KEEP_DATA' while registering the buffer. Something like this:

- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_KEEP_DATA);

Attached is the patch that fixes this issue. Please have a look and let me know if you still have any concerns. Thank you.
[1] - https://www.postgresql.org/message-id/CAMkU%3D1w-9Qe%3DFf1o6bSaXpNO9wqpo7_9GL8_CVhw4BoVVHasqg%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAA4eK1%2BYUedok0%2Bmeynnf8K3fqFsfdMpEFz1JiKLyyNv46hjaA%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c index de7522e..75c7c43 100644 --- a/src/backend/access/hash/hash_xlog.c +++ b/src/backend/access/hash/hash_xlog.c @@ -977,12 +977,20 @@ hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record) return latestRemovedXid; /* + * Check if WAL replay has reached a consistent database state. If not, + * we must PANIC. See the definition of btree_xlog_delete_get_latestRemovedXid + * on which this function is based. + */ + if (!reachedConsistency) + elog(PANIC, "hash_xlog_vacuum_get_latestRemovedXid: cannot operate with inconsistent data"); + + /* * Get index page. If the DB is consistent, this should not fail, nor * should any of the heap page fetches below. If one does, we return * I
Re: [HACKERS] Page Scan Mode in Hash Index
Hi,

>> Attached patch modifies hash index scan code for page-at-a-time mode.
>> For better readability, I have split it into 3 parts,
>
> Due to the commits on master these patches apply with hunks.
>
> The README should be updated to mention the use of page scan.

Done. Please refer to the attached v2 version of the patch.

> hash.h needs pg_indent.

Fixed.

>> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
>> patch rewrites the hash index scan module to work in page-at-a-time
>> mode. It basically introduces two new functions -- _hash_readpage() and
>> _hash_saveitem(). The former is used to load all the qualifying tuples
>> from a target bucket or overflow page into an items array. The latter
>> one is used by _hash_readpage to save all the qualifying tuples found
>> in a page into an items array. Apart from that, this patch basically
>> cleans up _hash_first(), _hash_next() and hashgettuple().
>
> For _hash_next I don't see this - can you explain?

Sorry, it was wrongly copied from the btree code. I have corrected it now. Please check the attached v2 version of the patch.

> + *
> + * On failure exit (no more tuples), we release pin and set
> + * so->currPos.buf to InvalidBuffer.
>
> + * Returns true if any matching items are found else returns false.
>
> s/Returns/Return/g

Done.

>> 2) 0002-Remove-redundant-function-_hash_step-and-some-of-the.patch:
>> this patch basically removes the redundant function _hash_step() and
>> some of the unused members of the HashScanOpaqueData structure.
>
> Looks good.

>> 3) 0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch:
>> this patch basically improves the locking strategy for VACUUM in hash
>> index. As the new hash index scan works page-at-a-time, vacuum can
>> release the lock on the previous page before acquiring a lock on the
>> next page, hence improving hash index concurrency.
>
> +* As the new hash index scan work in page at a time mode,
>
> Remove 'new'.

Done.
> >> I have also done the benchmarking of this patch and would like to >> share the results for the same, >> >> Firstly, I have done the benchmarking with non-unique values and i >> could see a performance improvement of 4-7%. For the detailed results >> please find the attached file 'results-non-unique values-70ff', and >> ddl.sql, test.sql are test scripts used in this experimentation. The >> detail of non-default GUC params and pgbench command are mentioned in >> the result sheet. I also did the benchmarking with unique values at >> 300 and 1000 scale factor and its results are provided in >> 'results-unique-values-default-ff'. >> > > I'm seeing similar results, and especially with write heavy scenarios. Great..!! -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From 78445d11db7157908d5558eedd253b9b28fb1e3e Mon Sep 17 00:00:00 2001 From: ashu Date: Wed, 22 Mar 2017 18:43:40 +0530 Subject: [PATCH] Rewrite hash index scans to work a page at a timev2 Patch by Ashutosh Sharma --- src/backend/access/hash/README | 9 +- src/backend/access/hash/hash.c | 120 ++--- src/backend/access/hash/hashpage.c | 14 +- src/backend/access/hash/hashsearch.c | 330 ++- src/backend/access/hash/hashutil.c | 23 ++- src/include/access/hash.h| 44 + 6 files changed, 384 insertions(+), 156 deletions(-) diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README index 1541438..f0a7bdf 100644 --- a/src/backend/access/hash/README +++ b/src/backend/access/hash/README @@ -243,10 +243,11 @@ The reader algorithm is: -- then, per read request: reacquire content lock on current page step to next page if necessary (no chaining of content locks, but keep - the pin on the primary bucket throughout the scan; we also maintain - a pin on the page currently being scanned) - get tuple - release content lock + the pin on the primary bucket throughout the scan) +save all the matching tuples from current index page into an items array +release pin and content lock (but if 
it is primary bucket page retain +it's pin till the end of scan) +get tuple from an item array -- at scan shutdown: release all pins still held diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index 34cc08f..2450ee1 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@ -268,65 +268,24 @@ bool hashgettuple(IndexScanDesc scan, ScanDirection dir) { HashScanOpaque so = (HashScanOpaque) scan->opaque; - Relation rel = scan->ind
Re: [HACKERS] pageinspect and hash indexes
>> I think it is not just happening for freed overflow pages but also for
>> newly allocated bucket pages. It would be good if we could mark a freed
>> overflow page as an UNUSED page rather than just initialising its header
>> portion and leaving the page type in the special area as it is. Attached
>> is the patch '0001-mark_freed_ovflpage_as_UNUSED_pagetype.patch' that
>> marks a freed overflow page as an unused page.
>>
>
> _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
> +
> + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
> +
> + ovflopaque->hasho_prevblkno = InvalidBlockNumber;
> + ovflopaque->hasho_nextblkno = InvalidBlockNumber;
> + ovflopaque->hasho_bucket = -1;
> + ovflopaque->hasho_flag = LH_UNUSED_PAGE;
> + ovflopaque->hasho_page_id = HASHO_PAGE_ID;
> +
>
> You need similar initialization in replay routine as well.

I will do that and share the patch shortly. Thanks.

>> Also, I have now changed the pageinspect module to handle unused and
>> empty hash index pages. Attached is the patch
>> (0002-allow_pageinspect_handle_UNUSED_OR_EMPTY_hash_pages.patch) for
>> the same.
>>
>
> 1.
> @@ -70,6 +70,17 @@ verify_hash_page(bytea *raw_page, int flags)
> pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
> +
> + /* Check if it is an unused hash page. */
> + if (pageopaque->hasho_flag == LH_UNUSED_PAGE)
> + return page;
>
> I don't see we need to have a separate check for unused page, why
> can't it be checked with other page types in below check.
>
> if (pagetype != LH_OVERFLOW_PAGE && pagetype != LH_BUCKET_PAGE &&
> pagetype != LH_BITMAP_PAGE && pagetype != LH_META_PAGE)

That is because an UNUSED page is as good as an empty page, except that an empty page doesn't have any page type. If we add a condition for the UNUSED page to the above check, an unused page will never be reported as unused; it will always be shown as an empty page. To avoid this, we return immediately at the start of the verify_hash_page function itself if we recognise the page as an UNUSED page.

> 2.
> + /* Check if it is an empty hash page. */
> + if (PageIsEmpty(page))
> + ereport(ERROR,
> + (errcode(ERRCODE_INDEX_CORRUPTED),
> + errmsg("index table contains empty page")));
>
> Do we want to give a separate message for EMPTY and NEW pages? Isn't
> it better that the same error message can be given for both of them as
> from user perspective there is not much difference between both the
> messages?

I think we should show separate messages because they are two different types of pages: one is initialised whereas the other is completely zeroed.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
Hi All,

I have tried to test 'group_update_clog_v11.1.patch', shared upthread by Amit, on a high-end machine. I have tested the patch with various numbers of savepoints in my test script. The machine details along with the test scripts and the test results are shown below.

Machine details:
24 sockets, 192 CPU(s)
RAM - 500GB

test script:
\set aid random (1,3000)
\set tid random (1,3000)
BEGIN;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s1;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
SAVEPOINT s2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s3;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
SAVEPOINT s4;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s5;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
END;

Non-default parameters:
max_connections = 200
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
checkpoint_timeout=900
synchronous_commit=off

pgbench -M prepared -c $thread -j $thread -T $time_for_reading postgres -f ~/test_script.sql
where time_for_reading = 10 mins

Test Results:

With 3 savepoints:
CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
128            50275        53704         6.82048732
64             62860        66561         5.887686923
8              18464        18752         1.559792028

With 5 savepoints:
CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
128            46559        47715         2.482871196
64             52306        52082         -0.4282491492
8              12289        12852         4.581332899

With 7 savepoints:
CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
128            41367        41500         0.3215123166
64             42996        41473         -3.542189971
8              9665         9657          -0.0827728919

With 10 savepoints:
CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
128            34513        34597         0.24338655
64             32581        32035         -1.67582
8              7293         7622          4.511175099

Conclusion: As seen from the test results above, there is some performance improvement with 3 savepoints; with 5 savepoints the results with the patch are slightly better than HEAD; and with 7 and 10 savepoints we do see a regression with the patch. Therefore, I think the threshold value of 4 for the number of subtransactions considered in the patch looks fine to me.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Tue, Mar 21, 2017 at 6:19 PM, Amit Kapila wrote:
> On Mon, Mar 20, 2017 at 8:27 AM, Robert Haas wrote:
> > On Fri, Mar 17, 2017 at 2:30 AM, Amit Kapila wrote:
> >>> I was wondering about doing an explicit test: if the XID being
> >>> committed matches the one in the PGPROC, and nsubxids matches, and the
> >>> actual list of XIDs matches, then apply the optimization. That could
> >>> replace the logic that you've proposed to exclude non-commit cases,
> >>> gxact cases, etc. and it seems fundamentally safer. But it might be a
> >>> more expensive test, too, so I'm not sure.
> >>
> >> I think if the number of subxids is very small let us say under 5 or
> >> so, then such a check might not matter, but otherwise it could be
> >> expensive.
> >
> > We could find out by testing it. We could also restrict the
> > optimization to cases with just a few subxids, because if you've got a
> > large number of subxids this optimization probably isn't buying much
> > anyway.
> >
>
> Yes, and I have modified the patch to compare xids and subxids for
> group update. In the initial short tests (with few client counts), it
> seems like till 3 savepoints we can win and 10 savepoints onwards
> there is some regression or at the very least there doesn't appear to
> be any benefit. We need more tests to identify what is the safe
> number, but I thought it is better to share the patch to see if we
> agree on the changes because if not, then the whole testing needs to
> be repeated. Let me know what do you think about attached?
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] segfault in hot standby for hash indexes
On Thu, Mar 23, 2017 at 9:17 AM, Amit Kapila wrote:
>
> On Thu, Mar 23, 2017 at 8:43 AM, Amit Kapila wrote:
> > On Wed, Mar 22, 2017 at 3:39 PM, Ashutosh Sharma
> > wrote:
> >> Hi,
> >>
> >> On Wed, Mar 22, 2017 at 8:41 AM, Amit Kapila
> >> wrote:
> >>
> >> To fix this, I think we should pass 'REGBUF_KEEP_DATA' while
> >> registering the buffer. Something like this,
> >>
> >> - XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
> >> + XLogRegisterBuffer(0, buf, REGBUF_STANDARD |
> >> REGBUF_KEEP_DATA);
> >>
> >> Attached is the patch that fixes this issue.
> >>
> >
> > I think this will work, but not sure if there is a merit to deviate
> > from what btree does to handle this case. One thing I find slightly
> > awkward in hash_xlog_vacuum_get_latestRemovedXid() is that you are
> > using a number of tuples registered as part of fixed data
> > (xl_hash_vacuum_one_page) to traverse the data registered as buf data.
> > I think it will be better if we register offsets also in fixed part of
> > data as we are doing btree case.

Agreed. I have made the changes accordingly. Please check the attached v2 patch.

> Also another small point in this regard, do we need two separate
> variables to track number of deleted items in below code? I think one
> variable is sufficient.
>
> _hash_vacuum_one_page()
> {
> ..
> deletable[ndeletable++] = offnum;
> tuples_removed += 1;
> ..
> }

Yes, I think 'ndeletable' alone should be fine.
-- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c index de7522e..d9ac42c 100644 --- a/src/backend/access/hash/hash_xlog.c +++ b/src/backend/access/hash/hash_xlog.c @@ -957,8 +957,6 @@ hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record) OffsetNumber hoffnum; TransactionId latestRemovedXid = InvalidTransactionId; int i; - char *ptr; - Size len; xlrec = (xl_hash_vacuum_one_page *) XLogRecGetData(record); @@ -977,12 +975,20 @@ hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record) return latestRemovedXid; /* + * Check if WAL replay has reached a consistent database state. If not, + * we must PANIC. See the definition of btree_xlog_delete_get_latestRemovedXid + * for more details. + */ + if (!reachedConsistency) + elog(PANIC, "hash_xlog_vacuum_get_latestRemovedXid: cannot operate with inconsistent data"); + + /* * Get index page. If the DB is consistent, this should not fail, nor * should any of the heap page fetches below. If one does, we return * InvalidTransactionId to cancel all HS transactions. That's probably * overkill, but it's safe, and certainly better than panicking here. */ - XLogRecGetBlockTag(record, 1, &rnode, NULL, &blkno); + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL); if (!BufferIsValid(ibuffer)) @@ -994,9 +1000,7 @@ hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record) * Loop through the deleted index items to obtain the TransactionId from * the heap items they point to. 
*/ - ptr = XLogRecGetBlockData(record, 1, &len); - - unused = (OffsetNumber *) ptr; + unused = (OffsetNumber *) ((char *) xlrec + SizeOfHashVacuumOnePage); for (i = 0; i < xlrec->ntuples; i++) { @@ -1121,23 +1125,15 @@ hash_xlog_vacuum_one_page(XLogReaderState *record) if (action == BLK_NEEDS_REDO) { - char *ptr; - Size len; - - ptr = XLogRecGetBlockData(record, 0, &len); - page = (Page) BufferGetPage(buffer); - if (len > 0) + if (XLogRecGetDataLen(record) > SizeOfHashVacuumOnePage) { OffsetNumber *unused; - OffsetNumber *unend; - unused = (OffsetNumber *) ptr; - unend = (OffsetNumber *) ((char *) ptr + len); + unused = (OffsetNumber *) ((char *) xldata + SizeOfHashVacuumOnePage); - if ((unend - unused) > 0) -PageIndexMultiDelete(page, unused, unend - unused); + PageIndexMultiDelete(page, unused, xldata->ntuples); } /* diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c index 8640e85..8699b5b 100644 --- a/src/backend/access/hash/hashinsert.c +++ b/src/backend/access/hash/hashinsert.c @@ -344,7 +344,6 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf, Page page = BufferGetPage(buf); HashPageOpaque pageopaque; HashMetaPage metap; - double tuples_removed = 0; /* Scan each tuple in page to see if it is marked as LP_DEAD */ maxoff = PageGetMaxOffsetNumber(page); @@ -355
Re: [HACKERS] pageinspect and hash indexes
Hi Amit,

>>> +
>>> + /* Check if it is an unused hash page. */
>>> + if (pageopaque->hasho_flag == LH_UNUSED_PAGE)
>>> + return page;
>>>
>>> I don't see we need to have a separate check for unused page, why
>>> can't it be checked with other page types in below check.
>>>
>>> if (pagetype != LH_OVERFLOW_PAGE && pagetype != LH_BUCKET_PAGE &&
>>> pagetype != LH_BITMAP_PAGE && pagetype != LH_META_PAGE)
>>
>> That is because an UNUSED page is as good as an empty page except that
>> an empty page doesn't have any page type. If we add a condition for
>> checking the UNUSED page in the above check, it will never show an
>> unused page as unused; it will always show it as an empty page.
>>
>
> Oh, okay, but my main objection was that we should not check hash page
> type (hasho_flag) without ensuring whether it is a hash page. Can you
> try to adjust the above code so that this check can be moved after
> hasho_page_id check?

Yes, I got your point. I have done that, but then I had to remove the check for PageIsEmpty(). Anyway, I think the PageIsEmpty() condition will only be true for one page in the entire hash index and can be ignored. If you wish, I could mention it in the documentation.

>
>> To avoid
>> this, at the start of the verify_hash_page function itself, if we
>> recognise the page as an UNUSED page we return immediately.
>>
>>>
>>> 2.
>>> + /* Check if it is an empty hash page. */
>>> + if (PageIsEmpty(page))
>>> + ereport(ERROR,
>>> + (errcode(ERRCODE_INDEX_CORRUPTED),
>>> + errmsg("index table contains empty page")));
>>>
>>>
>>> Do we want to give a separate message for EMPTY and NEW pages? Isn't
>>> it better that the same error message can be given for both of them as
>>> from user perspective there is not much difference between both the
>>> messages?
>>
>> I think we should show separate messages because they are two different
>> types of pages. In the sense that one is initialised whereas the other
>> is completely zeroed.
>> > > I understand your point, but not sure if it makes any difference to user. > okay, I have now anyways removed the check for PageIsEmpty. Please refer to the attached '0002 allow_pageinspect_handle_UNUSED_hash_pages.patch' Also, I have attached '0001-Mark-freed-overflow-page-as-UNUSED-pagetype-v2.patch' that handles your comment mentioned in [1]. [1] - https://www.postgresql.org/message-id/CAA4eK1%2BVE_TDRLWpyeOf%2B7%2B6if68kgPNwO4guKo060rm_t3O5w%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From 86eb711ea80b495c9d2f5228558682d0e95c41eb Mon Sep 17 00:00:00 2001 From: ashu Date: Thu, 23 Mar 2017 16:47:17 +0530 Subject: [PATCH] Mark freed overflow page as UNUSED pagetype v2 --- src/backend/access/hash/hash_xlog.c | 9 + src/backend/access/hash/hashovfl.c | 13 - 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c index d9ac42c..2ccaf46 100644 --- a/src/backend/access/hash/hash_xlog.c +++ b/src/backend/access/hash/hash_xlog.c @@ -697,11 +697,20 @@ hash_xlog_squeeze_page(XLogReaderState *record) if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO) { Page ovflpage; + HashPageOpaque ovflopaque; ovflpage = BufferGetPage(ovflbuf); _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf)); + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage); + + ovflopaque->hasho_prevblkno = InvalidBlockNumber; + ovflopaque->hasho_nextblkno = InvalidBlockNumber; + ovflopaque->hasho_bucket = -1; + ovflopaque->hasho_flag = LH_UNUSED_PAGE; + ovflopaque->hasho_page_id = HASHO_PAGE_ID; + PageSetLSN(ovflpage, lsn); MarkBufferDirty(ovflbuf); } diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c index a3cae21..d3eaee0 100644 --- a/src/backend/access/hash/hashovfl.c +++ b/src/backend/access/hash/hashovfl.c @@ -591,9 +591,20 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf, /* * Initialize the freed 
overflow page. Just zeroing the page won't work, * because WAL replay routines expect pages to be initialized. See - * explanation of RBM_NORMAL mode atop XLogReadBufferExtended. + * explanation of RBM_NORMAL mode atop XLogReadBufferExtended. Also, mark + * the page as UNUSED type and retain it's page id. This allows the tools + * like pageinspect to consider it as a hash page. */ _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf)); + + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage); + + ovflo
Re: [HACKERS] Add pgstathashindex() to get hash index table statistics.
Hi, On Tue, Feb 7, 2017 at 9:23 AM, Robert Haas wrote: > On Mon, Feb 6, 2017 at 10:40 PM, Amit Kapila wrote: >>> Maybe we should call them "unused pages". >> >> +1. If we consider some more names for that column then probably one >> alternative could be "empty pages". > > Yeah, but I think "unused" might be better. Because a page could be > in use (as an overflow page or primary bucket page) and still be > empty. > Based on the earlier discussions, I have prepared a patch that would allow pgstathashindex() to show the number of unused pages in hash index. Please find the attached patch for the same. Thanks. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From e3b59fa85f16d6d15be5360e85b7faf63e8683a9 Mon Sep 17 00:00:00 2001 From: ashu Date: Thu, 23 Mar 2017 23:02:26 +0530 Subject: [PATCH] Allow pgstathashindex to show unused pages v1 --- contrib/pgstattuple/expected/pgstattuple.out | 12 ++-- contrib/pgstattuple/pgstatindex.c | 19 --- contrib/pgstattuple/pgstattuple--1.4--1.5.sql | 1 + doc/src/sgml/pgstattuple.sgml | 17 - 4 files changed, 31 insertions(+), 18 deletions(-) diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out index 2c3515b..1f1ff46 100644 --- a/contrib/pgstattuple/expected/pgstattuple.out +++ b/contrib/pgstattuple/expected/pgstattuple.out @@ -132,9 +132,9 @@ select * from pgstatginindex('test_ginidx'); create index test_hashidx on test using hash (b); select * from pgstathashindex('test_hashidx'); - version | bucket_pages | overflow_pages | bitmap_pages | zero_pages | live_items | dead_items | free_percent --+--++--++++-- - 2 |4 | 0 |1 | 0 | 0 | 0 | 100 + version | bucket_pages | overflow_pages | bitmap_pages | zero_pages | unused_pages | live_items | dead_items | free_percent +-+--++--++--+++-- + 2 |4 | 0 |1 | 0 |0 | 0 | 0 | 100 (1 row) -- these should error with the wrong type @@ -233,9 +233,9 @@ select pgstatindex('test_partition_idx'); (1 row) select 
pgstathashindex('test_partition_hash_idx'); - pgstathashindex -- - (2,8,0,1,0,0,0,100) +pgstathashindex +--- + (2,8,0,1,0,0,0,0,100) (1 row) drop table test_partitioned; diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c index d448e9e..6fc41d6 100644 --- a/contrib/pgstattuple/pgstatindex.c +++ b/contrib/pgstattuple/pgstatindex.c @@ -120,6 +120,7 @@ typedef struct HashIndexStat BlockNumber overflow_pages; BlockNumber bitmap_pages; BlockNumber zero_pages; + BlockNumber unused_pages; int64 live_items; int64 dead_items; @@ -588,8 +589,8 @@ pgstathashindex(PG_FUNCTION_ARGS) BufferAccessStrategy bstrategy; HeapTuple tuple; TupleDesc tupleDesc; - Datum values[8]; - bool nulls[8]; + Datum values[9]; + bool nulls[9]; Buffer metabuf; HashMetaPage metap; float8 free_percent; @@ -667,6 +668,8 @@ pgstathashindex(PG_FUNCTION_ARGS) } else if (opaque->hasho_flag & LH_BITMAP_PAGE) stats.bitmap_pages++; + else if (PageIsEmpty(page)) +stats.unused_pages++; else ereport(ERROR, (errcode(ERRCODE_INDEX_CORRUPTED), @@ -680,8 +683,9 @@ pgstathashindex(PG_FUNCTION_ARGS) /* Done accessing the index */ index_close(rel, AccessShareLock); - /* Count zero pages as free space. */ - stats.free_space += stats.zero_pages * stats.space_per_page; + /* Count zero and unused pages as free space. 
*/ + stats.free_space += (stats.zero_pages + stats.unused_pages) * + stats.space_per_page; /* * Total space available for tuples excludes the metapage and the bitmap @@ -711,9 +715,10 @@ pgstathashindex(PG_FUNCTION_ARGS) values[2] = Int64GetDatum((int64) stats.overflow_pages); values[3] = Int64GetDatum((int64) stats.bitmap_pages); values[4] = Int64GetDatum((int64) stats.zero_pages); - values[5] = Int64GetDatum(stats.live_items); - values[6] = Int64GetDatum(stats.dead_items); - values[7] = Float8GetDatum(free_percent); + values[5] = Int64GetDatum((int64) stats.unused_pages); + values[6] = Int64GetDatum(stats.live_items); + values[7] = Int64GetDatum(stats.dead_items); + values[8] = Float8GetDatum(free_percent); tuple = heap_form_tuple(tupleDesc, values, nulls); PG_RETURN_DATUM(HeapTupleGetDatum(tuple)); diff --git a/contrib/pgstattuple/pgstattuple--1.4--1.5.sql b/contrib/pgstattuple/pgstattuple--1.4--1.5.sql index 84e112e..eda719a 100644 --- a/contr
Re: [HACKERS] Page Scan Mode in Hash Index
On Thu, Mar 23, 2017 at 8:29 PM, Jesper Pedersen wrote: > Hi, > > On 03/22/2017 09:32 AM, Ashutosh Sharma wrote: >> >> Done. Please refer to the attached v2 version of patch. >> > > Thanks. > >>>> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this >>>> patch rewrites the hash index scan module to work in page-at-a-time >>>> mode. It basically introduces two new functions-- _hash_readpage() and >>>> _hash_saveitem(). The former is used to load all the qualifying tuples >>>> from a target bucket or overflow page into an items array. The latter >>>> one is used by _hash_readpage to save all the qualifying tuples found >>>> in a page into an items array. Apart from that, this patch bascially >>>> cleans _hash_first(), _hash_next and hashgettuple(). >>>> > > 0001v2: > > In hashgettuple() you can remove the 'currItem' and 'offnum' from the 'else' > part, and do the assignment inside > > if (so->numKilled < MaxIndexTuplesPerPage) > > instead. > Done. Please have a look into the attached v3 patch. > > No new comments for 0002 and 0003. okay. Thanks. 
Re: [HACKERS] Page Scan Mode in Hash Index
> Hi, > > On 03/23/2017 02:11 PM, Ashutosh Sharma wrote: >> >> On Thu, Mar 23, 2017 at 8:29 PM, Jesper Pedersen >> wrote: >>> >>> 0001v2: >>> >>> In hashgettuple() you can remove the 'currItem' and 'offnum' from the >>> 'else' >>> part, and do the assignment inside >>> >>> if (so->numKilled < MaxIndexTuplesPerPage) >>> >>> instead. >>> >> >> Done. Please have a look into the attached v3 patch. >> >>> >>> No new comments for 0002 and 0003. >> >> >> okay. Thanks. >> > > I'll keep the entry in 'Needs Review' if Alexander, or others, want to add > their feedback. okay. Thanks. > > (Best to post the entire patch series each time) I take your suggestion. Please find the attachments. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From 4e953c35da2274165b00d763500b83e0f3f9e2a9 Mon Sep 17 00:00:00 2001 From: ashu Date: Thu, 23 Mar 2017 23:36:05 +0530 Subject: [PATCH] Rewrite hash index scans to work a page at a timev3 Patch by Ashutosh Sharma --- src/backend/access/hash/README | 9 +- src/backend/access/hash/hash.c | 121 +++-- src/backend/access/hash/hashpage.c | 14 +- src/backend/access/hash/hashsearch.c | 330 ++- src/backend/access/hash/hashutil.c | 23 ++- src/include/access/hash.h| 44 + 6 files changed, 385 insertions(+), 156 deletions(-) diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README index 1541438..f0a7bdf 100644 --- a/src/backend/access/hash/README +++ b/src/backend/access/hash/README @@ -243,10 +243,11 @@ The reader algorithm is: -- then, per read request: reacquire content lock on current page step to next page if necessary (no chaining of content locks, but keep - the pin on the primary bucket throughout the scan; we also maintain - a pin on the page currently being scanned) - get tuple - release content lock + the pin on the primary bucket throughout the scan) +save all the matching tuples from current index page into an items array +release pin and content lock (but if it is primary bucket page retain +it's 
pin till the end of scan) +get tuple from an item array -- at scan shutdown: release all pins still held diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index 34cc08f..8c28fbd 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@ -268,66 +268,23 @@ bool hashgettuple(IndexScanDesc scan, ScanDirection dir) { HashScanOpaque so = (HashScanOpaque) scan->opaque; - Relation rel = scan->indexRelation; - Buffer buf; - Page page; OffsetNumber offnum; - ItemPointer current; bool res; + HashScanPosItem *currItem; /* Hash indexes are always lossy since we store only the hash code */ scan->xs_recheck = true; /* - * We hold pin but not lock on current buffer while outside the hash AM. - * Reacquire the read lock here. - */ - if (BufferIsValid(so->hashso_curbuf)) - LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE); - - /* * If we've already initialized this scan, we can just advance it in the * appropriate direction. If we haven't done so yet, we call a routine to * get the first item in the scan. */ - current = &(so->hashso_curpos); - if (ItemPointerIsValid(current)) + if (!HashScanPosIsValid(so->currPos)) + res = _hash_first(scan, dir); + else { /* - * An insertion into the current index page could have happened while - * we didn't have read lock on it. Re-find our position by looking - * for the TID we previously returned. (Because we hold a pin on the - * primary bucket page, no deletions or splits could have occurred; - * therefore we can expect that the TID still exists in the current - * index page, at an offset >= where we were.) - */ - OffsetNumber maxoffnum; - - buf = so->hashso_curbuf; - Assert(BufferIsValid(buf)); - page = BufferGetPage(buf); - - /* - * We don't need test for old snapshot here as the current buffer is - * pinned, so vacuum can't clean the page. 
- */ - maxoffnum = PageGetMaxOffsetNumber(page); - for (offnum = ItemPointerGetOffsetNumber(current); - offnum <= maxoffnum; - offnum = OffsetNumberNext(offnum)) - { - IndexTuple itup; - - itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum)); - if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid))) -break; - } - if (offnum > maxoffnum) - elog(ERROR, "failed to re-find scan position within index \"%s\"", - RelationGetRelationName(rel)); - ItemPointerSetOffsetNumber(current, offnum); - - /* * Check to see if we should kill the previousl
Re: [HACKERS] pageinspect and hash indexes
On Fri, Mar 24, 2017 at 9:21 AM, Amit Kapila wrote: > On Thu, Mar 23, 2017 at 6:46 PM, Ashutosh Sharma > wrote: >>> >>> Oh, okay, but my main objection was that we should not check hash page >>> type (hasho_flag) without ensuring whether it is a hash page. Can you >>> try to adjust the above code so that this check can be moved after >>> hasho_page_id check? >> >> Yes, I got your point. I have done that but then i had to remove the >> check for PageIsEmpty(). Anyways, I think PageIsEmpty() condition will >> only be true for one page in entire hash index table and can be >> ignored. If you wish, I could mention about it in the documentation. >> > > Yeah, I think it is worth adding a Note in docs, but we can do that > separately if required. > >>> >>>> To avoid >>>> this, at the start of verify_hash_page function itself if we recognise >>>> page as UNUSED page we return immediately. >>>> >>>>> >>>>> 2. >>>>> + /* Check if it is an empty hash page. */ >>>>> + if (PageIsEmpty(page)) >>>>> + ereport(ERROR, >>>>> + (errcode(ERRCODE_INDEX_CORRUPTED), >>>>> + errmsg("index table contains empty page"))); >>>>> >>>>> >>>>> Do we want to give a separate message for EMPTY and NEW pages? Isn't >>>>> it better that the same error message can be given for both of them as >>>>> from user perspective there is not much difference between both the >>>>> messages? >>>> >>>> I think we should show separate message because they are two different >>>> type of pages. In the sense like, one is initialised whereas other is >>>> completely zero. >>>> >>> >>> I understand your point, but not sure if it makes any difference to user. >>> >> >> okay, I have now anyways removed the check for PageIsEmpty. Please >> refer to the attached '0002 >> allow_pageinspect_handle_UNUSED_hash_pages.patch' >> > > + > if (pageopaque->hasho_page_id != HASHO_PAGE_ID) > ereport(ERROR, > > spurious white space. 
> >> >> Also, I have attached >> '0001-Mark-freed-overflow-page-as-UNUSED-pagetype-v2.patch' that >> handles your comment mentioned in [1]. >> > > In general, we have to initialize prevblock with max_bucket, but here > it is okay, because we anyway initialize it after this page is > allocated as overflow page. > > Your patches look good to me except for small white space change. Thanks for reviewing my patch. I have removed the extra white space. Attached are both the patches. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From 86eb711ea80b495c9d2f5228558682d0e95c41eb Mon Sep 17 00:00:00 2001 From: ashu Date: Thu, 23 Mar 2017 16:47:17 +0530 Subject: [PATCH] Mark freed overflow page as UNUSED pagetype v2 --- src/backend/access/hash/hash_xlog.c | 9 + src/backend/access/hash/hashovfl.c | 13 - 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c index d9ac42c..2ccaf46 100644 --- a/src/backend/access/hash/hash_xlog.c +++ b/src/backend/access/hash/hash_xlog.c @@ -697,11 +697,20 @@ hash_xlog_squeeze_page(XLogReaderState *record) if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO) { Page ovflpage; + HashPageOpaque ovflopaque; ovflpage = BufferGetPage(ovflbuf); _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf)); + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage); + + ovflopaque->hasho_prevblkno = InvalidBlockNumber; + ovflopaque->hasho_nextblkno = InvalidBlockNumber; + ovflopaque->hasho_bucket = -1; + ovflopaque->hasho_flag = LH_UNUSED_PAGE; + ovflopaque->hasho_page_id = HASHO_PAGE_ID; + PageSetLSN(ovflpage, lsn); MarkBufferDirty(ovflbuf); } diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c index a3cae21..d3eaee0 100644 --- a/src/backend/access/hash/hashovfl.c +++ b/src/backend/access/hash/hashovfl.c @@ -591,9 +591,20 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf, /* * Initialize 
the freed overflow page. Just zeroing the page won't work, * because WAL replay routines expect pages to be initialized. See - * explanation of RBM_NORMAL mode atop XLogReadBufferExtended. + * explanation of RBM_NORMAL mode atop XLogReadBufferExtended. Also, mark + * the page as UNUSED type and retain it's page id. This allows the tools + * like pageinspect to consider it as a hash page.
Re: [HACKERS] pageinspect and hash indexes
On Fri, Mar 24, 2017 at 9:46 AM, Ashutosh Sharma wrote: > On Fri, Mar 24, 2017 at 9:21 AM, Amit Kapila wrote: >> On Thu, Mar 23, 2017 at 6:46 PM, Ashutosh Sharma >> wrote: >>>> >>>> Oh, okay, but my main objection was that we should not check hash page >>>> type (hasho_flag) without ensuring whether it is a hash page. Can you >>>> try to adjust the above code so that this check can be moved after >>>> hasho_page_id check? >>> >>> Yes, I got your point. I have done that but then i had to remove the >>> check for PageIsEmpty(). Anyways, I think PageIsEmpty() condition will >>> only be true for one page in entire hash index table and can be >>> ignored. If you wish, I could mention about it in the documentation. >>> >> >> Yeah, I think it is worth adding a Note in docs, but we can do that >> separately if required. >> >>>> >>>>> To avoid >>>>> this, at the start of verify_hash_page function itself if we recognise >>>>> page as UNUSED page we return immediately. >>>>> >>>>>> >>>>>> 2. >>>>>> + /* Check if it is an empty hash page. */ >>>>>> + if (PageIsEmpty(page)) >>>>>> + ereport(ERROR, >>>>>> + (errcode(ERRCODE_INDEX_CORRUPTED), >>>>>> + errmsg("index table contains empty page"))); >>>>>> >>>>>> >>>>>> Do we want to give a separate message for EMPTY and NEW pages? Isn't >>>>>> it better that the same error message can be given for both of them as >>>>>> from user perspective there is not much difference between both the >>>>>> messages? >>>>> >>>>> I think we should show separate message because they are two different >>>>> type of pages. In the sense like, one is initialised whereas other is >>>>> completely zero. >>>>> >>>> >>>> I understand your point, but not sure if it makes any difference to user. >>>> >>> >>> okay, I have now anyways removed the check for PageIsEmpty. 
Please >>> refer to the attached '0002 >>> allow_pageinspect_handle_UNUSED_hash_pages.patch' >>> >> >> + >> if (pageopaque->hasho_page_id != HASHO_PAGE_ID) >> ereport(ERROR, >> >> spurious white space. >> >>> >>> Also, I have attached >>> '0001-Mark-freed-overflow-page-as-UNUSED-pagetype-v2.patch' that >>> handles your comment mentioned in [1]. >>> >> >> In general, we have to initialize prevblock with max_bucket, but here >> it is okay, because we anyway initialize it after this page is >> allocated as overflow page. >> >> Your patches look good to me except for small white space change. > > Thanks for reviewing my patch. I have removed the extra white space. > Attached are both the patches. Sorry, I have mistakenly attached wrong patch. Here are the correct set of patches. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From 86eb711ea80b495c9d2f5228558682d0e95c41eb Mon Sep 17 00:00:00 2001 From: ashu Date: Thu, 23 Mar 2017 16:47:17 +0530 Subject: [PATCH] Mark freed overflow page as UNUSED pagetype v2 --- src/backend/access/hash/hash_xlog.c | 9 + src/backend/access/hash/hashovfl.c | 13 - 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c index d9ac42c..2ccaf46 100644 --- a/src/backend/access/hash/hash_xlog.c +++ b/src/backend/access/hash/hash_xlog.c @@ -697,11 +697,20 @@ hash_xlog_squeeze_page(XLogReaderState *record) if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO) { Page ovflpage; + HashPageOpaque ovflopaque; ovflpage = BufferGetPage(ovflbuf); _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf)); + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage); + + ovflopaque->hasho_prevblkno = InvalidBlockNumber; + ovflopaque->hasho_nextblkno = InvalidBlockNumber; + ovflopaque->hasho_bucket = -1; + ovflopaque->hasho_flag = LH_UNUSED_PAGE; + ovflopaque->hasho_page_id = HASHO_PAGE_ID; + PageSetLSN(ovflpage, lsn); MarkBufferDirty(ovflbuf); 
} diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c index a3cae21..d3eaee0 100644 --- a/src/backend/access/hash/hashovfl.c +++ b/src/backend/access/hash/hashovfl.c @@ -591,9 +591,20 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
Re: [HACKERS] segfault in hot standby for hash indexes
>>> > I think this will work, but not sure if there is a merit to deviate >>> > from what btree does to handle this case. One thing I find slightly >>> > awkward in hash_xlog_vacuum_get_latestRemovedXid() is that you are >>> > using a number of tuples registered as part of fixed data >>> > (xl_hash_vacuum_one_page) to traverse the data registered as buf data. >>> > I think it will be better if we register offsets also in fixed part of >>> > data as we are doing btree case. >> >> Agreed. I have made the changes accordingly. Please check attached v2 patch. >> > > Changes look good to me. I think you can modify the comments in > structure xl_hash_vacuum_one_page to mention "TARGET OFFSET NUMBERS > FOLLOW AT THE END" > Added the comment in xl_hash_vacuum_one_page structure. >>> >>> > >>> > >>> >>> Also another small point in this regard, do we need two separate >>> variables to track number of deleted items in below code? I think one >>> variable is sufficient. >>> >>> _hash_vacuum_one_page() >>> { >>> .. >>> deletable[ndeletable++] = offnum; >>> tuples_removed += 1;-- >>> >> >> Yes, I think 'ndeletable' alone should be fine. >> > > I think it would have been probably okay to use *int* for ntuples as > that matches with what you are actually assigning in the function. okay, corrected it. Attached is newer version of patch. 
-- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c index da4c2c5..2ccaf46 100644 --- a/src/backend/access/hash/hash_xlog.c +++ b/src/backend/access/hash/hash_xlog.c @@ -966,8 +966,6 @@ hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record) OffsetNumber hoffnum; TransactionId latestRemovedXid = InvalidTransactionId; int i; - char *ptr; - Size len; xlrec = (xl_hash_vacuum_one_page *) XLogRecGetData(record); @@ -986,12 +984,20 @@ hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record) return latestRemovedXid; /* + * Check if WAL replay has reached a consistent database state. If not, + * we must PANIC. See the definition of btree_xlog_delete_get_latestRemovedXid + * for more details. + */ + if (!reachedConsistency) + elog(PANIC, "hash_xlog_vacuum_get_latestRemovedXid: cannot operate with inconsistent data"); + + /* * Get index page. If the DB is consistent, this should not fail, nor * should any of the heap page fetches below. If one does, we return * InvalidTransactionId to cancel all HS transactions. That's probably * overkill, but it's safe, and certainly better than panicking here. */ - XLogRecGetBlockTag(record, 1, &rnode, NULL, &blkno); + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL); if (!BufferIsValid(ibuffer)) @@ -1003,9 +1009,7 @@ hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record) * Loop through the deleted index items to obtain the TransactionId from * the heap items they point to. 
*/ - ptr = XLogRecGetBlockData(record, 1, &len); - - unused = (OffsetNumber *) ptr; + unused = (OffsetNumber *) ((char *) xlrec + SizeOfHashVacuumOnePage); for (i = 0; i < xlrec->ntuples; i++) { @@ -1130,23 +1134,15 @@ hash_xlog_vacuum_one_page(XLogReaderState *record) if (action == BLK_NEEDS_REDO) { - char *ptr; - Size len; - - ptr = XLogRecGetBlockData(record, 0, &len); - page = (Page) BufferGetPage(buffer); - if (len > 0) + if (XLogRecGetDataLen(record) > SizeOfHashVacuumOnePage) { OffsetNumber *unused; - OffsetNumber *unend; - unused = (OffsetNumber *) ptr; - unend = (OffsetNumber *) ((char *) ptr + len); + unused = (OffsetNumber *) ((char *) xldata + SizeOfHashVacuumOnePage); - if ((unend - unused) > 0) -PageIndexMultiDelete(page, unused, unend - unused); + PageIndexMultiDelete(page, unused, xldata->ntuples); } /* diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c index 8640e85..8699b5b 100644 --- a/src/backend/access/hash/hashinsert.c +++ b/src/backend/access/hash/hashinsert.c @@ -344,7 +344,6 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf, Page page = BufferGetPage(buf); HashPageOpaque pageopaque; HashMetaPage metap; - double tuples_removed = 0; /* Scan each tuple in page to see if it is marked as LP_DEAD */ maxoff = PageGetMaxOffsetNumber(page); @@ -355,10 +354,7 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf, ItemId itemId = PageGetItemId(page, offnum); if (ItemIdIsDead(itemId)) - { deletable[ndeletable++] = off
Re: [HACKERS] Supporting huge pages on Windows
Hi, On Fri, Mar 24, 2017 at 9:00 PM, David Steele wrote: > Hi Ashutosh, > > On 3/22/17 8:52 AM, Amit Kapila wrote: >> >> On Wed, Mar 22, 2017 at 12:07 AM, David Steele >> wrote: >>> >>> >>> Amit, Magnus, you are signed up as reviewers for this patch. Do you know >>> when you'll have a chance to take a look? >>> >> >> I have provided my feedback and I could not test it on my machine. I >> think Ashutosh Sharma has tested it. I can give it another look, but >> not sure if it helps. > > > I know you are not signed up as a reviewer, but have you taken a look at this > patch? Well, I had a quick look into the patch, but only because I wanted to test it, as I have a Windows setup. I never looked into the patch from a reviewer's perspective; the intention was to reverify the test results shared by Tsunakawa. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] pageinspect and hash indexes
Thanks for reviewing my patch. I have removed the extra white space. Attached are both the patches. >>> >>> Sorry, I have mistakenly attached wrong patch. Here are the correct >>> set of patches. >> >> The latest version of patches looks fine to me. > > I don't really like 0002. What about this, instead? > > --- a/contrib/pageinspect/hashfuncs.c > +++ b/contrib/pageinspect/hashfuncs.c > @@ -80,7 +80,8 @@ verify_hash_page(bytea *raw_page, int flags) > /* Check that page type is sane. */ > pagetype = pageopaque->hasho_flag & LH_PAGE_TYPE; > if (pagetype != LH_OVERFLOW_PAGE && pagetype != LH_BUCKET_PAGE && > -pagetype != LH_BITMAP_PAGE && pagetype != LH_META_PAGE) > +pagetype != LH_BITMAP_PAGE && pagetype != LH_META_PAGE && > +pagetype != LH_UNUSED_PAGE) > ereport(ERROR, > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > errmsg("invalid hash page type %08x", pagetype))); > > The advantage of that is (1) it won't get confused if in the future we > have an unused page that has some flag bit not in LH_PAGE_TYPE set, > and (2) if in the future we want to add any other checks to this > function which should apply to unused pages also, they won't get > bypassed by an early return statement. Agreed. Moreover, previous approach might even change the current behaviour of functions like hash_page_items() and hash_page_stats() basically when we pass UNUSED pages to these functions. Attached is the newer version of patch. From 5c54fa58895cd279ff44424a3eb1feafdaca69d5 Mon Sep 17 00:00:00 2001 From: ashu Date: Sat, 25 Mar 2017 01:02:35 +0530 Subject: [PATCH] Allow pageinspect to handle UNUSED hash pages v3 --- contrib/pageinspect/hashfuncs.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c index 812a03f..2632287 100644 --- a/contrib/pageinspect/hashfuncs.c +++ b/contrib/pageinspect/hashfuncs.c @@ -80,7 +80,8 @@ verify_hash_page(bytea *raw_page, int flags) /* Check that page type is sane. 
*/ pagetype = pageopaque->hasho_flag & LH_PAGE_TYPE; if (pagetype != LH_OVERFLOW_PAGE && pagetype != LH_BUCKET_PAGE && - pagetype != LH_BITMAP_PAGE && pagetype != LH_META_PAGE) + pagetype != LH_BITMAP_PAGE && pagetype != LH_META_PAGE && + pagetype != LH_UNUSED_PAGE) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("invalid hash page type %08x", pagetype))); -- 1.8.3.1
Re: [HACKERS] Add pgstathashindex() to get hash index table statistics.
On Sat, Mar 25, 2017 at 11:02 AM, Amit Kapila wrote: > On Thu, Mar 23, 2017 at 11:24 PM, Ashutosh Sharma > wrote: >> Hi, >> >> On Tue, Feb 7, 2017 at 9:23 AM, Robert Haas wrote: >>> On Mon, Feb 6, 2017 at 10:40 PM, Amit Kapila >>> wrote: >>>>> Maybe we should call them "unused pages". >>>> >>>> +1. If we consider some more names for that column then probably one >>>> alternative could be "empty pages". >>> >>> Yeah, but I think "unused" might be better. Because a page could be >>> in use (as an overflow page or primary bucket page) and still be >>> empty. >>> >> >> Based on the earlier discussions, I have prepared a patch that would >> allow pgstathashindex() to show the number of unused pages in hash >> index. Please find the attached patch for the same. Thanks. >> > > else if (opaque->hasho_flag & LH_BITMAP_PAGE) > stats.bitmap_pages++; > + else if (PageIsEmpty(page)) > + stats.unused_pages++; > > I think having this check after PageIsNew() makes more sense then > having at the place where you currently have, I don't think having it just below PageIsNew() check would be good. The reason being, there can be a bucket page which is in use but still be empty. So, if we add a check just below PageIsNew check then even though a page is marked as bucket page and is empty it will be shown as unused page which i feel is not correct. Here is one simple example that illustrates this point, Query: == select hash_page_type(get_raw_page('con_hash_index', 17)); gdb shows following info for this block number 17, (gdb) p *(PageHeader)page $1 = {pd_lsn = {xlogid = 0, xrecoff = 122297104}, pd_checksum = 0, pd_flags = 0, pd_lower = 24,pd_upper = 8176, pd_special = 8176, pd_pagesize_version = 8196, pd_prune_xid = 0,pd_linp = 0x22a82d8} (gdb) p((HashPageOpaque) PageGetSpecialPointer(page))->hasho_flag $2 = 66 (gdb) p(((HashPageOpaque) PageGetSpecialPointer(page))->hasho_flag & LH_BUCKET_PAGE) $3 = 2 >From above information it is clear that this page is a bucket page and is empty. 
I think it should be shown as a bucket page rather than an unused page. Also, this has already been discussed in [1]. > other than that code-wise your patch looks okay, although I haven't > tested it. I have tested it properly and didn't find any issues. > I think this should also be tracked under PostgreSQL 10 open items, > but as this is not directly a bug, so not sure, what do others think? Well, even I am not sure whether this has to be added to the open items list or not. Thanks. [1] - https://www.postgresql.org/message-id/CA%2BTgmobwLKpKe99qnTJCBzFB4UaVGKrLNX3hwp8wcxObMDx7pA%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] comments in hash_alloc_buckets
>> While working on - [1], I realised that the following comments in >> _hash_alloc_buckets() need to be corrected. >> >> /* >> * Initialize the freed overflow page. Just zeroing the page won't work, >> * See _hash_freeovflpage for similar usage. >> */ >> _hash_pageinit(page, BLCKSZ); >> >> Attached is the patch that corrects the above comment. Thanks. >> > > - * Initialize the freed overflow page. Just zeroing the page won't work, > + * Initialize the last page in hash index. > > I think saying ".. last page in hash index" sounds slightly awkward as > this is the last page for current split point, how about just > "Initialize the page. ..." Yes, just saying "Initialize the page. ..." is simpler and correct. Attached is the patch with that comment. diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c index 622cc4b..4fd9cbc 100644 --- a/src/backend/access/hash/hashpage.c +++ b/src/backend/access/hash/hashpage.c @@ -1002,7 +1002,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks) page = (Page) zerobuf; /* - * Initialize the freed overflow page. Just zeroing the page won't work, + * Initialize the page. Just zeroing the page won't work, * See _hash_freeovflpage for similar usage. */ _hash_pageinit(page, BLCKSZ);
Re: [HACKERS] Add pgstathashindex() to get hash index table statistics.
>>>>>> >>>>>> +1. If we consider some more names for that column then probably one >>>>>> alternative could be "empty pages". >>>>> >>>>> Yeah, but I think "unused" might be better. Because a page could be >>>>> in use (as an overflow page or primary bucket page) and still be >>>>> empty. >>>>> >>>> >>>> Based on the earlier discussions, I have prepared a patch that would >>>> allow pgstathashindex() to show the number of unused pages in hash >>>> index. Please find the attached patch for the same. Thanks. >>>> >>> >>> else if (opaque->hasho_flag & LH_BITMAP_PAGE) >>> stats.bitmap_pages++; >>> + else if (PageIsEmpty(page)) >>> + stats.unused_pages++; >>> >>> I think having this check after PageIsNew() makes more sense than >>> having it at the place where you currently have it, >> >> I don't think having it just below the PageIsNew() check would be good. >> The reason being, there can be a bucket page which is in use but still >> be empty. So, if we add a check just below the PageIsNew check then even >> though a page is marked as a bucket page and is empty it will be shown >> as an unused page, which I feel is not correct. Here is one simple example >> that illustrates this point, >> > > oh, is it a page where all the items have been deleted and no new > items have been inserted? Yes, it is a page from which items have been removed and no new insertion has happened. > The reason why I have told that place is not appropriate is because > all the other checks in nearby code are on the page type and this > check looks odd at that place, but we might need to do this way if > there is no other clean solution. I got your point, but then I think that is the only solution we have as of now. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint
Hi, > testing with master as of cf366e97ff, sqlsmith occasionally triggers the > following assertion: > > TRAP: FailedAssertion("!(LWLockHeldByMe(((LWLock*) > (&(bufHdr)->content_lock", File: "bufmgr.c", Line: 3397) > > Backtraces always look like the one below. It is reproducible on a > cluster once it happens. I could provide a tarball if needed. > > regards, > Andreas > > #2 0x008324b1 in ExceptionalCondition > (conditionName=conditionName@entry=0x9e4e28 "!(LWLockHeldByMe(((LWLock*) > (&(bufHdr)->content_lock", errorType=errorType@entry=0x87b03d > "FailedAssertion", fileName=fileName@entry=0x9e5856 "bufmgr.c", > lineNumber=lineNumber@entry=3397) at assert.c:54 > #3 0x00706971 in MarkBufferDirtyHint (buffer=2844, > buffer_std=buffer_std@entry=1 '\001') at bufmgr.c:3397 > #4 0x004b3ecd in _hash_kill_items (scan=scan@entry=0x66dcf70) at > hashutil.c:514 > #5 0x004a9c1b in hashendscan (scan=0x66dcf70) at hash.c:512 > #6 0x004cf17a in index_endscan (scan=0x66dcf70) at indexam.c:353 > #7 0x0061fa51 in ExecEndIndexScan (node=0x3093f30) at > nodeIndexscan.c:852 > #8 0x00608e59 in ExecEndNode (node=) at > execProcnode.c:715 > #9 0x006045b8 in ExecEndPlan (estate=0x3064000, planstate= out>) at execMain.c:1540 > #10 standard_ExecutorEnd (queryDesc=0x30cb880) at execMain.c:487 > #11 0x005c87b0 in PortalCleanup (portal=0x1a60060) at portalcmds.c:302 > #12 0x0085cbb3 in PortalDrop (portal=0x1a60060, > isTopCommit=) at portalmem.c:489 > #13 0x00736ed2 in exec_simple_query (query_string=0x315b7a0 "...") at > postgres.c: > #14 0x00738b51 in PostgresMain (argc=, > argv=argv@entry=0x1a6c6c8, dbname=, username=) > at postgres.c:4071 > #15 0x00475fef in BackendRun (port=0x1a65b90) at postmaster.c:4317 > #16 BackendStartup (port=0x1a65b90) at postmaster.c:3989 > #17 ServerLoop () at postmaster.c:1729 > #18 0x006c8662 in PostmasterMain (argc=argc@entry=4, > argv=argv@entry=0x1a3f540) at postmaster.c:1337 > #19 0x0047729d in main (argc=4, argv=0x1a3f540) at main.c:228 > Hi, Thanks for 
reporting this problem. Could you please let me know how long you had to run sqlsmith to get this crash? I have found the reason for it: it happens when retrieving tuples through a cursor. The current hash index scan works tuple-at-a-time, which means that once it finds a tuple on a page, it releases the lock on that page but keeps the pin, and returns the tuple. Once the requested number of tuples has been processed, there is no lock on the page being scanned, but there is still a pin on it. Finally, when the cursor is closed at the end of the scan, if any killed tuples have been identified we first try to mark those items as dead with the help of _hash_kill_items(). But since we only hold a pin on the page, the assertion check 'LWLockHeldByMe()' fails. When scanning tuples with a normal SELECT * statement, before moving to the next page in a bucket we deal with all the killed items without releasing the lock and pin on the current page; hence, with plain SELECT queries this crash is not visible. The attached patch fixes this. But please note that all these changes will be removed by the patch for page scan mode - [1].
[1] - https://www.postgresql.org/message-id/CA%2BTgmobYTvexcjqMhXoNCyEUHChzmdC_2xVGgj7eqaYVgoJA%2Bg%40mail.gmail.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 34cc08f..0bacef8 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -478,7 +478,7 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 
 	/* Before leaving current page, deal with any killed items */
 	if (so->numKilled > 0)
-		_hash_kill_items(scan);
+		_hash_kill_items(scan, false);
 
 	_hash_dropscanbuf(rel, so);
 
@@ -509,7 +509,7 @@ hashendscan(IndexScanDesc scan)
 
 	/* Before leaving current page, deal with any killed items */
 	if (so->numKilled > 0)
-		_hash_kill_items(scan);
+		_hash_kill_items(scan, false);
 
 	_hash_dropscanbuf(rel, so);
 
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 2d92049..414cc6a 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -467,7 +467,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 
 	/* Before leaving current page, deal with any killed items */
 	if (so->numKilled > 0)
-		_hash_kill_items(scan);
+		_hash_kill_items(scan, true);
 
 	/*
 	 * ran off the end of this page, try
Re: [HACKERS] Page Scan Mode in Hash Index
Hi,

> I think you should consider refactoring this so that it doesn't need
> to use goto. Maybe move the while (offnum <= maxoff) logic into a
> helper function and have it return itemIndex. If itemIndex == 0, you
> can call it again.

Okay, I have added a helper function for _hash_readpage(). Please check the v4 patch attached with this mail.

> Notice that the code in the "else" branch of the if (itemIndex == 0)
> stuff could actually be removed from the else block without changing
> anything, because the "if" block always either returns or does a goto.
> That's worthy of a little more work to try to make things simple and
> clear.

Corrected.

> + * We find the first item(or, if backward scan, the last item) in
>
> Missing space.

Corrected.

> In _hash_dropscanbuf(), the handling of hashso_bucket_buf is now
> inconsistent with the handling of hashso_split_bucket_buf, which seems
> suspicious.

I have corrected it.

> Suppose we enter this function with so->hashso_bucket_buf
> and so->currPos.buf both being valid buffers, but not the same one.
> It looks to me as if we'll release the first pin but not the second
> one.

Yes, that is because we do not need to release the pin on currPos.buf if it is an overflow page. In page scan mode, once we have saved all the qualified tuples from the current page (which could be an overflow page) into the items array, we release both the pin and the lock on that overflow page. This was not true earlier: consider a case where we fetch only a fixed number of tuples from a page using a cursor. Once the requested number of tuples had been fetched, we simply returned without releasing the pin on the page being scanned. If that page was an overflow page, then by the end of the scan we would end up holding pins on two buffers, i.e. the bucket buf and the current buf (the overflow buf).
> My guess (which could be wrong) is that so->hashso_bucket_buf =
> InvalidBuffer should be moved back up higher in the function where it
> was before, just after the first if statement, and that the new
> condition so->hashso_bucket_buf == so->currPos.buf on the last
> _hash_dropbuf() should be removed. If that's not right, then I think
> I need somebody to explain why not.

Okay. As I mentioned above, in page scan mode we only keep a pin on the bucket buf; there is no case where we would still be holding a pin on an overflow buf at the end of the scan. So _hash_dropscanbuf() should only attempt to release the pin on the bucket buf, and an attempt to release the pin on so->currPos.buf should only happen when 'so->hashso_bucket_buf == so->currPos.buf' or when 'so->hashso_split_bucket_buf == so->currPos.buf'.

> I am suspicious that _hash_kill_items() is going to have problems if
> the overflow page is freed before it reacquires the lock.
> _bt_killitems() contains safeguards against similar cases.

I have added a similar check to _hash_kill_items() as well.

> This is not a full review, but I'm out of time for the moment.

No worries. I will be ready for your further review comments any time. Thanks for the review.
-- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From 3d0273f503d1645d6289bda78946a0af4b9e9f3a Mon Sep 17 00:00:00 2001 From: ashu Date: Mon, 27 Mar 2017 18:22:15 +0530 Subject: [PATCH] Rewrite hash index scans to work a page at a timev4 Patch by Ashutosh Sharma --- src/backend/access/hash/README | 9 +- src/backend/access/hash/hash.c | 124 ++ src/backend/access/hash/hashpage.c | 17 +- src/backend/access/hash/hashsearch.c | 445 +++ src/backend/access/hash/hashutil.c | 42 +++- src/include/access/hash.h| 46 +++- 6 files changed, 515 insertions(+), 168 deletions(-) diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README index 1541438..f0a7bdf 100644 --- a/src/backend/access/hash/README +++ b/src/backend/access/hash/README @@ -243,10 +243,11 @@ The reader algorithm is: -- then, per read request: reacquire content lock on current page step to next page if necessary (no chaining of content locks, but keep - the pin on the primary bucket throughout the scan; we also maintain - a pin on the page currently being scanned) - get tuple - release content lock + the pin on the primary bucket throughout the scan) +save all the matching tuples from current index page into an items array +release pin and content lock (but if it is primary bucket page retain +it's pin till the end of scan) +get tuple from an item array -- at scan shutdown: release all pins still held diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index 0bacef8..bd2827a 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@
Re: [HACKERS] segfault in hot standby for hash indexes
>>>> I think it would have been probably okay to use *int* for ntuples as
>>>> that matches with what you are actually assigning in the function.
>>>
>>> okay, corrected it. Attached is newer version of patch.
>>
>> Thanks, this version looks good to me.
>
> It solves the problem for me.

Great, thanks for confirming.

> I'd like to test that I get the right answer on the standby, not just
> the absence of a crash, but I don't know how to do that effectively.
> Has anyone used the new wal replay block consistency tool on hash
> indexes since this microvacuum code was committed?

Yes, I have used it for hash indexes, but only after the below commit:

commit 9abbf4727de746222ad8fc15b17348065389ae43
Author: Robert Haas
Date: Mon Mar 20 15:55:27 2017 -0400

    Another fix for single-page hash index vacuum.

    The WAL consistency checking code needed to be updated for the new
    page status bit, but that didn't get done previously.

All I did was set wal_consistency_checking = 'hash' in postgresql.conf and run the test cases on the primary server. If there is any inconsistent block on the standby, the tool terminates the recovery process and you will see the following message in the standby's server logfile:

"inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u"

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
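For anyone wanting to reproduce this check, the setup described above boils down to a single developer GUC on the primary (a postgresql.conf sketch; 'all' would check every resource manager at the cost of extra WAL volume):

```
# postgresql.conf on the primary
wal_consistency_checking = 'hash'   # or 'all' for every resource manager
```

Each WAL record for the listed resource managers then carries a full-page image that the standby compares against the page it reconstructed during replay.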
Re: [HACKERS] segfault in hot standby for hash indexes
On Mar 27, 2017 22:25, "Robert Haas" wrote: On Fri, Mar 24, 2017 at 3:49 AM, Amit Kapila wrote: > On Fri, Mar 24, 2017 at 12:25 PM, Ashutosh Sharma wrote: >>> >>> I think it would have been probably okay to use *int* for ntuples as >>> that matches with what you are actually assigning in the function. >> >> okay, corrected it. Attached is newer version of patch. > > Thanks, this version looks good to me. Committed. Thank you.
Re: [HACKERS] [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint
On Mar 28, 2017 00:00, "Andreas Seltenreich" wrote:

Ashutosh Sharma writes:
>>> TRAP: FailedAssertion("!(LWLockHeldByMe(((LWLock*) (&(bufHdr)->content_lock", File: "bufmgr.c", Line: 3397)
>> Thanks for reporting this problem. Could you please let me know how
>> long you had to run sqlsmith to get this crash.
>
> I got about one TRAP per hour when testing on 20 nodes with one
> postgres and 5 sqlsmithes on each. Nodes are tiny consumer machines
> with low-power 4-core sandy bridges.

Okay, thanks for sharing this information. I tried running sqlsmith on a single node for an hour but didn't find any crash. Let me also try running multiple sqlsmith instances at a time - maybe 5 on a single node, like you are doing.

> [2. reacquire_lock_hashkillitems_if_required.patch]
>
> I'll test with your patch applied as soon as time permits and report
> back.

Sure, thanks a lot.

With Regards, Ashutosh Sharma
Re: [HACKERS] Page Scan Mode in Hash Index
>> I think you should consider refactoring this so that it doesn't need
>> to use goto. Maybe move the while (offnum <= maxoff) logic into a
>> helper function and have it return itemIndex. If itemIndex == 0, you
>> can call it again.
>
> Okay, I have added a helper function for _hash_readpage(). Please
> check the v4 patch attached with this mail.
>
>> This is not a full review, but I'm out of time for the moment.
>
> No worries. I will be ready for your further review comments any time.
> Thanks for the review.

> This patch needs a rebase.

Please try applying these patches on top of [1]; I think you should be able to apply them cleanly. Sorry, I think I forgot to mention this point in my earlier mail.

[1] - https://www.postgresql.org/message-id/CAE9k0P%3DV2LhtyeMXd295fhisp%3DNWUhRVJ9EZQCDowWiY9rSohQ%40mail.gmail.com
[HACKERS] inconsistent page found on STANDBY server
Hi All,

While running make installcheck against a master with WAL consistency checking enabled, an inconsistent page is detected on the standby. I could see the following FATAL message in the standby server's logfile:

2017-03-30 07:31:10.101 BST [27994] LOG: entering standby mode
2017-03-30 07:31:10.106 BST [27994] LOG: redo starts at 0/224
2017-03-30 07:31:10.108 BST [27994] LOG: consistent recovery state reached at 0/2E4
2017-03-30 07:31:10.108 BST [27992] LOG: database system is ready to accept read only connections
2017-03-30 07:31:10.113 BST [27998] LOG: started streaming WAL from primary at 0/300 on timeline 1
2017-03-30 07:33:19.040 BST [27994] FATAL: inconsistent page found, rel 1663/13157/16391, forknum 0, blkno 0
2017-03-30 07:33:19.040 BST [27994] CONTEXT: WAL redo at 0/351CF03C for Hash/UPDATE_META_PAGE: ntuples -nan
2017-03-30 07:33:19.041 BST [27992] LOG: startup process (PID 27994) exited with exit code 1

Steps to reproduce:
===
1) PG v10 sources
2) Set up master/slave replication
3) Run make installcheck on the master
4) Check the database logs generated in the slave's data directory

Please note that the above issue is observed only on a 32-bit Linux machine. It was reported to me offlist by Tushar Ahuja, who also allowed me to use his 32-bit Linux machine to analyse the problem. I also had a small offlist discussion with Amit (included in this mail) while analysing it.

RCA:
After debugging the hash index deletion code, I found that when registering data for the xlog record 'XLOG_HASH_UPDATE_META_PAGE' we are not passing the correct length of the data being registered, and therefore the data (xl_hash_update_meta_page) is not completely recorded into the WAL record.

Fix:
===
The attached patch fixes this issue.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 34cc08f..a5798bd 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -691,7 +691,7 @@ loop_top:
 	xlrec.ntuples = metap->hashm_ntuples;
 
 	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, sizeof(SizeOfHashUpdateMetaPage));
+	XLogRegisterData((char *) &xlrec, SizeOfHashUpdateMetaPage);
 
 	XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: pageinspect / add page_checksum and bt_page_items(bytea)
Hi,

On Wed, Mar 29, 2017 at 8:38 PM, Tomas Vondra wrote:
> On 03/24/2017 04:27 AM, Peter Eisentraut wrote:
>> On 3/17/17 18:35, Tomas Vondra wrote:
>>> On 03/17/2017 05:23 PM, Peter Eisentraut wrote:
>>>> I'm struggling to find a good way to share code between
>>>> bt_page_items(text, int4) and bt_page_items(bytea).
>>>>
>>>> If we do it via the SQL route, as I had suggested, it makes the
>>>> extension non-relocatable, and it will also create a bit of a mess
>>>> during upgrades.
>>>>
>>>> If doing it in C, it will be a bit tricky to pass the SRF context
>>>> around. There is no "DirectFunctionCall within SRF context", AFAICT.
>>>
>>> Not sure what it has to do with DirectFunctionCall? You want to call
>>> the bytea variant from the existing one? Wouldn't it be easier to
>>> simply define a static function with the shared parts, and pass
>>> around the fctx/fcinfo? Not quite pretty, but should work.
>>
>> Perhaps what was added in
>> <http://git.postgresql.org/pg/commitdiff/29bf5016835a2c2c23786f7cda347716c083d95f>
>> would actually work here.
>
> I've tried to refactor the code using this, but the result was rather
> ugly because (a) it really is quite tricky to pass around the contexts
> and (b) the sanity checks are quite different for the two input types,
> so mixing those parts (essentially the SRF_IS_FIRSTCALL bits) does not
> make much sense IMHO.
>
> The attached patch is the best I came up with - it essentially shares
> just the tuple-forming part, which is exactly the same in both cases.
>
> It also adds the P_ISMETA(opaque) check to the original function, which
> seems like a useful defense against a page written to a different place
> (which is essentially the issue I was originally investigating).

It seems that the latest patch (v4) shared by Tomas upthread is empty. If I am not mistaken, please share the correct patch. Thanks.
-- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint
On Sat, Apr 1, 2017 at 1:08 AM, Robert Haas wrote:
> On Mon, Mar 27, 2017 at 5:39 AM, Ashutosh Sharma wrote:
>> Thanks for reporting this problem. Could you please let me know how
>> long you had to run sqlsmith to get this crash. Meanwhile, I have
>> found the reason for it. This is basically happening when trying to
>> retrieve the tuples using a cursor. The current hash index scan
>> works tuple-at-a-time, which means once it finds a tuple on a page,
>> it releases the lock on the page but keeps the pin, and finally
>> returns the tuple. When the requested number of tuples has been
>> processed, there is no lock on the page that was being scanned, but
>> there is still a pin on it. Finally, when trying to close the cursor
>> at the end of the scan, if any killed tuples have been identified we
>> first try to mark those items as dead with the help of
>> _hash_kill_items(). But since we only have a pin on this page, the
>> assert check 'LWLockHeldByMe()' fails.
>
> Instead of adding bool haveLock to _hash_kill_items, how about just
> having the caller acquire the lock before calling the function and
> release it afterwards? Since the acquire is at the beginning of the
> function and the release at the end, there seems to be no point in
> putting the acquire/release inside the function rather than in the
> caller.

Well, that is another option, but the main idea was to stay in line with the btree code. Moreover, I am not sure if acquiring an lwlock inside hashendscan (basically the place where we are closing down the scan) would look good.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint
On Sat, Apr 1, 2017 at 6:00 AM, Robert Haas wrote:
> On Fri, Mar 31, 2017 at 6:09 PM, Ashutosh Sharma wrote:
>> Well, that is another option, but the main idea was to stay in line
>> with the btree code.
>
> That's not a bad goal in principle, but _bt_killitems() doesn't have
> any similar argument.

It was there, but it later got removed by some patch (maybe the patch for reducing pinning and buffer content locks for btree scans). The following commit has these changes:

commit 09cb5c0e7d6fbc9dee26dc429e4fc0f2a88e5272
Author: Tom Lane
Date: Sun May 7 01:21:30 2006 +

> (Also, speaking of consistency, why did we end up with
> _hash_kill_items, with an underscore between kill and items, and
> _bt_killitems, without one?)

That is just to follow the naming convention in hash.h. Please check the review comments for this at [1].

>> Moreover, I am not sure if acquiring an lwlock inside
>> hashendscan (basically the place where we are closing down the scan)
>> would look good.
>
> Well, if that's not a good thing to do, hiding it inside some other
> function doesn't make it better.

Okay, agreed. I will submit the patch very shortly.

[1] - https://www.postgresql.org/message-id/cf6abd8a-77b5-f1c7-8e50-5bef461e0522%40redhat.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint
Hi,

On Sat, Apr 1, 2017 at 7:06 AM, Ashutosh Sharma wrote:
> On Sat, Apr 1, 2017 at 6:00 AM, Robert Haas wrote:
>> On Fri, Mar 31, 2017 at 6:09 PM, Ashutosh Sharma wrote:
>>> Well, that is another option, but the main idea was to stay in line
>>> with the btree code.
>>
>> That's not a bad goal in principle, but _bt_killitems() doesn't have
>> any similar argument.
>
> It was there, but it later got removed by some patch (maybe the patch
> for reducing pinning and buffer content locks for btree scans). The
> following commit has these changes:
>
> commit 09cb5c0e7d6fbc9dee26dc429e4fc0f2a88e5272
> Author: Tom Lane
> Date: Sun May 7 01:21:30 2006 +
>
>> (Also, speaking of consistency, why did we end up with
>> _hash_kill_items, with an underscore between kill and items, and
>> _bt_killitems, without one?)
>
> That is just to follow the naming convention in hash.h. Please check
> the review comments for this at [1].
>
>>> Moreover, I am not sure if acquiring an lwlock inside
>>> hashendscan (basically the place where we are closing down the scan)
>>> would look good.
>>
>> Well, if that's not a good thing to do, hiding it inside some other
>> function doesn't make it better.
>
> Okay, agreed. I will submit the patch very shortly.

As suggested, I am now acquiring the lock inside the caller function. Attached is the patch with these changes. Thanks.
> [1] - https://www.postgresql.org/message-id/cf6abd8a-77b5-f1c7-8e50-5bef461e0522%40redhat.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 34cc08f..b835f77 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -476,9 +476,17 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* Before leaving current page, deal with any killed items */
+	/*
+	 * Before leaving current page, deal with any killed items.
+	 * Also, ensure that we acquire lock on current page before
+	 * calling _hash_kill_items.
+	 */
 	if (so->numKilled > 0)
+	{
+		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
 		_hash_kill_items(scan);
+		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+	}
 
 	_hash_dropscanbuf(rel, so);
 
@@ -507,9 +515,17 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* Before leaving current page, deal with any killed items */
+	/*
+	 * Before leaving current page, deal with any killed items.
+	 * Also, ensure that we acquire lock on current page before
+	 * calling _hash_kill_items.
+	 */
 	if (so->numKilled > 0)
+	{
+		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
 		_hash_kill_items(scan);
+		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+	}
 
 	_hash_dropscanbuf(rel, so);

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Page Scan Mode in Hash Index
Hi, > > On 03/29/2017 09:16 PM, Ashutosh Sharma wrote: >>> >>> This patch needs a rebase. >> >> >> Please try applying these patches on top of [1]. I think you should be >> able >> to apply it cleanly. Sorry, I think I forgot to mention this point in my >> earlier mail. >> >> [1] - >> >> https://www.postgresql.org/message-id/CAE9k0P%3DV2LhtyeMXd295fhisp%3DNWUhRVJ9EZQCDowWiY9rSohQ%40mail.gmail.com >> > > Thanks, that works ! > > As you have provided a patch for Robert's comments, and no other review have > been posted I'm moving this patch to "Ready for Committer" for additional > committer feedback. Please find the attached new version of patches for page scan mode in hash index rebased on top of - [1]. [1] - https://www.postgresql.org/message-id/CAE9k0P%3D3rtgUtxopG%2BXQVxgASFzAnGd9sNmYUaj_%3DKeVsKGUdA%40mail.gmail.com From 498723199f4b14ff9917aca13abf30f9ea261ca7 Mon Sep 17 00:00:00 2001 From: ashu Date: Sat, 1 Apr 2017 12:09:46 +0530 Subject: [PATCH] Rewrite hash index scans to work a page at a timev5 Patch by Ashutosh Sharma --- src/backend/access/hash/README | 9 +- src/backend/access/hash/hash.c | 140 ++- src/backend/access/hash/hashpage.c | 17 +- src/backend/access/hash/hashsearch.c | 441 +++ src/backend/access/hash/hashutil.c | 29 ++- src/include/access/hash.h| 44 6 files changed, 509 insertions(+), 171 deletions(-) diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README index 1541438..f0a7bdf 100644 --- a/src/backend/access/hash/README +++ b/src/backend/access/hash/README @@ -243,10 +243,11 @@ The reader algorithm is: -- then, per read request: reacquire content lock on current page step to next page if necessary (no chaining of content locks, but keep - the pin on the primary bucket throughout the scan; we also maintain - a pin on the page currently being scanned) - get tuple - release content lock + the pin on the primary bucket throughout the scan) +save all the matching tuples from current index page into an items array +release pin 
and content lock (but if it is primary bucket page retain +it's pin till the end of scan) +get tuple from an item array -- at scan shutdown: release all pins still held diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index b835f77..bd2827a 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@ -268,66 +268,23 @@ bool hashgettuple(IndexScanDesc scan, ScanDirection dir) { HashScanOpaque so = (HashScanOpaque) scan->opaque; - Relation rel = scan->indexRelation; - Buffer buf; - Page page; OffsetNumber offnum; - ItemPointer current; bool res; + HashScanPosItem *currItem; /* Hash indexes are always lossy since we store only the hash code */ scan->xs_recheck = true; /* - * We hold pin but not lock on current buffer while outside the hash AM. - * Reacquire the read lock here. - */ - if (BufferIsValid(so->hashso_curbuf)) - LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE); - - /* * If we've already initialized this scan, we can just advance it in the * appropriate direction. If we haven't done so yet, we call a routine to * get the first item in the scan. */ - current = &(so->hashso_curpos); - if (ItemPointerIsValid(current)) + if (!HashScanPosIsValid(so->currPos)) + res = _hash_first(scan, dir); + else { /* - * An insertion into the current index page could have happened while - * we didn't have read lock on it. Re-find our position by looking - * for the TID we previously returned. (Because we hold a pin on the - * primary bucket page, no deletions or splits could have occurred; - * therefore we can expect that the TID still exists in the current - * index page, at an offset >= where we were.) - */ - OffsetNumber maxoffnum; - - buf = so->hashso_curbuf; - Assert(BufferIsValid(buf)); - page = BufferGetPage(buf); - - /* - * We don't need test for old snapshot here as the current buffer is - * pinned, so vacuum can't clean the page. 
- */ - maxoffnum = PageGetMaxOffsetNumber(page); - for (offnum = ItemPointerGetOffsetNumber(current); - offnum <= maxoffnum; - offnum = OffsetNumberNext(offnum)) - { - IndexTuple itup; - - itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum)); - if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid))) -break; - } - if (offnum > maxoffnum) - elog(ERROR, "failed to re-find scan position within index \"%s\"", - RelationGetRelationName(rel)); - ItemPointerSetOffsetNumber(current, offnum); - - /* * Check to see if we should kill the previously-fetched tuple. */ if (scan->kill_prior_tup
Re: [HACKERS] PATCH: pageinspect / add page_checksum and bt_page_items(bytea)
Hi Tomas,

On Fri, Mar 31, 2017 at 11:05 PM, Tomas Vondra wrote:
> On 03/31/2017 06:01 PM, Ashutosh Sharma wrote:
>> It seems that the latest patch (v4) shared by Tomas upthread is empty.
>> If I am not mistaken, please share the correct patch. Thanks.
>
> OMG, this is the second time I managed to generate an empty patch. I
> really need to learn not to do that ..
>
> Attached is v5 of the patch, hopefully correct this time ;-)

Yes, it is correct. I have spent some time reviewing your patch today, and the review comments are as follows.

1. It would be good to have a blank line in the following comment describing bt_page_print_tuples:

+/*
+ * bt_page_print_tuples
+ * form a tuple describing index tuple at a given offset
+ */

Something like this,

+/* -
+ * bt_page_print_tuples()
+ *
+ * Form a tuple describing index tuple at a given offset
+ * -
+ */

Please note that, if you do not agree with my suggestion, you may ignore it :)

2. I think the following comments for bt_page_items need to be moved to the right place in the code.

/*-
 * bt_page_items()
 *
 * Get IndexTupleData set in a btree page
 *
 * Usage: SELECT * FROM bt_page_items('t1_pkey', 1);
 *-
 */

/*
 * cross-call data structure for SRF
 */
struct user_args
{
	Page		page;
	OffsetNumber offset;
};

With your patch, the above structure definition is followed by the definition of bt_page_print_tuples(), not bt_page_items(); hence you may need to put the comments for bt_page_items just above its definition.

3. I am seeing a server crash when running the SQL statement 'SELECT * FROM bt_page_items('bt_i4_index', 1);'. Here is the backtrace. I think this crash is happening because in bt_page_items, without reading the opaque region of the page, you have added this if statement:

if (P_ISMETA(opaque))
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("block is a meta page")));

Please pull the above if statement down below the following line to get rid of this crash.
opaque = (BTPageOpaque) PageGetSpecialPointer(uargs->page);

OR, the best thing to do would be to retain the earlier if statement for checking the meta page, because that way you can avoid reading a buffer unnecessarily.

#0 0x7f592975030c in bt_page_items (fcinfo=0x7fffe5fdb860) at btreefuncs.c:352
352 if (P_ISMETA(opaque))
Missing separate debuginfos, use: debuginfo-install glibc-2.17-78.el7.x86_64
(gdb) bt
#0 0x7f592975030c in bt_page_items (fcinfo=0x7fffe5fdb860) at btreefuncs.c:352
#1 0x006797e0 in ExecMakeTableFunctionResult (setexpr=0x21269f8, econtext=0x2126738, argContext=0x2136c98, expectedDesc=0x2155178, randomAccess=0 '\000') at execSRF.c:230
#2 0x00689dda in FunctionNext (node=0x2126628) at nodeFunctionscan.c:94
#3 0x00678f0e in ExecScanFetch (node=0x2126628, accessMtd=0x689d23 <FunctionNext>, recheckMtd=0x68a108 <FunctionRecheck>) at execScan.c:95
#4 0x00678f7d in ExecScan (node=0x2126628, accessMtd=0x689d23 <FunctionNext>, recheckMtd=0x68a108 <FunctionRecheck>) at execScan.c:143

4. If you agree with the reasoning in comment #3, please remove the following if statement from bt_page_items():

if (P_ISMETA(opaque))
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("block is a meta page")));

Apart from the above comments, your patch looks good to me. I have also marked this patch as 'Waiting on Author' in the commitfest. Thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com