[HACKERS] Parallel query fails on standby server
Hi All,

While testing the parallel scan feature on a standby server, I found that a parallel query fails with the error "ERROR: failed to initialize transaction_read_only to 0". The following steps reproduce the issue:

Master:
edb=# create table ert(n int);
edb=# insert into ert values (generate_series(1,500));
edb=# analyze ert;
edb=# vacuum ert;

Slave:
edb=# set max_parallel_degree = 5;
SET
edb=# explain analyze verbose select * from ert where n <= 1000;
ERROR:  failed to initialize transaction_read_only to 0
CONTEXT:  parallel worker, PID 26042

Root cause analysis:

After debugging the worker, I observed that in RestoreGUCState(), if a GUC variable can't be skipped it is initialized with a default value. While the GUC variable "transaction_read_only" is being initialized, its check hook check_transaction_read_only() is called, which eventually fails due to the check below, which says "transaction_read_only" can't be set while recovery is in progress:

    if (RecoveryInProgress())
    {
        GUC_check_errcode(ERRCODE_FEATURE_NOT_SUPPORTED);
        GUC_check_errmsg("cannot set transaction read-write mode during recovery");
        return false;
    }

Solution:

Make use of the global variable "InitializingParallelWorker" to bypass the RecoveryInProgress() check while a parallel worker is being initialized. PFA patch to fix the issue.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

diff --git a/src/backend/commands/variable.c b/src/backend/commands/variable.c
index 903b3a6..c7173c9 100644
--- a/src/backend/commands/variable.c
+++ b/src/backend/commands/variable.c
@@ -482,11 +482,13 @@ show_log_timezone(void)
  * nothing since XactReadOnly will be reset by the next StartTransaction().
  * The IsTransactionState() test protects us against trying to check
  * RecoveryInProgress() in contexts where shared memory is not accessible.
+ * We can also skip the check for RecoveryInProgress while initializing the
+ * parallel workers by making use of the global variable InitializingParallelWorker.
  */
 bool
 check_transaction_read_only(bool *newval, void **extra, GucSource source)
 {
-	if (*newval == false && XactReadOnly && IsTransactionState())
+	if (*newval == false && XactReadOnly && IsTransactionState() && !InitializingParallelWorker)
 	{
 		/* Can't go to r/w mode inside a r/o transaction */
 		if (IsSubTransaction())

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Performance degradation in commit 6150a1b0
Hi All,

I have been working on this issue for the last few days, trying to investigate the probable reasons for the performance degradation at commit 6150a1b0. After going through Andres's patch for moving the buffer I/O and content locks out of the main tranche, the following two things come to my mind.

1. The content lock is no longer used as a pointer in the BufferDesc structure; instead the LWLock structure is embedded directly. This increases the overall structure size from 64 bytes to 80 bytes. To investigate this, I reverted the content-lock changes from commit 6150a1b0 and took at least 10 readings, and with this change the overall performance is similar to what was observed before commit 6150a1b0.

2. Secondly, the BufferDesc structure padding is 64 bytes, whereas the PG cache line alignment is 128 bytes. After changing the BufferDesc padding size to 128 bytes, along with the changes mentioned in point #1 above, the overall performance is again similar to what was observed before commit 6150a1b0.

Please have a look at the attached test report, which contains the performance test results for all the scenarios discussed above, and let me know your thoughts.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Sat, Feb 27, 2016 at 9:26 AM, Andres Freund wrote:
> On February 26, 2016 7:55:18 PM PST, Amit Kapila wrote:
> >On Sat, Feb 27, 2016 at 12:41 AM, Andres Freund wrote:
> >>
> >> Hi,
> >>
> >> On 2016-02-25 12:56:39 +0530, Amit Kapila wrote:
> >> > For the past few weeks, we were facing some performance degradation in
> >> > the read-only performance benchmarks on high-end machines. My colleague
> >> > Mithun tried reverting commit ac1d794, which seems to degrade the
> >> > performance in HEAD on high-end m/c's as reported previously [1], but
> >> > we were still getting degradation. Then we did some profiling to see
> >> > what caused it, and we found that it's mainly caused by the spin lock
> >> > when called via pin/unpin buffer. We then tried reverting commit
> >> > 6150a1b0, which recently changed the structures in that area, and it
> >> > turns out that after reverting that patch we don't see any degradation
> >> > in performance. The important point to note is that the performance
> >> > degradation doesn't occur every time, but if the tests are repeated
> >> > twice or thrice, it is easily visible.
> >> >
> >> > m/c details
> >> > IBM POWER-8
> >> > 24 cores, 192 hardware threads
> >> > RAM - 492GB
> >> >
> >> > Non-default postgresql.conf settings:
> >> > shared_buffers=16GB
> >> > max_connections=200
> >> > min_wal_size=15GB
> >> > max_wal_size=20GB
> >> > checkpoint_timeout=900
> >> > maintenance_work_mem=1GB
> >> > checkpoint_completion_target=0.9
> >> >
> >> > scale_factor - 300
> >> >
> >> > Performance at commit 43cd468cf01007f39312af05c4c92ceb6de8afd8 is
> >> > 469002 at 64-client count, and then at
> >> > 6150a1b08a9fe7ead2b25240be46dddeae9d98e1 it went down to 200807. These
> >> > performance numbers are the median of 3 15-min pgbench read-only
> >> > tests. Similar data is seen even when we revert the patch on the
> >> > latest commit. We have yet to perform detailed analysis as to why
> >> > commit 6150a1b08a9fe7ead2b25240be46dddeae9d98e1 led to degradation,
> >> > but any ideas are welcome.
> >>
> >> Ugh. Especially the varying performance is odd. Does it vary between
> >> restarts, or is it just happenstance? If it's the former, we might be
> >> dealing with some alignment issues.
> >>
> >
> >It varies between restarts.
> >
> >>
> >> If not, I wonder if the issue is massive buffer header contention. As a
> >> LL/SC architecture, acquiring the content lock might interrupt buffer
> >> spinlock acquisition and vice versa.
> >>
> >> Does applying the patch from
> >> http://archives.postgresql.org/message-id/CAPpHfdu77FUi5eiNb%2BjRPFh5S%2B1U%2B8ax4Zw%3DAUYgt%2BCPsKiyWw%40mail.gmail.com
> >> change the picture?
> >>
> >
> >Not tried, but if this is an alignment issue as you are suspecting above,
> >then does it make sense to try this out?
>
> It's the other theory I had. And it's additionally useful testing
> regardless of this regression...
>
> ---
> Please excuse brevity and formatting - I am writing this on my mobile
> phone.

Performance_Results.xlsx
Description: MS-Excel 2007 spreadsheet
Re: [HACKERS] Performance degradation in commit 6150a1b0
Hi,

I am getting some reject files while trying to apply "pinunpin-cas-5.patch" attached with the thread
http://www.postgresql.org/message-id/capphfdsrot1jmsnrnccqpnzeu9vut7tx6b-n1wyouwwfhd6...@mail.gmail.com

Note: I am applying this patch on top of commit "6150a1b08a9fe7ead2b25240be46dddeae9d98e1".

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Fri, Mar 25, 2016 at 9:29 AM, Amit Kapila wrote:
> On Wed, Mar 23, 2016 at 1:59 PM, Ashutosh Sharma wrote:
> >
> > Hi All,
> >
> > I have been working on this issue for the last few days, trying to
> > investigate the probable reasons for the performance degradation at
> > commit 6150a1b0. After going through Andres's patch for moving buffer I/O
> > and content lock out of the main tranche, the following two things come
> > to my mind.
> >
> > 1. The content lock is no longer used as a pointer in the BufferDesc
> > structure; instead the LWLock structure is embedded directly. This
> > increases the overall structure size from 64 bytes to 80 bytes. To
> > investigate this, I reverted the content-lock changes from commit
> > 6150a1b0 and took at least 10 readings, and with this change the overall
> > performance is similar to what was observed before commit 6150a1b0.
> >
> > 2. Secondly, the BufferDesc structure padding is 64 bytes whereas the PG
> > cache line alignment is 128 bytes. After changing the BufferDesc padding
> > size to 128 bytes, along with the changes mentioned in point #1 above,
> > the overall performance is again similar to what was observed before
> > commit 6150a1b0.
> >
> > Please have a look at the attached test report, which contains the
> > performance test results for all the scenarios discussed above, and let
> > me know your thoughts.
> >
>
> So this indicates that changing back the content lock to LWLock* in
> BufferDesc brings back the performance, which indicates that the increase
> in BufferDesc size to more than 64 bytes on this platform has caused the
> regression. I think it is worth trying the patch [1] as suggested by
> Andres, as that will reduce the size of BufferDesc, which can bring back
> the performance. Can you try the same?
>
> [1] - http://www.postgresql.org/message-id/capphfdsrot1jmsnrnccqpnzeu9vut7tx6b-n1wyouwwfhd6...@mail.gmail.com
>
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Performance degradation in commit 6150a1b0
Hi,

As mentioned in my earlier mail, I was not able to apply pinunpin-cas-5.patch on commit 6150a1b0; therefore I thought of applying it on the latest commit, and I was able to do so successfully. I have now taken the performance readings at the latest commit, i.e. 76281aa9, with and without applying pinunpin-cas-5.patch, and my observations are as follows:

1. The current performance still lags by 2-3% behind the expected performance when pinunpin-cas-5.patch is applied on commit 76281aa9.

2. When pinunpin-cas-5.patch is not applied and performance is measured at commit 76281aa9, the overall performance lags by 50-60% behind the expected performance.

Note: Here, the expected performance is the performance observed before commit 6150a1b0 with ac1d794 reverted.

Please refer to the attached performance report sheet for more insights.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Sat, Mar 26, 2016 at 9:31 PM, Ashutosh Sharma wrote:
> Hi,
>
> I am getting some reject files while trying to apply "pinunpin-cas-5.patch"
> attached with the thread
> http://www.postgresql.org/message-id/capphfdsrot1jmsnrnccqpnzeu9vut7tx6b-n1wyouwwfhd6...@mail.gmail.com
>
> Note: I am applying this patch on top of commit
> "6150a1b08a9fe7ead2b25240be46dddeae9d98e1".
>
> With Regards,
> Ashutosh Sharma
> EnterpriseDB: http://www.enterprisedb.com
>
> On Fri, Mar 25, 2016 at 9:29 AM, Amit Kapila wrote:
>
>> On Wed, Mar 23, 2016 at 1:59 PM, Ashutosh Sharma wrote:
>> >
>> > Hi All,
>> >
>> > I have been working on this issue for the last few days, trying to
>> > investigate the probable reasons for the performance degradation at
>> > commit 6150a1b0. After going through Andres's patch for moving buffer
>> > I/O and content lock out of the main tranche, the following two things
>> > come to my mind.
>> >
>> > 1. The content lock is no longer used as a pointer in the BufferDesc
>> > structure; instead the LWLock structure is embedded directly. This
>> > increases the overall structure size from 64 bytes to 80 bytes. To
>> > investigate this, I reverted the content-lock changes from commit
>> > 6150a1b0 and took at least 10 readings, and with this change the
>> > overall performance is similar to what was observed before commit
>> > 6150a1b0.
>> >
>> > 2. Secondly, the BufferDesc structure padding is 64 bytes whereas the
>> > PG cache line alignment is 128 bytes. After changing the BufferDesc
>> > padding size to 128 bytes, along with the changes mentioned in point #1
>> > above, the overall performance is again similar to what was observed
>> > before commit 6150a1b0.
>> >
>> > Please have a look at the attached test report, which contains the
>> > performance test results for all the scenarios discussed above, and let
>> > me know your thoughts.
>> >
>>
>> So this indicates that changing back the content lock to LWLock* in
>> BufferDesc brings back the performance, which indicates that the increase
>> in BufferDesc size to more than 64 bytes on this platform has caused the
>> regression. I think it is worth trying the patch [1] as suggested by
>> Andres, as that will reduce the size of BufferDesc, which can bring back
>> the performance. Can you try the same?
>>
>> [1] - http://www.postgresql.org/message-id/capphfdsrot1jmsnrnccqpnzeu9vut7tx6b-n1wyouwwfhd6...@mail.gmail.com
>>
>> With Regards,
>> Amit Kapila.
>> EnterpriseDB: http://www.enterprisedb.com

Performance_Results_with_pinunpin_cas_changes.xlsx
Description: MS-Excel 2007 spreadsheet
Re: [HACKERS] Performance degradation in commit 6150a1b0
Hi,

I am unable to revert 6150a1b0 on top of a recent commit on the master branch. It seems some commit made recently depends on 6150a1b0.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Sun, Mar 27, 2016 at 5:45 PM, Andres Freund wrote:
> Hi,
>
> On 2016-03-27 02:34:32 +0530, Ashutosh Sharma wrote:
> > As mentioned in my earlier mail I was not able to apply
> > pinunpin-cas-5.patch on commit 6150a1b0,
>
> That's not surprising; that's pretty old.
>
> > therefore I thought of applying it on the latest commit and I was
> > able to do it successfully. I have now taken the performance readings
> > at the latest commit, i.e. 76281aa9, with and without applying
> > pinunpin-cas-5.patch, and my observations are as follows:
> >
> > 1. I can still see that the current performance lags by 2-3% from the
> > expected performance when pinunpin-cas-5.patch is applied on commit
> > 76281aa9.
> > 2. When pinunpin-cas-5.patch is not applied and performance is measured
> > at commit 76281aa9, the overall performance lags by 50-60% from the
> > expected performance.
> >
> > Note: Here, the expected performance is the performance observed before
> > commit 6150a1b0 with ac1d794 reverted.
>
> Thanks for doing these benchmarks. What's the performance if you revert
> 6150a1b0 on top of a recent master? There've been a lot of other patches
> influencing performance since 6150a1b0, so minor performance differences
> aren't necessarily meaningful; especially when that older version then
> had other patches reverted.
>
> Thanks,
>
> Andres
Re: [HACKERS] "Some tests to cover hash_index"
Hi All,

I have re-verified the code coverage for the hash index code using the test file (commit-hash_coverage_test) attached to this mailing list, and have found that some of the code in the _hash_squeezebucket() function flow is not being covered. For this, I have added a small test case on top of the 'commit-hash_coverage_test' patch. I have done this mainly to test Amit's WAL-for-hash-index patch [1].

I have also removed the warning message we used to get for hash indexes, 'WARNING: hash indexes are not WAL-logged and their use is discouraged', as this message is no longer emitted for hash indexes after the WAL patch. Please have a look and let me know your thoughts.

[1] - https://www.postgresql.org/message-id/CAA4eK1JOBX%3DYU33631Qh-XivYXtPSALh514%2BjR8XeD7v%2BK3r_Q%40mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Sat, Aug 6, 2016 at 9:41 AM, Amit Kapila wrote:
> On Thu, Aug 4, 2016 at 7:24 PM, Mithun Cy wrote:
>
>> I am attaching the patch to improve some coverage of hash index code [1].
>> I have added some basic tests, which mainly cover overflow pages. It
>> took 2 sec extra time on my machine in the parallel schedule.
>>
>>                                            Hit    Total   Coverage
>> old tests        Line coverage              780    1478    52.7 %
>>                  Function coverage           63      85    74.1 %
>> after new tests  Line coverage             1181    1478    79.9 %
>>                  Function coverage           78      85    91.8 %
>>
>
> I think the code coverage improvement for hash index with these tests
> seems to be quite good; however, the time for the tests seems to be
> slightly on the higher side. Does anybody have a better suggestion for
> these tests?
>
> diff --git a/src/test/regress/sql/concurrent_hash_index.sql
> b/src/test/regress/sql/concurrent_hash_index.sql
> I wonder why you have included a new file for these tests; why can't
> these be added to the existing hash_index.sql?
>
> +--
> +-- Cause some overflow insert and splits.
> +--
> +CREATE TABLE con_hash_index_table (keycol INT);
> +CREATE INDEX con_hash_index on con_hash_index_table USING HASH (keycol);
>
> The relation name con_hash_index* chosen in the above tests doesn't seem
> appropriate; how about hash_split_heap* or something like that?
>
> Register your patch in the latest CF (https://commitfest.postgresql.org/10/)
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] "Some tests to cover hash_index"
Hi,

I missed attaching the patch in my previous mail. Here is the patch.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Tue, Aug 23, 2016 at 11:47 AM, Ashutosh Sharma wrote:
> Hi All,
>
> I have re-verified the code coverage for the hash index code using the
> test file (commit-hash_coverage_test) attached to this mailing list, and
> have found that some of the code in the _hash_squeezebucket() function
> flow is not being covered. For this, I have added a small test case on top
> of the 'commit-hash_coverage_test' patch. I have done this mainly to test
> Amit's WAL-for-hash-index patch [1].
>
> I have also removed the warning message we used to get for hash indexes,
> 'WARNING: hash indexes are not WAL-logged and their use is discouraged',
> as this message is no longer emitted for hash indexes after the WAL
> patch. Please have a look and let me know your thoughts.
>
> [1] - https://www.postgresql.org/message-id/CAA4eK1JOBX%3DYU33631Qh-XivYXtPSALh514%2BjR8XeD7v%2BK3r_Q%40mail.gmail.com
>
> With Regards,
> Ashutosh Sharma
> EnterpriseDB: http://www.enterprisedb.com
>
> On Sat, Aug 6, 2016 at 9:41 AM, Amit Kapila wrote:
>
>> On Thu, Aug 4, 2016 at 7:24 PM, Mithun Cy wrote:
>>
>>> I am attaching the patch to improve some coverage of hash index code [1].
>>> I have added some basic tests, which mainly cover overflow pages. It
>>> took 2 sec extra time on my machine in the parallel schedule.
>>>
>>>                                            Hit    Total   Coverage
>>> old tests        Line coverage              780    1478    52.7 %
>>>                  Function coverage           63      85    74.1 %
>>> after new tests  Line coverage             1181    1478    79.9 %
>>>                  Function coverage           78      85    91.8 %
>>>
>>
>> I think the code coverage improvement for hash index with these tests
>> seems to be quite good; however, the time for the tests seems to be
>> slightly on the higher side. Does anybody have a better suggestion for
>> these tests?
>>
>> diff --git a/src/test/regress/sql/concurrent_hash_index.sql
>> b/src/test/regress/sql/concurrent_hash_index.sql
>> I wonder why you have included a new file for these tests; why can't
>> these be added to the existing hash_index.sql?
>>
>> +--
>> +-- Cause some overflow insert and splits.
>> +--
>> +CREATE TABLE con_hash_index_table (keycol INT);
>> +CREATE INDEX con_hash_index on con_hash_index_table USING HASH (keycol);
>>
>> The relation name con_hash_index* chosen in the above tests doesn't seem
>> appropriate; how about hash_split_heap* or something like that?
>>
>> Register your patch in the latest CF (https://commitfest.postgresql.org/10/)
>>
>> --
>> With Regards,
>> Amit Kapila.
>> EnterpriseDB: http://www.enterprisedb.com

diff --git a/src/test/regress/expected/concurrent_hash_index.out b/src/test/regress/expected/concurrent_hash_index.out
index c3b8036..60191c0 100644
--- a/src/test/regress/expected/concurrent_hash_index.out
+++ b/src/test/regress/expected/concurrent_hash_index.out
@@ -3,7 +3,6 @@
 --
 CREATE TABLE con_hash_index_table (keycol INT);
 CREATE INDEX con_hash_index on con_hash_index_table USING HASH (keycol);
-WARNING: hash indexes are not WAL-logged and their use is discouraged
 INSERT INTO con_hash_index_table VALUES (1);
 INSERT INTO con_hash_index_table SELECT * from con_hash_index_table;
 INSERT INTO con_hash_index_table SELECT * from con_hash_index_table;
@@ -75,5 +74,15 @@ DROP TABLE hash_ovfl_temp_heap CASCADE;
 CREATE TABLE hash_ovfl_heap_float4 (x float4, y int);
 INSERT INTO hash_ovfl_heap_float4 VALUES (1.1,1);
 CREATE INDEX hash_idx ON hash_ovfl_heap_float4 USING hash (x);
-WARNING: hash indexes are not WAL-logged and their use is discouraged
 DROP TABLE hash_ovfl_heap_float4 CASCADE;
+--
+-- Test hash index insertion with squeeze bucket (XLOG_HASH_MOVE_PAGE_CONTENTS
+-- WAL record type).
+--
+CREATE TABLE hash_split_buckets (seqno int4, random int4);
+CREATE INDEX hash_idx ON hash_split_buckets USING hash (random int4_ops)
+with (fillfactor = 10);
+INSERT INTO hash_split_buckets (seqno, random) SELECT a, a*5 FROM
+GENERATE_SERIES(1, 10) a;
+REINDEX INDEX hash_idx;
+DROP TABLE hash_split_buckets;
diff --git a/src/test/regress/sql/concurrent_hash_index.sql b/src/test/regress/sql/concurrent_hash_index.sql
index 8f930d5..d3b09d0 100644
--- a/src/test/regress/sql/concurrent_hash_index.sql
+++ b/src/test/regress/sql/concurrent_hash_index.sql
@@ -78,3 +78,15 @@
 CREATE TABLE hash_ovfl_heap_float4 (x float4, y int);
 INSERT INTO hash_ovfl_heap_float4 VALUES (1.1,1);
 CREATE INDEX hash_idx ON hash_ovfl_heap_float4 USING hash (x);
 DROP TABLE hash_ovfl_heap_float4
Re: [HACKERS] Write Ahead Logging for Hash Indexes
Hi All,

Following are the steps I followed to verify the WAL logging of hash indexes:

1. I used Mithun's patch to improve coverage of the hash index code [1] to verify the WAL logging of hash indexes. Firstly, I confirmed whether all the XLOG records associated with hash indexes are covered by this patch. Where an XLOG record for a hash index operation was not covered, I added a test case for it. I found that one XLOG record, 'XLOG_HASH_MOVE_PAGE_CONTENTS', was not being covered, and added a small test case for it. The patch for this is available at [2].

2. I executed the regression test suite and found all the hash indexes created as part of the regression test suite using the query below:

SELECT t.relname index_name, t.oid
FROM pg_class t
JOIN pg_am idx ON idx.oid = t.relam
WHERE idx.amname = 'hash';

3. Thirdly, I calculated the number of pages associated with each hash index and compared every page of the hash index on the master and standby servers using the pg_filedump tool. For example, if the number of pages associated with 'con_hash_index' is 10, here is what I did:

On master:
----------
select pg_relation_filepath('con_hash_index');
 pg_relation_filepath
----------------------
 base/16408/16433
(1 row)

./pg_filedump -if -R 0 9 /home/edb/git-clone-postgresql/postgresql/TMP/postgres/master/base/16408/16433 > /tmp/file1

On slave:
---------
select pg_relation_filepath('con_hash_index');
 pg_relation_filepath
----------------------
 base/16408/16433
(1 row)

./pg_filedump -if -R 0 9 /home/edb/git-clone-postgresql/postgresql/TMP/postgres/standby/base/16408/16433 > /tmp/file2

I then compared file1 and file2 using a diff tool.

Following is the list of hash indexes created inside the regression database when the regression test suite was executed on the master server:

hash_i4_index
hash_name_index
hash_txt_index
hash_f8_index
con_hash_index
hash_idx

In short, this is all I did, and I found no issues during testing. Please let me know if you need any further details.

I would like to thank Amit for his support and guidance during the testing phase.

[1] - https://www.postgresql.org/message-id/CAA4eK1JOBX%3DYU33631Qh-XivYXtPSALh514%2BjR8XeD7v%2BK3r_Q%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAE9k0PkNjryhSiG53mjnKFhi%2BMipJMjSa%3DYkH-UeW3bfr1HPJQ%40mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] "Some tests to cover hash_index"
Hi,

> Well, that change should be part of Amit's patch to add WAL logging,
> not this patch, whose mission is just to improve test coverage.

I have just removed the warning message from the expected output file, as I performed the testing on Amit's latest patch, which removes this warning message from the hash index code.
Re: [HACKERS] Write Ahead Logging for Hash Indexes
> Thanks to Ashutosh Sharma for doing the testing of the patch and
> helping me in analyzing some of the above issues.

Hi All,

I would like to summarize the test cases that I have executed to validate the WAL logging in hash index feature.

1) I mainly ran the pgbench test with a read-write workload at scale factor 1000 and various client counts (16, 64, and 128) for durations of 30 minutes, 1 hour, and 24 hours. I executed this test on a highly configured power2 machine with 128 cores and 512GB of RAM. I ran the test case both with and without a replication setup. Please note that I changed the schema of the pgbench tables created during the initialization phase. The new schema of the pgbench tables looks as shown below on both master and standby:

postgres=# \d pgbench_accounts
     Table "public.pgbench_accounts"
  Column  |     Type      | Modifiers
----------+---------------+-----------
 aid      | integer       | not null
 bid      | integer       |
 abalance | integer       |
 filler   | character(84) |
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
    "pgbench_accounts_bid" hash (bid)

postgres=# \d pgbench_history
           Table "public.pgbench_history"
 Column |            Type             | Modifiers
--------+-----------------------------+-----------
 tid    | integer                     |
 bid    | integer                     |
 aid    | integer                     |
 delta  | integer                     |
 mtime  | timestamp without time zone |
 filler | character(22)               |
Indexes:
    "pgbench_history_bid" hash (bid)

2) I also made use of the following tools for analyzing the hash index tables created on master and standby.

a) pgstattuple: Using this tool, I could verify the tuple-level statistics, which include the index's physical length, dead tuple percentage, and other information, on both master and standby. However, I saw the error message below when using this tool on one of the hash index tables, on both the master and standby server:

postgres=# SELECT * FROM pgstattuple('pgbench_history_bid');
ERROR:  index "pgbench_history_bid" contains unexpected zero page at block 2482297

To investigate this, I made use of the pageinspect contrib module and found that the above block contains an uninitialized page. But this is quite possible in a hash index: bucket splits happen in powers of two, so we may find some uninitialized bucket pages.

b) pg_filedump: Using this tool, I could take a dump of all the hash index tables on master and standby and compare the dump files to verify whether there is any difference between the hash index pages on the two servers.

In short, this is all I did to verify the patch for WAL logging in hash indexes. I would like to thank Amit and Robert for their valuable inputs during the hash index testing phase. I am also planning to verify WAL logging in hash indexes using Kuntal's patch for the WAL consistency check once it is stabilized.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Write Ahead Logging for Hash Indexes
Hi All,

I am getting the following error when running the test script shared by Jeff [1]. The error is observed after executing the test script for around 3-4 hours:

57869 INSERT XX000 2016-09-14 07:58:01.211 IST:ERROR:  lock buffer_content 1 is not held
57869 INSERT XX000 2016-09-14 07:58:01.211 IST:STATEMENT:  insert into foo (index) select $1 from generate_series(1,1)

124388 INSERT XX000 2016-09-14 11:24:13.593 IST:ERROR:  lock buffer_content 10 is not held
124388 INSERT XX000 2016-09-14 11:24:13.593 IST:STATEMENT:  insert into foo (index) select $1 from generate_series(1,1)

124381 INSERT XX000 2016-09-14 11:24:13.594 IST:ERROR:  lock buffer_content 10 is not held
124381 INSERT XX000 2016-09-14 11:24:13.594 IST:STATEMENT:  insert into foo (index) select $1 from generate_series(1,1)

[1] - https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com

Please note that I am performing the test on the latest patches for concurrent hash index and WAL logging in hash index, shared by Amit yesterday.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, Sep 14, 2016 at 12:04 AM, Jesper Pedersen wrote:
> On 09/13/2016 07:41 AM, Amit Kapila wrote:
>>> README:
>>> +in_complete split flag. The reader algorithm works correctly, as it
>>> will scan
>>>
>>> What flag?
>>>
>>
>> The in-complete-split flag, which indicates that the split has to be
>> finished for that particular bucket. The values of these flags are
>> LH_BUCKET_NEW_PAGE_SPLIT and LH_BUCKET_OLD_PAGE_SPLIT for the new and
>> old bucket respectively. It is set in hasho_flag in the special area of
>> the page. I have slightly expanded the definition in the README, but am
>> not sure if it is a good idea to mention flags defined in hash.h. Let me
>> know if it is still unclear or you want something additional added to
>> the README.
>>
>
> I think it is better now.
>
>>> hashxlog.c:
>>>
>>> hash_xlog_move_page_contents
>>> hash_xlog_squeeze_page
>>>
>>> Both have "bukcetbuf" (-> "bucketbuf"), and
>>>
>>> + if (BufferIsValid(bukcetbuf));
>>>
>>> ->
>>>
>>> + if (BufferIsValid(bucketbuf))
>>>
>>> and indent following line:
>>>
>>> LockBufferForCleanup(bukcetbuf);
>>>
>>> hash_xlog_delete
>>>
>>> has the "if" issue too.
>>>
>>
>> Fixed all the above cosmetic issues.
>>
>
> I meant there is an extra ';' on the "if" statements:
>
> + if (BufferIsValid(bukcetbuf)); <--
>
> in hash_xlog_move_page_contents, hash_xlog_squeeze_page and
> hash_xlog_delete.
>
> Best regards,
> Jesper
Re: [HACKERS] Write Ahead Logging for Hash Indexes
Hi All,

Below is the backtrace for the issue reported in my earlier mail [1]. From the call stack, it looks like we are trying to release the lock on a meta page twice in _hash_expandtable().

(gdb) bt
#0  0x007b01cf in LWLockRelease (lock=0x7f55f59d0570) at lwlock.c:1799
#1  0x0078990c in LockBuffer (buffer=2, mode=0) at bufmgr.c:3540
#2  0x004a5d3c in _hash_chgbufaccess (rel=0x7f5605b608c0, buf=2, from_access=1, to_access=-1) at hashpage.c:331
#3  0x004a722d in _hash_expandtable (rel=0x7f5605b608c0, metabuf=2) at hashpage.c:995
#4  0x004a316a in _hash_doinsert (rel=0x7f5605b608c0, itup=0x2ba5030) at hashinsert.c:313
#5  0x004a0d85 in hashinsert (rel=0x7f5605b608c0, values=0x7ffdf5409c70, isnull=0x7ffdf5409c50 "", ht_ctid=0x2c2e4f4, heapRel=0x7f5605b58250, checkUnique=UNIQUE_CHECK_NO) at hash.c:248
#6  0x004c5a16 in index_insert (indexRelation=0x7f5605b608c0, values=0x7ffdf5409c70, isnull=0x7ffdf5409c50 "", heap_t_ctid=0x2c2e4f4, heapRelation=0x7f5605b58250, checkUnique=UNIQUE_CHECK_NO) at indexam.c:204
#7  0x0063f2d5 in ExecInsertIndexTuples (slot=0x2b9a2c0, tupleid=0x2c2e4f4, estate=0x2b99c00, noDupErr=0 '\000', specConflict=0x0, arbiterIndexes=0x0) at execIndexing.c:388
#8  0x0066a1fc in ExecInsert (mtstate=0x2b99e50, slot=0x2b9a2c0, planSlot=0x2b9a2c0, arbiterIndexes=0x0, onconflict=ONCONFLICT_NONE, estate=0x2b99c00, canSetTag=1 '\001') at nodeModifyTable.c:481
#9  0x0066b841 in ExecModifyTable (node=0x2b99e50) at nodeModifyTable.c:1496
#10 0x00645e51 in ExecProcNode (node=0x2b99e50) at execProcnode.c:396
#11 0x006424d9 in ExecutePlan (estate=0x2b99c00, planstate=0x2b99e50, use_parallel_mode=0 '\000', operation=CMD_INSERT, sendTuples=0 '\000', numberTuples=0, direction=ForwardScanDirection, dest=0xd73c20) at execMain.c:1567
#12 0x0064087a in standard_ExecutorRun (queryDesc=0x2b89c70, direction=ForwardScanDirection, count=0) at execMain.c:338
#13 0x7f55fe590605 in pgss_ExecutorRun (queryDesc=0x2b89c70, direction=ForwardScanDirection, count=0) at pg_stat_statements.c:877
#14 0x00640751 in ExecutorRun (queryDesc=0x2b89c70, direction=ForwardScanDirection, count=0) at execMain.c:284
#15 0x007c331e in ProcessQuery (plan=0x2b873f0, sourceText=0x2b89bd0 "insert into foo (index) select $1 from generate_series(1,1)", params=0x2b89c20, dest=0xd73c20, completionTag=0x7ffdf540a3f0 "") at pquery.c:187
#16 0x007c4a0c in PortalRunMulti (portal=0x2ae7930, isTopLevel=1 '\001', setHoldSnapshot=0 '\000', dest=0xd73c20, altdest=0xd73c20, completionTag=0x7ffdf540a3f0 "") at pquery.c:1303
#17 0x007c4055 in PortalRun (portal=0x2ae7930, count=9223372036854775807, isTopLevel=1 '\001', dest=0x2b5dc90, altdest=0x2b5dc90, completionTag=0x7ffdf540a3f0 "") at pquery.c:815
#18 0x007bfb45 in exec_execute_message (portal_name=0x2b5d880 "", max_rows=9223372036854775807) at postgres.c:1977
#19 0x007c25c7 in PostgresMain (argc=1, argv=0x2b05df8, dbname=0x2aca3c0 "postgres", username=0x2b05de0 "edb") at postgres.c:4133
#20 0x00744d8f in BackendRun (port=0x2aefa60) at postmaster.c:4260
#21 0x00744523 in BackendStartup (port=0x2aefa60) at postmaster.c:3934
#22 0x00740f9c in ServerLoop () at postmaster.c:1691
#23 0x00740623 in PostmasterMain (argc=4, argv=0x2ac81f0) at postmaster.c:1299
#24 0x006940fe in main (argc=4, argv=0x2ac81f0) at main.c:228

Please let me know if you need any further inputs.

[1] - https://www.postgresql.org/message-id/CAE9k0Pmxh-4NAr4GjzDDFHdBKDrKy2FV-Z%2B2Tp8vb2Kmxu%3D6zg%40mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Wed, Sep 14, 2016 at 2:45 PM, Ashutosh Sharma wrote:
> Hi All,
>
> I am getting the following error when running the test script shared by
> Jeff [1]. The error is observed upon executing the test script for
> around 3-4 hrs.
>
> 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:ERROR: lock
> buffer_content 1 is not held
> 57869 INSERT XX000 2016-09-14 07:58:01.211 IST:STATEMENT: insert into
> foo (index) select $1 from generate_series(1,1)
>
> 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:ERROR: lock
> buffer_content 10 is not held
> 124388 INSERT XX000 2016-09-14 11:24:13.593 IST:STATEMENT: insert
> into foo (index) select $1 from generate_series(1,1)
>
> 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:ERROR: lock
> buffer_content 10 is not held
> 124381 INSERT XX000 2016-09-14 11:24:13.594 IST:STATEMENT: insert
> into foo (index) select $1 from generate_series(1,1)
>
> [1]-
> https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gma
Re: [HACKERS] pageinspect: Hash index support
Hi,

I am getting a fatal error along with some warnings when trying to apply the v3 patch using the "git apply" command.

[ashu@localhost postgresql]$ git apply 0001-pageinspect-Hash-index-support_v3.patch
0001-pageinspect-Hash-index-support_v3.patch:29: trailing whitespace.
brinfuncs.o ginfuncs.o hashfuncs.o $(WIN32RES)
0001-pageinspect-Hash-index-support_v3.patch:36: trailing whitespace.
DATA = pageinspect--1.6.sql pageinspect--1.5--1.6.sql \
0001-pageinspect-Hash-index-support_v3.patch:37: trailing whitespace.
pageinspect--1.4--1.5.sql pageinspect--1.3--1.4.sql \
0001-pageinspect-Hash-index-support_v3.patch:38: trailing whitespace.
pageinspect--1.2--1.3.sql pageinspect--1.1--1.2.sql \
0001-pageinspect-Hash-index-support_v3.patch:39: trailing whitespace.
pageinspect--1.0--1.1.sql pageinspect--unpackaged--1.0.sql
fatal: git apply: bad git-diff - expected /dev/null on line 46

Also, I think the USAGE for hash_metap() is referring to hash_metap_bytea(). Please correct it. I have just started reviewing the patch and will keep posting my comments upon further review. Thanks.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Tue, Sep 20, 2016 at 10:15 PM, Jeff Janes wrote:
> On Tue, Sep 20, 2016 at 5:40 AM, Jesper Pedersen wrote:
>>
>> On 09/20/2016 03:19 AM, Michael Paquier wrote:
>>>
>>> You did not get right the comments from Alvaro upthread.
>>> The following
>>> functions are added with this patch:
>>> function hash_metap(text)
>>> function hash_metap_bytea(bytea)
>>> function hash_page_items(text,integer)
>>> function hash_page_items_bytea(bytea)
>>> function hash_page_stats(text,integer)
>>> function hash_page_stats_bytea(bytea,integer)
>>>
>>> Now the following set of functions would be sufficient:
>>> function hash_metapage_info(bytea)
>>> function hash_page_items(bytea)
>>> function hash_page_stats(bytea)
>>> The last time pageinspect has been updated, when BRIN functions have
>>> been added, it has been discussed to just use (bytea) as an argument
>>> interface and just rely on get_raw_page() to get the pages wanted, so
>>> I think that we had better stick with that and keep things simple.
>>>
>>
>> Yes, I know, Alvaro and you voted for the bytea methods, and Jeff asked
>> for both.
>>
>> Attached is v3 with only the bytea based methods.
>>
>> Alvaro, Michael and Jeff - Thanks for the review!
>
> Is the 2nd "1" in this call needed?
>
> SELECT * FROM hash_page_stats(get_raw_page('mytab_index', 1), 1)
>
> As far as I can tell, that argument is only used to stuff into the output
> field "blkno"; it is not used to instruct the interpretation of the raw page
> itself. It doesn't seem worthwhile to have a parameter that only echoes
> back to the user what the user already put in (twice). The only existing
> functions which take the blkno argument are those that don't use the
> get_raw_page style.
>
> Also, should we document what the single-letter values mean in the
> hash_page_stats.type column? It is not obvious that 'i' means bitmap, for
> example.
>
> Cheers,
>
> Jeff

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
bucket page? I guess we can't have such a type of hash page, and in case we do find such an undefined page type, I think we should error out instead of reading the page, because it is quite possible that the page is corrupted. Please let me know your thoughts on this.

5). I think we have added enough functions to show the page-level statistics, but not the index-level statistics such as the total number of overflow pages, bucket pages, free overflow pages, bitmap pages etc. in the hash index. How about adding a function that would report the index-level statistics?

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pageinspect: Hash index support
Hi,

> git apply w/ v4 works here, so you will have to investigate what happens on
> your side.

Thanks, it works with the v4 patch.

> As these functions are marked as superuser only it is expected that people
> provide the correct input, especially since the raw page structure is used
> as the input.

Well, page_stats / page_items do accept a bitmap page as input, but these functions are not defined to read bitmap pages, since a bitmap page does not have a standard page layout. In fact, if we pass a bitmap page as input to page_stats / page_items, the output is always the same. I think we need a separate function to read bitmap pages. And I am not sure why the server should crash if I pass a meta page to hash_page_stats or hash_page_items.

> The "if" statement will need updating once the CHI patch goes in, as it
> defines new constants for page types.

I am not sure the CHI patch adds a new type of hash page, other than holding information about a split in progress. Basically, my point was: can we have hash pages of types other than meta page, bucket page, overflow page and bitmap page? If the pageinspect tool finds a page that does not fall under any of these categories, shouldn't it error out?

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Write Ahead Logging for Hash Indexes
Hi All,

> I forgot to mention that Ashutosh has tested this patch for a day
> using Jeff's tool and he didn't find any problem. Also, he has found
> a way to easily reproduce the problem. Ashutosh, can you share your
> changes to the script using which you have reproduced the problem?

I made slight changes in Jeff Janes's patch (crash_REL10.patch) to reproduce the issue reported by him earlier [1]. I changed crash_REL10.patch such that a torn page write happens only for a bitmap page in a hash index. As described by Amit in [2] and [3], we were not logging a full page image for the bitmap page, and that is why we were not able to restore a bitmap page in case it is torn, which would eventually result in the error below.

38422 01000 2016-09-19 16:25:50.061 PDT:WARNING: page verification failed, calculated checksum 65067 but expected 21260
38422 01000 2016-09-19 16:25:50.061 PDT:CONTEXT: xlog redo at 3F/22053B50 for Hash/ADD_OVFL_PAGE: bmsize 4096, bmpage_found T
38422 XX001 2016-09-19 16:25:50.071 PDT:FATAL: invalid page in block 9 of relation base/16384/17334

Jeff Janes's original patch had the following condition for a torn page write:

    if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress())
        nbytes = FileWrite(v->mdfd_vfd, buffer, BLCKSZ/3);

I added some more conditions to the above "if" to ensure that the issue associated with the bitmap page gets reproduced easily:

    if (JJ_torn_page > 0 && counter++ > JJ_torn_page && !RecoveryInProgress() &&
        bucket_opaque->hasho_page_id == HASHO_PAGE_ID &&
        bucket_opaque->hasho_flag & LH_BITMAP_PAGE)
        nbytes = FileWrite(v->mdfd_vfd, buffer, 24);

Also, please note that I am allowing only 24 bytes of data, i.e. just a page header, to be written into the disk file, so that the issue will be observed as soon as a single overflow page is added to the hash index.

Note: You can find Jeff Janes's patch for torn page writes at [4].
[1] - https://www.postgresql.org/message-id/CAMkU%3D1zGcfNTWWxnud8j_p0vb1ENGcngkwqZgPUEnwZmAN%2BXQw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1LmQZGnYhSHXDDCOsSb_0U-gsxReEmSDRgCZr%3DAdKbTEg%40mail.gmail.com
[3] - https://www.postgresql.org/message-id/CAA4eK1%2Bk9tGPw7vOACRo%2B4h5n%3DutHnSuGgMoJrLANybAjNBW9w%40mail.gmail.com
[4] - https://www.postgresql.org/message-id/CAMkU%3D1xRt8jBBB7g_7K41W00%3Dbr9UrxMVn_rhWhKPLaHfEdM5A%40mail.gmail.com

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] pg_basebackup creates a corrupt file for pg_stat_tmp and pg_replslot on a backup location
Hi Peter,

> I just wanted to update you, I have taken this commit fest entry patch
> to review because I think it will be addressed as part of "Exclude
> additional directories in pg_basebackup", which I'm also reviewing.
> Therefore, I'm not actually planning on discussing this patch further.
> Please correct me if this assessment does not match your expectations.

Thanks for the update. I am absolutely OK with it. I feel it would be a good idea to review "Exclude additional directories in pg_basebackup", which also addresses the issue reported by me.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] dead or outdated URLs found in win32.h
Hi All,

The following URLs, present in src/include/port/win32.h to share information on dllimport/dllexport (used on Windows), seem to be either dead or outdated, and I think they need to be updated.

http://support.microsoft.com/kb/132044 -- dead
http://msdn.microsoft.com/en-us/library/8fskxacy(v=vs.80).aspx -- outdated
http://msdn.microsoft.com/en-us/library/a90k134d(v=vs.80).aspx -- outdated
https://msdn.microsoft.com/en-gb/library/aa489609.aspx -- outdated

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] Microvacuum support for Hash Index
Hi All,

I have added microvacuum support for the hash index access method, and attached is the v1 patch for the same. The patch basically takes care of the following things:

1. Firstly, it changes the marking of dead tuples from 'tuple-at-a-time' to 'page-at-a-time' during a hash index scan. For this, we accumulate the heap TIDs and offsets of all the hash index tuples flagged by kill_prior_tuple during the scan, and then mark all accumulated TIDs as LP_DEAD either while stepping from one page to another (covering scans in both the forward and backward direction), at the end of the hash index scan, or during a rescan.

2. Secondly, when inserting a tuple into the hash index, if not enough space is found on the current page, it ensures that we first clean any dead tuples found on the current hash index page before moving to the next page in the bucket chain or going for a bucket split. This basically increases page reusability and reduces the number of page splits, thereby reducing the overall size of the hash index.

I have compared the hash index size with and without my patch (microvacuum_hash_index_v1.patch, attached with this mail) on a high-end machine at various scale factors, and the results are shown below. For this test, I created a hash index (pgbench_accounts_aid) on the aid column of the 'pgbench_accounts' table instead of the primary key, and the results shown in the table below are for the same. The patch (pgbench.patch) with these changes is also attached with this mail. Moreover, I am using my own script file (file_hash_kill_prior_tuple) for updating the index column with the pgbench read-write command. The script file 'file_hash_kill_prior_tuple' is also attached with this mail.
Here are some initial test results showing the benefit of this patch:

postgresql.conf and pgbench settings:
autovacuum = off
client counts = 64
run time duration = 15 mins
./pgbench -c $threads -j $threads -T 900 postgres -f ~/file_hash_kill_prior_tuple

Scale Factor | hash index size @ start | HEAD    | HEAD + Patch
10           | 32 MB                   | 579 MB  | 158 MB
50           | 128 MB                  | 630 MB  | 350 MB
100          | 256 MB                  | 1255 MB | 635 MB
300          | 1024 MB                 | 2233 MB | 1093 MB

As shown in the above results, at scale factor 10 the hash index size is reduced by almost 4 times, whereas at scale factors 50 and 300 it is reduced by half with my patch. This basically shows that we can reduce the hash index size to a good extent with this patch.

System specifications:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel

Note: The patch (microvacuum_hash_index_v1.patch) is prepared on top of concurrent_hash_index_v8.patch [1] and wal_hash_index_v5.1.patch [2] for hash index.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BX%3D8sUd1UCZDZnE3D9CGi9kw%2Bkjxp2Tnw7SX5w8pLBNw%40mail.gmail.com
[2] - https://www.postgresql.org/message-id/CAA4eK1KE%3D%2BkkowyYD0vmch%3Dph4ND3H1tViAB%2B0cWTHqjZDDfqg%40mail.gmail.com

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index db73f05..a0720ef 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -325,14 +325,21 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 	if (scan->kill_prior_tuple)
 	{
 		/*
-		 * Yes, so mark it by setting the LP_DEAD state in the item flags.
+		 * Yes, so remember it for later. (We'll deal with all such
+		 * tuples at once right after leaving the index page or at
+		 * end of scan.)
 		 */
-		ItemIdMarkDead(PageGetItemId(page, offnum));
+		if (so->killedItems == NULL)
+			so->killedItems = palloc(MaxIndexTuplesPerPage *
+									 sizeof(HashScanPosItem));
 
-		/*
-		 * Since this can be redone later if needed, mark as a hint.
-		 */
-		MarkBufferDirtyHint(buf, true);
+		if (so->numKilled < MaxIndexTuplesPerPage)
+		{
+			so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
+			so->killedItems[so->numKilled].indexOffset =
+				ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+			so->numKilled++;
+		}
 	}
 
 	/*
@@ -439,6 +446,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 
 	so->hashso_skip_moved_tuples = false;
 
+	so->killedItems = NULL;
+	so->numKilled = 0;
+
 	scan->opaque = so;
 
 	return scan;
@@ -454,6 +464,10 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
+	/* Before leaving current page, deal with any killed items */
+	if (so->numKilled > 0)
+		hashkillitems(scan);
+
 	_hash_dropscanbuf(rel, so);
 
 	/* set p
Re: [HACKERS] Microvacuum support for Hash Index
Hi Amit,

Thanks for showing your interest and reviewing my patch. I have started looking into your review comments. I will share the updated patch in a day or two.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Fri, Oct 28, 2016 at 4:42 PM, Amit Kapila wrote:
> On Mon, Oct 24, 2016 at 2:21 PM, Ashutosh Sharma wrote:
>> Hi All,
>>
>> I have added a microvacuum support for hash index access method and
>> attached is the v1 patch for the same.
>>
>
> This is an important functionality for hash index as we already have
> the same functionality for other types of indexes like btree.
>
>> The patch basically takes care
>> of the following things:
>>
>> 1. Firstly, it changes the marking of dead tuples from
>> 'tuple-at-a-time' to 'page-at-a-time' during hash index scan. For this
>> we accumulate the heap tids and offset of all the hash index tuples if
>> it is pointed by kill_prior_tuple during scan and then mark all
>> accumulated tids as LP_DEAD either while stepping from one page to
>> another (assuming the scan in both forward and backward direction) or
>> during end of the hash index scan or during rescan.
>>
>> 2. Secondly, when inserting tuple into hash index table, if not enough
>> space is found on a current page then it ensures that we first clean
>> the dead tuples if found in the current hash index page before moving
>> to the next page in a bucket chain or going for a bucket split. This
>> basically increases the page reusability and reduces the number of
>> page splits, thereby reducing the overall size of hash index table.
>>
>
> Few comments on patch:
>
> 1.
> +static void
> +hash_xlog_vacuum_one_page(XLogReaderState *record)
> +{
> +	XLogRecPtr	lsn = record->EndRecPtr;
> +	xl_hash_vacuum *xldata = (xl_hash_vacuum *) XLogRecGetData(record);
> +	Buffer		bucketbuf = InvalidBuffer;
> +	Buffer		buffer;
> +	Buffer		metabuf;
> +	Page		page;
> +	XLogRedoAction action;
>
> While replaying the delete/vacuum record on standby, it can conflict
> with some already running queries. Basically the replay can remove
> some row which can be visible on standby. You need to resolve
> conflicts similar to what we do in btree delete records (refer
> btree_xlog_delete).
>
> 2.
> +	/*
> +	 * Write-lock the meta page so that we can decrement
> +	 * tuple count.
> +	 */
> +	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
> +
> +	_hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
> +						  (buf == bucket_buf) ? true : false);
> +
> +	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
>
> It seems here the meta page lock is acquired for longer than required,
> and it is also not required when there are no deletable items on the
> page. You can take the meta page lock just before decrementing the count.
>
> 3.
> 	Assert(offnum <= maxoff);
> +
>
> Spurious space. There are some other similar spurious whitespace
> changes in the patch; remove them as well.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
Hi,

> While replaying the delete/vacuum record on standby, it can conflict
> with some already running queries. Basically the replay can remove
> some row which can be visible on standby. You need to resolve
> conflicts similar to what we do in btree delete records (refer
> btree_xlog_delete).

Agreed. Thanks for pointing this out. I have taken care of it in the attached v2 patch.

> +	/*
> +	 * Write-lock the meta page so that we can decrement
> +	 * tuple count.
> +	 */
> +	_hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
> +
> +	_hash_vacuum_one_page(rel, metabuf, buf, bucket_buf,
> +						  (buf == bucket_buf) ? true : false);
> +
> +	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
>
> It seems here meta page lock is acquired for duration more than
> required and also it is not required when there are no deletable items
> on page. You can take the metapage lock before decrementing the count.

OK, corrected. Please refer to the attached v2 patch.

> Spurious space. There are some other similar spurious white space
> changes in patch, remove them as well.

Corrected. Please refer to the attached v2 patch.
diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index db73f05..4a4d614 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -157,7 +157,7 @@ hashbuild(Relation heap, Relation index, IndexInfo *indexInfo)
 	if (buildstate.spool)
 	{
 		/* sort the tuples and insert them into the index */
-		_h_indexbuild(buildstate.spool);
+		_h_indexbuild(buildstate.spool, heap->rd_node);
 		_h_spooldestroy(buildstate.spool);
 	}
@@ -196,6 +196,8 @@ hashbuildCallback(Relation index,
 	Datum		index_values[1];
 	bool		index_isnull[1];
 	IndexTuple	itup;
+	Relation	rel;
+	RelFileNode rnode;
 
 	/* convert data to a hash key; on failure, do not insert anything */
 	if (!_hash_convert_tuple(index,
@@ -212,8 +214,12 @@ hashbuildCallback(Relation index,
 	/* form an index tuple and point it at the heap tuple */
 	itup = index_form_tuple(RelationGetDescr(index), index_values, index_isnull);
+	/* Get RelFileNode from relation OID */
+	rel = relation_open(htup->t_tableOid, NoLock);
+	rnode = rel->rd_node;
+	relation_close(rel, NoLock);
 	itup->t_tid = htup->t_self;
-	_hash_doinsert(index, itup);
+	_hash_doinsert(index, itup, rnode);
 	pfree(itup);
 }
@@ -245,7 +251,7 @@ hashinsert(Relation rel, Datum *values, bool *isnull,
 	itup = index_form_tuple(RelationGetDescr(rel), index_values, index_isnull);
 	itup->t_tid = *ht_ctid;
-	_hash_doinsert(rel, itup);
+	_hash_doinsert(rel, itup, heapRel->rd_node);
 
 	pfree(itup);
@@ -325,14 +331,21 @@ hashgettuple(IndexScanDesc scan, ScanDirection dir)
 	if (scan->kill_prior_tuple)
 	{
 		/*
-		 * Yes, so mark it by setting the LP_DEAD state in the item flags.
+		 * Yes, so remember it for later. (We'll deal with all such
+		 * tuples at once right after leaving the index page or at
+		 * end of scan.)
 		 */
-		ItemIdMarkDead(PageGetItemId(page, offnum));
+		if (so->killedItems == NULL)
+			so->killedItems = palloc(MaxIndexTuplesPerPage *
+									 sizeof(HashScanPosItem));
 
-		/*
-		 * Since this can be redone later if needed, mark as a hint.
-		 */
-		MarkBufferDirtyHint(buf, true);
+		if (so->numKilled < MaxIndexTuplesPerPage)
+		{
+			so->killedItems[so->numKilled].heapTid = so->hashso_heappos;
+			so->killedItems[so->numKilled].indexOffset =
+				ItemPointerGetOffsetNumber(&(so->hashso_curpos));
+			so->numKilled++;
+		}
 	}
 
 	/*
@@ -439,6 +452,9 @@ hashbeginscan(Relation rel, int nkeys, int norderbys)
 
 	so->hashso_skip_moved_tuples = false;
 
+	so->killedItems = NULL;
+	so->numKilled = 0;
+
 	scan->opaque = so;
 
 	return scan;
@@ -454,6 +470,10 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
+	/* Before leaving current page, deal with any killed items */
+	if (so->numKilled > 0)
+		hashkillitems(scan);
+
 	_hash_dropscanbuf(rel, so);
 
 	/* set position invalid (this will cause _hash_first call) */
@@ -480,6 +500,10 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
+	/* Before leaving current page, deal with any killed items */
+	if (so->numKilled > 0)
+		hashkillitems(scan);
+
 	_hash_dropscanbuf(rel, so);
 
 	pfree(so);
@@ -809,6 +833,15 @@ hashbucketcleanup(Relation rel, Buffer bucket_buf,
 			PageIndexMultiDelete(page, deletable, ndeletable);
 			bucket_dirty = true;
 
+			/*
+			 * Let us mark the page as clean if vacuum removes the DEAD tuples
+			 * from an index page. We do this by clearing LH_PAGE_HAS_DEAD_TUPLES
+			 * flag.
+			 */
+			if (tuples_removed && *tuples_removed > 0 &&
+				opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
+				opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
+
 			MarkBufferDirty(buf);
 
 			/* XLOG stuff */
diff --git a/src/backend/acce
Re: [HACKERS] Re: [COMMITTERS] pgsql: Change the way that LWLocks for extensions are allocated.
Hi,

I have started the review of this patch and would like to share some initial review comments that require the author's attention.

1) I am getting some trailing whitespace errors when trying to apply this patch using the git apply command. Following are the error messages I am getting:

[edb@localhost postgresql]$ git apply test_pg_stat_statements-v1.patch
test_pg_stat_statements-v1.patch:28: trailing whitespace.
	$(MAKE) -C $(top_builddir)/src/test/regress all
test_pg_stat_statements-v1.patch:41: space before tab in indent.
	$(REGRESSCHECKS)
warning: 2 lines add whitespace errors.

2) The test case you have added is failing, because the queryid is going to be different every time you execute the test case. I think it would be better to remove the queryid column from "SELECT queryid, query, calls, rows from pg_stat_statements ORDER BY rows". Below is the output I got upon test case execution:

make regresscheck
== running regression test queries ==
test pg_stat_statements ... FAILED
== shutting down postmaster ==

3) OK. As you mentioned upthread, I do agree that we can't add pg_stat_statements tests for installcheck, as this module requires shared_preload_libraries to be set. But, if there is an already existing installation with the pg_stat_statements module loaded, we should allow the test to run on that installation if requested explicitly by the user. I think we can add a target like 'installcheck-force' in the Makefile that would help end users run the test on their own installation. Thoughts?

Also, I have changed the status in the commitfest from "Needs review" to "Waiting on Author".

On Mon, Sep 19, 2016 at 6:26 PM, Amit Kapila wrote:
> On Sat, Sep 10, 2016 at 5:02 PM, Amit Kapila wrote:
>> On Tue, Aug 30, 2016 at 8:01 AM, Andres Freund wrote:
>>> On 2016-08-30 07:57:19 +0530, Amit Kapila wrote:
>>>> I will write such a test case either in this week or early next week.
>>>
>>> Great.
>>>
>>
>> Okay, attached patch adds some simple tests for pg_stat_statements.
>> One thing to note is that we can't add pg_stat_statements tests for
>> installcheck as this module requires shared_preload_libraries to be
>> set. Hope this suffices the need.
>>
>
> Registered this patch for the next CF.
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
>
> --
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Applying XLR_INFO_MASK correctly when looking at WAL record information
Hi, > What about the patch attached to make things more consistent? I have reviewed this patch. It does serve the purpose and looks sane to me. I am marking it as ready for committer. With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Re: [COMMITTERS] pgsql: Change the way that LWLocks for extensions are allocated.
Hi, > I had also thought about it while preparing patch, but I couldn't find > any clear use case. I think it could be useful during development of > a module, but not sure if it is worth to add a target. I think there > is no harm in adding such a target, but lets take an opinion from > committer or others as well before adding this target. What do you > say? Ok. We can do that. I have verified the updated v2 patch. The patch looks good to me. I am marking it as ready for committer. Thanks. With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
Hi Jesper,

> Some initial comments.
>
> _hash_vacuum_one_page:
>
> +	END_CRIT_SECTION();
> +	_hash_chgbufaccess(rel, metabuf, HASH_READ, HASH_NOLOCK);
>
> The _hash_chgbufaccess() needs a comment.
>
> You also need a place where you call pfree for so->killedItems - maybe in
> hashkillitems().

Thanks for reviewing this patch. I would like to let you know that this patch depends on the patches for concurrent hash index and WAL logging in hash index. So, until those two hash index patches are stable, I won't be able to share the next version of the patch for supporting microvacuum in hash index.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Parallel worker crash on seqscan
Hi,

> the following query appears to reliably crash parallel workers on master
> as of 0832f2d.
>
> --8<---cut here---start->8---
> set max_parallel_workers_per_gather to 2;
> set force_parallel_mode to 1;
>
> select subq.context from pg_settings,
> lateral (select context from pg_opclass limit 1) as subq
> limit 1;
> --8<---cut here---end--->8---

As suggested, I have tried to reproduce this issue on commit 0832f2d but could not reproduce it. I also tried it on the latest commit in the master branch and could not reproduce it there either. Amit (included in this email thread) has also tried it once, and he was also not able to reproduce it. Could you please let me know if there is something more that needs to be done to reproduce it, other than what you have shared above? Thanks.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Should we cacheline align PGXACT?
Hi,

>> On Thu, Feb 16, 2017 at 5:07 AM, Alexander Korotkov wrote:
>>> On Wed, Feb 15, 2017 at 8:49 PM, Alvaro Herrera wrote:
>>>> Alexander Korotkov wrote:
>>>>
>>>>> Difference between master, pgxact-align-2 and pgxact-align-3 doesn't
>>>>> exceed per run variation.
>>>>
>>>> FWIW this would be more visible if you added error bars to each data
>>>> point. Should be simple enough in gnuplot ...
>>>
>>> Good point.
>>> Please find graph of mean and errors in attachment.
>>
>> So ... no difference?
>
> Yeah, nothing surprising. It's just another graph based on the same data.
> I wonder how pgxact-align-3 would work on machine of Ashutosh Sharma,
> because I observed regression there in write-heavy benchmark of
> pgxact-align-2.

I am yet to benchmark the pgxact-align-3 patch on a read-write workload. I could not do it as our benchmarking machine was already reserved for some other test work. But I am planning to do it this weekend and will try to post my results by Monday evening. Thank you, and sorry for the delayed response.

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Should we cacheline align PGXACT?
Hi,

> >> On Thu, Feb 16, 2017 at 5:07 AM, Alexander Korotkov wrote:
> >> > On Wed, Feb 15, 2017 at 8:49 PM, Alvaro Herrera wrote:
> >> >> Alexander Korotkov wrote:
> >> >>
> >> >> > Difference between master, pgxact-align-2 and pgxact-align-3 doesn't
> >> >> > exceed per run variation.
> >> >>
> >> >> FWIW this would be more visible if you added error bars to each data
> >> >> point. Should be simple enough in gnuplot ...
> >> >
> >> > Good point.
> >> > Please find graph of mean and errors in attachment.
> >>
> >> So ... no difference?
> >
> > Yeah, nothing surprising. It's just another graph based on the same data.
> > I wonder how pgxact-align-3 would work on machine of Ashutosh Sharma,
> > because I observed regression there in write-heavy benchmark of
> > pgxact-align-2.
>
> I am yet to benchmark pgxact-align-3 patch on a read-write workload. I
> could not do it as our benchmarking machine was already reserved for
> some other test work. But, I am planning to do it on this weekend.
> Will try to post my results by Monday evening. Thank you and sorry for
> the delayed response.

Following are the pgbench results for the read-write workload I got with the pgxact-align-3 patch. The results are for scale factor 300 with 8GB of shared buffers, i.e. when the data fits into the shared buffers. For scale factor 1000 with 8GB of shared buffers the test is still running; once it is completed I will share those results as well.

pgbench settings:
pgbench -i -s 300 postgres
pgbench -M prepared -c $thread -j $thread -T $time_for_reading postgres

where, time_for_reading = 30mins

non-default GUC params:
shared_buffers=8GB
max_connections=300

pg_xlog is located on SSD.
Machine details:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7-8830 @ 2.13GHz

CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
4              4283         4220          -1.47093159
8              7291         7261          -0.4114661912
16             11909        12149          2.015282559
32             20789        20745         -0.211650392
64             28412        27831         -2.044910601
128            29369        28765         -2.056590282
156            27949        27189         -2.719238613
180            27876        27171         -2.529057254
196            28849        27872         -3.386599189
256            30321        28188         -7.034728406

Also, an Excel sheet (results-read-write-300-SF) containing the results for all 3 runs is attached.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

results-read-write-300-SF.xlsx
Description: MS-Excel 2007 spreadsheet
Re: [HACKERS] pageinspect: Hash index support
Thanks for reporting it. This is because of incorrect data typecasting. Attached is the patch that fixes this issue.

On Tue, Feb 21, 2017 at 2:58 PM, Mithun Cy wrote:
> On Fri, Feb 10, 2017 at 1:06 AM, Robert Haas wrote:
>
> > Alright, committed with a little further hacking.
>
> I did pull the latest code, and tried
> Test:
>
> create table t1(t int);
> create index i1 on t1 using hash(t);
> insert into t1 select generate_series(1, 1000);
>
> postgres=# SELECT spares FROM hash_metapage_info(get_raw_page('i1', 0));
>   spares
> {0,0,0,1,9,17,33,65,-127,1,1,0,1,-1,-1,-4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}
>
> spares are showing negative numbers; I think the wrong type has been
> chosen, seems it is rounding at 127, spares are defined as following
> uint32 hashm_spares[HASH_MAX_SPLITPOINTS]; /* spare pages before each
> splitpoint */
>
> it should be always positive.
>
> --
> Thanks and Regards
> Mithun C Y
> EnterpriseDB: http://www.enterprisedb.com

typecast_fix.patch
Description: invalid/octet-stream
Re: [HACKERS] Should we cacheline align PGXACT?
On Tue, Feb 21, 2017 at 4:21 PM, Alexander Korotkov wrote:
>
> Hi, Ashutosh!
>
> On Mon, Feb 20, 2017 at 1:20 PM, Ashutosh Sharma wrote:
>>
>> Following are the pgbench results for read-write workload, I got with
>> pgxact-align-3 patch. The results are for 300 scale factor with 8GB of
>> shared buffer i.e. when data fits into the shared buffer. For 1000 scale
>> factor with 8GB shared buffer the test is still running, once it is
>> completed I will share the results for that as well.
>>
>> pgbench settings:
>> pgbench -i -s 300 postgres
>> pgbench -M prepared -c $thread -j $thread -T $time_for_reading postgres
>>
>> where, time_for_reading = 30mins
>>
>> non default GUC param:
>> shared_buffers=8GB
>> max_connections=300
>>
>> pg_xlog is located in SSD.
>
> Thank you for testing.
> It seems that there is still regression. While padding was reduced from 116
> bytes to 4 bytes, it makes me think that probably there is something wrong
> in testing methodology.
> Are you doing re-initdb and pgbench -i before each run? I would ask you to
> do the both.
> Also, if regression would still exist, let's change the order of versions.
> Thus, if you run master before patched version, I would ask you to run
> patched version before master.

Yes, there is still some regression; however, it has come down by a small margin. I am not doing initdb for each run; instead I am doing dropdb --> createdb --> pgbench -i. Is dropping the old database and creating a new one for every run not okay? Do I have to do initdb every time? Okay, I can change the order of reading and let you know the results.
Re: [HACKERS] Should we cacheline align PGXACT?
On Tue, Feb 21, 2017 at 5:52 PM, Alexander Korotkov wrote:
> On Tue, Feb 21, 2017 at 2:37 PM, Andres Freund wrote:
>>
>> On 2017-02-21 16:57:36 +0530, Ashutosh Sharma wrote:
>> > Yes, there is still some regression however it has come down by a
>> > small margin. I am not doing initdb for each run instead I am doing,
>> > dropdb-->createdb-->pgbench -i. Is dropping old database and creating
>> > a new one for every run not okay, Do I have to do initdb every time.
>> > Okay, I can change the order of reading and let you know the results.
>>
>> That does make a difference. Primarily because WAL writes in a new
>> cluster are more expensive than in an old one, because of segment
>> recycling.
>
> +1

Okay, understood. I will take care of it when I start the next round of testing. Thanks.
Re: [HACKERS] Should we cacheline align PGXACT?
Okay. As suggested by Alexander, I have changed the order of reading and am doing initdb for each pgbench run. With these changes, I got the following results at scale factor 300 with 8GB of shared buffers.

pgbench settings:
pgbench -i -s 300 postgres
pgbench -M prepared -c $thread -j $thread -T $time_for_reading postgres

where, time_for_reading = 30mins

non-default GUC params:
shared_buffers=8GB
max_connections=300

pg_xlog is located on SSD.

CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
4              2803         2843           1.427042455
8              5315         5225          -1.69332079
32             19755        19669         -0.4353328271
64             28679        27980         -2.437323477
128            28887        28008         -3.042891266
156            27465        26728         -2.683415256
180            27425        26697         -2.654512306
196            28826        27396         -4.960799278
256            29787        28107         -5.640044315

The test machine details are as follows:

Machine details:
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7-8830 @ 2.13GHz

Also, an Excel sheet (results-readwrite-300-SF) containing the results for all 3 runs is attached.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Thu, Feb 23, 2017 at 2:44 AM, Alvaro Herrera wrote:
> I wonder if this "perf c2c" tool with Linux 4.10 might be useful in
> studying this problem.
> https://joemario.github.io/blog/2016/09/01/c2c-blog/
>
> --
> Álvaro Herrera    https://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

results-readwrite-300-SF.xlsx
Description: MS-Excel 2007 spreadsheet
Re: [HACKERS] Should we cacheline align PGXACT?
On Fri, Feb 24, 2017 at 10:41 AM, Simon Riggs wrote:
> On 24 February 2017 at 04:41, Ashutosh Sharma wrote:
>>
>> Okay. As suggested by Alexander, I have changed the order of reading and
>> doing initdb for each pgbench run. With these changes, I got following
>> results at 300 scale factor with 8GB of shared buffer.
>
> Would you be able to test my patch also please?

Sure, I can do that and share the results by early next week. Thanks.
Re: [HACKERS] Should we cacheline align PGXACT?
Hi,

On Fri, Feb 24, 2017 at 12:22 PM, Ashutosh Sharma wrote:
> On Fri, Feb 24, 2017 at 10:41 AM, Simon Riggs wrote:
>> On 24 February 2017 at 04:41, Ashutosh Sharma wrote:
>>>
>>> Okay. As suggested by Alexander, I have changed the order of reading and
>>> doing initdb for each pgbench run. With these changes, I got following
>>> results at 300 scale factor with 8GB of shared buffer.
>>
>> Would you be able to test my patch also please?
>
> Sure, I can do that and share the results by early next week. Thanks.

So, here are the pgbench results I got with 'reduce_pgxact_access_AtEOXact.v2.patch' on a read-write workload.

pgbench settings:
pgbench -i -s 300 postgres
pgbench -M prepared -c $thread -j $thread -T $time_for_reading postgres

where, time_for_reading = 30mins

non-default GUC params:
shared_buffers=8GB
max_connections=300

pg_wal is located on SSD.

CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
4              2588         2601          0.5023183926
8              5094         5098          0.0785237534
16             10294        10307         0.1262871576
32             19779        19815         0.182011224
64             27908        28346         1.569442454
72             27823        28416         2.131330194
128            28455        28618         0.5728342998
180            26739        26879         0.5235797898
196            27820        27963         0.5140186916
256            28763        28969         0.7161978931

Also, an Excel sheet (results-readwrite-300-SF.xlsx) containing the results for all 3 runs is attached.

results-readwrite-300-SF_EOXACT.xlsx
Description: MS-Excel 2007 spreadsheet
Re: [HACKERS] Should we cacheline align PGXACT?
On Tue, Feb 28, 2017 at 11:44 PM, Simon Riggs wrote:
> On 28 February 2017 at 11:34, Ashutosh Sharma wrote:
>
>> So, Here are the pgbench results I got with
>> 'reduce_pgxact_access_AtEOXact.v2.patch' on a read-write workload.
>
> Thanks for performing a test.
>
> I see a low yet noticeable performance gain across the board on that
> workload.
>
> That is quite surprising to see a gain on that workload. The main workload
> we have been discussing was the full read-only test (-S). For that case the
> effect should be much more noticeable based upon Andres' earlier comments.
>
> Would it be possible to re-run the test using only the -S workload? Thanks
> very much.

Okay, I already had the results for the read-only workload but just forgot to share them along with the results for the read-write test. Here are the results for the read-only test:

CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
4              26322        31259         18.75617354
8              63499        69472          9.406447346
16             181155       186534         2.96928045
32             333504       337533         1.208081462
64             350341       353747         0.9721956608
72             366339       373898         2.063389374
128            443381       478562         7.934710779
180            299875       334118        11.41909129
196            269194       275525         2.351835479
256            220027       235995         7.257291151

The pgbench settings and non-default params are:

pgbench -i -s 300 postgres
pgbench -M prepared -c $thread -j $thread -T $time_for_reading -S postgres

where, time_for_reading = 10mins

non-default params:
shared_buffers=8GB
max_connections=300

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

>> pgbench settings:
>> pgbench -i -s 300 postgres
>> pgbench -M prepared -c $thread -j $thread -T $time_for_reading postgres
>>
>> where, time_for_reading = 30mins
>>
>> non default GUC param
>> shared_buffers=8GB
>> max_connections=300
>>
>> pg_wal is located in SSD.
>>
>> CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
>> 4              2588         2601          0.5023183926
>> 8              5094         5098          0.0785237534
>> 16             10294        10307         0.1262871576
>> 32             19779        19815         0.182011224
>> 64             27908        28346         1.569442454
>> 72             27823        28416         2.131330194
>> 128            28455        28618         0.5728342998
>> 180            26739        26879         0.5235797898
>> 196            27820        27963         0.5140186916
>> 256            28763        28969         0.7161978931
>>
>> Also, Excel sheet (results-readwrite-300-SF.xlsx) containing the results
>> for all the 3 runs is attached.

> --
> Simon Riggs    http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Should we cacheline align PGXACT?
On Wed, Mar 1, 2017 at 5:29 PM, Simon Riggs wrote:
>
> On 1 March 2017 at 04:50, Ashutosh Sharma wrote:
>>
>> On Tue, Feb 28, 2017 at 11:44 PM, Simon Riggs wrote:
>>>
>>> On 28 February 2017 at 11:34, Ashutosh Sharma wrote:
>>>>
>>>> So, Here are the pgbench results I got with
>>>> 'reduce_pgxact_access_AtEOXact.v2.patch' on a read-write workload.
>>>
>>> Thanks for performing a test.
>>>
>>> I see a low yet noticeable performance gain across the board on that
>>> workload.
>>>
>>> That is quite surprising to see a gain on that workload. The main
>>> workload we have been discussing was the full read-only test (-S). For
>>> that case the effect should be much more noticeable based upon Andres'
>>> earlier comments.
>>>
>>> Would it be possible to re-run the test using only the -S workload?
>>> Thanks very much.
>>
>> Okay, I already had the results for read-only workload but just forgot to
>> share it along with the results for read-write test. Here are the results
>> for read-only test,
>
> Thanks for conducting those comparisons.
>
> So we have variable and sometimes significant gains, with no regressions.
>
> By the shape of the results this helps in different places to the alignment
> patch. Do we have evidence to commit both?

Well, we have seen some regression in the read-write test with the pgxact alignment patch [1]. I may need to run the test with both patches applied to see their combined effect on performance.

[1] - https://www.postgresql.org/message-id/CAE9k0Pk%2BrCuNY%2B7O5XwVXHPuki9t8%3DM7jr4kevxw-hdkpFhS2A%40mail.gmail.com
[HACKERS] Parallel seq. plan is not coming against inheritance or partition table
Hi All,

From the following git commit onwards, a parallel seq scan is never getting selected for inheritance or partitioned tables.

commit 51ee6f3160d2e1515ed6197594bda67eb99dc2cc
Author: Robert Haas
Date: Wed Feb 15 13:37:24 2017 -0500

    Replace min_parallel_relation_size with two new GUCs.

Steps to reproduce:
==================
create table t1 (a integer);
create table t1_1 (check (a >= 1 and a < 100)) inherits (t1);
create table t1_2 (check (a >= 100 and a < 200)) inherits (t1);
insert into t1_1 select generate_series(1, 90);
insert into t1_2 select generate_series(100, 190);
analyze t1;
analyze t1_1;
analyze t1_2;
explain analyze select * from t1 where a < 5 OR a > 195;

EXPLAIN ANALYZE output:

1) Prior to the "Replace min_parallel_relation_size with two new GUCs" commit:

postgres=# explain analyze select * from t1 where a < 5 OR a > 195;
                                  QUERY PLAN
---------------------------------------------------------------------------
 Gather  (cost=1000.00..25094.71 rows=48787 width=4) (actual time=0.431..184.264 rows=4 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Append  (cost=0.00..19216.01 rows=20328 width=4) (actual time=0.025..167.465 rows=1 loops=3)
         ->  Parallel Seq Scan on t1  (cost=0.00..0.00 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=3)
               Filter: ((a < 5) OR (a > 195))
         ->  Parallel Seq Scan on t1_1  (cost=0.00..9608.00 rows=20252 width=4) (actual time=0.023..76.644 rows=1 loops=3)
               Filter: ((a < 5) OR (a > 195))
               Rows Removed by Filter: 283334
         ->  Parallel Seq Scan on t1_2  (cost=0.00..9608.01 rows=75 width=4) (actual time=89.505..89.505 rows=0 loops=3)
               Filter: ((a < 5) OR (a > 195))
               Rows Removed by Filter: 30
 Planning time: 0.343 ms
 Execution time: 188.624 ms
(14 rows)

2) From the "Replace min_parallel_relation_size with two new GUCs" commit onwards:

postgres=# explain analyze select * from t1 where a < 5 OR a > 195;
                                  QUERY PLAN
---------------------------------------------------------------------------
 Append  (cost=0.00..34966.01 rows=50546 width=4) (actual time=0.021..375.747 rows=4 loops=1)
   ->  Seq Scan on t1  (cost=0.00..0.00 rows=1 width=4) (actual time=0.004..0.004 rows=0 loops=1)
         Filter: ((a < 5) OR (a > 195))
   ->  Seq Scan on t1_1  (cost=0.00..17483.00 rows=50365 width=4) (actual time=0.016..198.393 rows=4 loops=1)
         Filter: ((a < 5) OR (a > 195))
         Rows Removed by Filter: 850001
   ->  Seq Scan on t1_2  (cost=0.00..17483.01 rows=180 width=4) (actual time=173.310..173.310 rows=0 loops=1)
         Filter: ((a < 5) OR (a > 195))
         Rows Removed by Filter: 91
 Planning time: 0.812 ms
 Execution time: 377.831 ms
(11 rows)

RCA:
====
From the "Replace min_parallel_relation_size with two new GUCs" commit onwards, we are not assigning parallel workers to a child rel with zero heap pages. This means we won't be able to create a partial append path, as that requires all the child rels within an Append node to have a partial path. Please check the following code snippet from set_append_rel_pathlist():

    /* Same idea, but for a partial plan. */
    if (childrel->partial_pathlist != NIL)
        partial_subpaths = accumulate_append_subpath(partial_subpaths,
                                  linitial(childrel->partial_pathlist));
    else
        partial_subpaths_valid = false;
    ...

    /*
     * Consider an append of partial unordered, unparameterized partial
     * paths.
     */
    if (partial_subpaths_valid)
    {
        ...
        /* Generate a partial append path. */
        appendpath = create_append_path(rel, partial_subpaths, NULL,
                                        parallel_workers);
        add_partial_path(rel, (Path *) appendpath);
    }

In short, no Gather path is generated if a baserel having an Append node contains any child rel without a partial path. This means that just because one child rel with zero heap pages didn't get parallel workers, the other child rels that were good candidates for a Parallel Seq Scan had to fall back to a normal seq scan, which could be costlier.

Fix:
====
Attached is the patch that fixes this issue.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

assign_par_workers_for_empty_childRel_v1.patch
Description: invalid/octet-stream
Re: [HACKERS] Parallel seq. plan is not coming against inheritance or partition table
> Right, but OTOH, if we assign parallel workers by default, then it is
> quite possible that it would result in much worse plans. Consider a
> case where partition hierarchy has 1000 partitions and only one of
> them is big enough to allow parallel workers. Now in this case, with
> your proposed fix it will try to scan all the partitions in parallel
> workers which I think can easily result in bad performance.

Right. But there can also be a case where 999 partitions are large and eligible for a parallel seq scan (PSS). In such a case as well, PSS won't be selected.

> I think the right way to make such plans parallel is by using Parallel
> Append node (https://commitfest.postgresql.org/13/987/). Alternatively, if
> you want to force parallelism in cases like the one you have shown in
> example, you can use Alter Table .. Set (parallel_workers = 1).

Okay, I was not aware of Parallel Append. Thanks.

With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Parallel seq. plan is not coming against inheritance or partition table
> I also think that commit didn't intend to change the behavior,
> however, the point is how sensible is it to keep such behavior after
> Parallel Append. I am not sure if this is the right time to consider
> it or shall we wait till Parallel Append is committed.
>
>> I think the problem here is that compute_parallel_worker() thinks it
>> can use 0 as a sentinel value that means "ignore this", but it can't
>> really, because a heap can actually contain 0 pages. Here's a patch
>> which does the following:
>
> - if (heap_pages < (BlockNumber) min_parallel_table_scan_size &&
> - index_pages < (BlockNumber) min_parallel_index_scan_size &&
> - rel->reloptkind == RELOPT_BASEREL)
> + if (rel->reloptkind == RELOPT_BASEREL &&
> + ((heap_pages >= 0 && heap_pages < min_parallel_table_scan_size) ||
> + (index_pages >= 0 && index_pages < min_parallel_index_scan_size)))
> return 0;
>
> The purpose of considering both heap and index pages for
> min_parallel_* is that even if one of them is bigger than the threshold
> then we should try parallelism.

Yes, but this is only true for normal tables, not for partitioned or inheritance tables. I think for a partition table, even if only one heap page exists, we go for parallelism.

> This could be helpful for parallel
> index scans where we consider parallel workers based on both heap and
> index pages. Is there a reason for you to change that condition as
> that is not even related to the problem being discussed?

I think he has changed it to allow parallelism for inheritance or partition tables. For normal tables, the behavior won't change until the below if-condition is satisfied:

if (heap_pages < (BlockNumber) min_parallel_table_scan_size &&
    index_pages < (BlockNumber) min_parallel_index_scan_size &&
    rel->reloptkind == RELOPT_BASEREL)
    return 0;

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] increasing the default WAL segment size
Hi,

I took a look at this patch. Overall, the patch looks good to me. However, there are some review comments that I would like to share:

1. I think the macro 'PATH_MAX' used in the pg_waldump.c file is platform-specific and not available everywhere (e.g. on Windows). It needs to be changed to some constant value, or maybe MAXPGPATH, in order to make it platform-independent.

2. As already mentioned by Jim and Kuntal upthread, you are trying to detect the configured WAL segment size in the pg_waldump.c and pg_standby.c files based on the size of a random WAL file, which doesn't look like a good idea. But then I think the only option we have is to pass the location of the pg_control file to the pg_waldump module along with the start and end WAL segments.

3. When trying to compile '02-initdb-walsegsize-v2.patch' on Windows, I got this warning message:

Warning 1: warning C4005: 'DEFAULT_XLOG_SEG_SIZE': macro redefinition
c:\users\ashu\postgresql\src\include\pg_config_manual.h(20)

Apart from these, I am not having any comments as of now. I am still validating the patch on Windows. If I find any issues, I will update you.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Tue, Feb 28, 2017 at 10:36 AM, Beena Emerson wrote:
> On Tue, Feb 28, 2017 at 9:45 AM, Jim Nasby wrote:
>> On 2/24/17 6:30 AM, Kuntal Ghosh wrote:
>>> * You're considering any WAL file with a power of 2 as valid. Suppose,
>>> the correct WAL seg size is 64mb. For some reason, the server
>>> generated a 16mb invalid WAL file (maybe it crashed while creating the
>>> WAL file). Your code seems to treat this as a valid file which I think
>>> is incorrect. Do you agree with that?
>>
>> Detecting correct WAL size based on the size of a random WAL file seems
>> like a really bad idea to me.
>>
>> I also don't see the reason for #2... or is that how initdb writes out
>> the correct control file?
>
> The initdb module reads the size from the option provided and sets the
> environment variable.
> This variable is read in src/backend/access/transam/xlog.c and the
> ControlFile written.
> Unlike pg_resetwal and pg_rewind, pg_basebackup cannot access the Control
> file. It only accesses the wal log folder. So we get the XLogSegSize from
> the SHOW command using replication connection.
> As Kuntal pointed out, I might need to set it from pg_receivewal.c as well.
>
> Thank you,
>
> Beena Emerson
>
> EnterpriseDB: https://www.enterprisedb.com/
> The Enterprise PostgreSQL Company
Re: [HACKERS] Parallel seq. plan is not coming against inheritance or partition table
On Tue, Mar 7, 2017 at 11:18 AM, Amit Kapila wrote:
> On Tue, Mar 7, 2017 at 10:22 AM, Ashutosh Sharma wrote:
>>> I also think that commit didn't intend to change the behavior,
>>> however, the point is how sensible is it to keep such behavior after
>>> Parallel Append. I am not sure if this is the right time to consider
>>> it or shall we wait till Parallel Append is committed.
>>>
>>>> I think the problem here is that compute_parallel_worker() thinks it
>>>> can use 0 as a sentinel value that means "ignore this", but it can't
>>>> really, because a heap can actually contain 0 pages. Here's a patch
>>>> which does the following:
>>>
>>> - if (heap_pages < (BlockNumber) min_parallel_table_scan_size &&
>>> - index_pages < (BlockNumber) min_parallel_index_scan_size &&
>>> - rel->reloptkind == RELOPT_BASEREL)
>>> + if (rel->reloptkind == RELOPT_BASEREL &&
>>> + ((heap_pages >= 0 && heap_pages < min_parallel_table_scan_size) ||
>>> + (index_pages >= 0 && index_pages < min_parallel_index_scan_size)))
>>> return 0;
>>>
>>> The purpose of considering both heap and index pages for
>>> min_parallel_* is that even if one of them is bigger than threshold
>>> then we should try parallelism.
>>
>> Yes, But, this is only true for normal tables not for partitioned or
>> inheritance tables. I think for partition table, even if one heap page
>> exist, we go for parallelism.
>>
>>> This could be helpful for parallel
>>> index scans where we consider parallel workers based on both heap and
>>> index pages. Is there a reason for you to change that condition as
>>> that is not even related to the problem being discussed?
>>
>> I think he has changed to allow parallelism for inheritance or partition
>> tables. For normal tables, it won't be touched until the below
>> if-condition is satisfied.
>>
>> if (heap_pages < (BlockNumber) min_parallel_table_scan_size &&
>>     index_pages < (BlockNumber) min_parallel_index_scan_size &&
>>     rel->reloptkind == RELOPT_BASEREL)
>>     return 0;
>
> AFAICS, the patch has changed the if-condition you are quoting.

Yes, it has slightly changed the if-condition, or I would say it has corrected it. With the new condition in the patch, we first check whether reloptkind is of the BASEREL type before checking whether there are enough heap or index pages to make parallelism worthwhile. For a partition table, 'rel->reloptkind == RELOPT_BASEREL' fails, hence the further conditions won't be checked. I think the most important thing the patch fixes is passing index_pages as '-1' to compute_parallel_worker(), since 'min_parallel_index_scan_size' can be set to zero by the user.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Supporting huge pages on Windows
Hi,

I tried to test the v8 version of the patch. Firstly, I was able to start the postgresql server process with 'huge_pages' set to on. I had to follow the instructions given in MSDN [1] to enable the 'lock pages in memory' option and also had to start the postgresql server process as an admin user.

test=# show huge_pages;
 huge_pages
------------
 on
(1 row)

To start with, I ran the regression test suite and didn't find any failures. However, I am not sure whether huge pages are actually getting used; the huge_pages setting simply shows 'on'. I am assuming that if huge pages could not be used due to a shortage of large pages, the server would have fallen back to non-huge pages.

I also ran pgbench tests on a read-only workload, and here are the results I got:

pgbench -c 4 -j 4 -T 600 bench

huge_pages=on,  TPS = 21120.768085
huge_pages=off, TPS = 20606.288995

[1] - https://msdn.microsoft.com/en-IN/library/ms190730.aspx

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Thu, Feb 23, 2017 at 12:59 PM, Tsunakawa, Takayuki wrote:
> From: Amit Kapila [mailto:amit.kapil...@gmail.com]
>> > Hmm, the large-page requires contiguous memory for each page, so this
>> > error could occur on a long-running system where the memory is heavily
>> > fragmented. For example, please see the following page and check the
>> > memory with RAMMap program referred there.
>>
>> I don't have RAMMap and it might take some time to investigate what is
>> going on, but I think in such a case even if it works we should keep the
>> default value of huge_pages as off on Windows. I request somebody else
>> having access to Windows m/c to test this patch and if it works then we
>> can move forward.
>
> You are right. I modified the patch so that the code falls back to the
> non-huge page when CreateFileMapping() fails due to the shortage of large
> pages. That's what the Linux version does.
> The other change is to parameterize the Win32 function names in the
> messages in EnableLockPagePrivileges(). This is to avoid adding almost
> identical messages unnecessarily. I followed Alvaro's comment. I didn't
> touch the two existing sites that embed Win32 function names. I'd like to
> leave it up to the committer to decide whether to change them as well,
> because changing them might make it a bit harder to apply some bug fixes
> to earlier releases.
>
> FYI, I could reproduce the same error as Amit on 32-bit Win7, where the
> total RAM is 3.5 GB and available RAM is 2 GB. I used the attached
> largepage.c. Immediately after the system boot, I could only allocate 8
> large pages. When I first tried to allocate 32 large pages, the test
> program produced:
>
> large page size = 2097152
> allocating 32 large pages...
> CreateFileMapping failed: error code = 1450
>
> You can build the test program as follows:
>
> cl largepage.c advapi32.lib
>
> Regards
> Takayuki Tsunakawa
Re: [HACKERS] Should we cacheline align PGXACT?
Hi,

> I agree. Probably Simon's patch for reducing pgxact access could negate
> regression in pgxact alignment patch.
> Ashutosh, could you please run read-write and read-only tests with both
> these patches applied?

I already had the results with both the patches applied. But, as I was not quite able to understand how Simon's patch for reducing pgxact access could negate the regression on the read-write workload that we saw with the pgxact-align-3 patch earlier, I was slightly hesitant to share the results. Anyway, here are the results with the combined patches on read-only and read-write workloads.

1) Results for the read-only workload:
======================================
pgbench -i -s 300 postgres
pgbench -M prepared -c $thread -j $thread -T $time_for_reading -S postgres

where, time_for_reading = 10mins

non-default params:
shared_buffers=8GB
max_connections=300

CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
4              36333        36835           1.381664052
8              70179        72496           3.301557446
16             169303       175743          3.803831001
32             328837       341093          3.727074508
64             363352       399847         10.04397939
72             372011       413437         11.13569222
128            443979       578355         30.26629638
180            321420       552373         71.85396055
196            276780       558896        101.927885
256            234563       568951        142.5578629

2) Results for the read-write workload:
=======================================
pgbench -i -s 300 postgres
pgbench -M prepared -c $thread -j $thread -T $time_for_reading postgres

where, time_for_reading = 30mins

non-default params:
shared_buffers=8GB
max_connections=300

CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
4              2683         2690          0.2609019754
8              5321         5332          0.2067280586
16             10348        10387         0.3768844221
32             19446        19754         1.58387329
64             28178        28198         0.0709773582
72             28296        28639         1.212185468
128            28577        28600         0.0804843056
180            26665        27525         3.225201575
196            27628        28511         3.19603301
256            28467        28529         0.2177960445

HEAD here refers to the following git commit in the master branch:

commit 5dbdb2f799232cb1b6df7d7a85d59ade3234d30c
Author: Robert Haas
Date: Fri Feb 24 12:21:46 2017 +0530

    Make tablesample work with partitioned tables.
This was an oversight in the original partitioning commit. Amit Langote, reviewed by David Fetter -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: [HACKERS] Page Scan Mode in Hash Index
Hi, > > I've assigned to review this patch. > At first, I'd like to notice that I like idea and general design. > Secondly, patch set don't apply cleanly to master. Please, rebase it. Thanks for showing your interest in this patch. Please note that this patch depends on the patches for 'Write Ahead Logging in hash index' [1] and 'Microvacuum support in hash index' [2]. Hence, until the above two patches become stable, I may have to keep rebasing this patch. However, I will try to share the updated patch with you as soon as possible. > > > On Tue, Feb 14, 2017 at 8:27 AM, Ashutosh Sharma > wrote: >> >> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this >> patch rewrites the hash index scan module to work in page-at-a-time >> mode. It basically introduces two new functions-- _hash_readpage() and >> _hash_saveitem(). The former is used to load all the qualifying tuples >> from a target bucket or overflow page into an items array. The latter >> one is used by _hash_readpage to save all the qualifying tuples found >> in a page into an items array. Apart from that, this patch basically >> cleans _hash_first(), _hash_next() and hashgettuple(). > > > I see that forward and backward scan cases of function _hash_readpage contain > a lot of code duplication > Could you please refactor this function to have less code duplication? Sure, I will try to avoid the code duplication as much as possible. > > Also, I wonder if you have a special idea behind inserting data in test.sql > by 1002 separate SQL statements. > INSERT INTO con_hash_index_table (keycol) SELECT a FROM GENERATE_SERIES(1, > 1000) a; > > You can achieve the same result by execution of single SQL statement. > INSERT INTO con_hash_index_table (keycol) SELECT (a - 1) % 1000 + 1 FROM > GENERATE_SERIES(1, 1002000) a; > Unless you have some special idea of doing this in 1002 separate transactions. There is no reason for having so many INSERT statements in the test.sql file.
I think it would be better to replace it with a single SQL statement. Thanks. [1]- https://www.postgresql.org/message-id/CAA4eK1KibVzgVETVay0%2BsiVEgzaXnP5R21BdWiK9kg9wx2E40Q%40mail.gmail.com [2]- https://www.postgresql.org/message-id/CAE9k0PkRSyzx8dOnokEpUi2A-RFZK72WN0h9DEMv_ut9q6bPRw%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: [HACKERS] PATCH: pageinspect / add page_checksum and bt_page_items(bytea)
Hi, I had a look into this patch and would like to share some of my review comments that require the author's attention. 1) The comment for page_checksum() needs to be corrected. It seems to have been copied from page_header() and not edited further. /* * page_header * * Allows inspection of page header fields of a raw page */ PG_FUNCTION_INFO_V1(page_checksum); Datum page_checksum(PG_FUNCTION_ARGS) 2) It seems like you have chosen the wrong datatype for page_checksum. I am getting a negative checksum value when trying to run the query below. I think the return type of the SQL function page_checksum should be 'integer' instead of 'smallint'. postgres=# SELECT page_checksum(get_raw_page('pg_class', 0), 100); page_checksum --- -19532 (1 row) The above problem also exists in the case of page_header; I am facing a similar problem with page_header as well. postgres=# SELECT page_header(get_raw_page('pg_class', 0)); page_header - (0/2891538,-28949,1,220,7216,8192,8192,4,0) (1 row) 3) Any reason for not marking bt_page_items or page_checksum as PARALLEL_SAFE? 4) Error messages in the new bt_page_items are not consistent with the old bt_page_items. For example, if I pass a meta page block number as input to bt_page_items, the error messages thrown by the new and old bt_page_items are different. postgres=# SELECT * FROM bt_page_items('btree_index', 0) LIMIT 1; ERROR: block 0 is a meta page postgres=# SELECT * FROM bt_page_items(get_raw_page('btree_index', 0)) LIMIT 1; ERROR: block is a meta page postgres=# SELECT * FROM bt_page_items(get_raw_page('btree_index', 1024)) LIMIT 1; ERROR: block number 1024 is out of range for relation "btree_index" postgres=# SELECT * FROM bt_page_items('btree_index', 1024) LIMIT 1; ERROR: block number out of range 5) Code duplication between bt_page_items() and bt_page_items_bytea() needs to be handled.
-- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Tue, Mar 7, 2017 at 9:39 PM, Peter Eisentraut wrote: > On 3/6/17 16:33, Tomas Vondra wrote: >>> I think it would be better not to maintain so much duplicate code >>> between bt_page_items(text, int) and bt_page_items(bytea). How about >>> just redefining bt_page_items(text, int) as an SQL-language function >>> calling bt_page_items(get_raw_page($1, $2))? >>> >> >> Maybe. We can also probably share the code at the C level, so I'll look >> into that. > > I think SQL would be easier, but either way some reduction in > duplication would be good. > >>> For page_checksum(), the checksums are internally unsigned, so maybe >>> return type int on the SQL level, so that there is no confusion about >>> the sign? >>> >> >> That ship already sailed, I'm afraid. We already have checksum in the >> page_header() output, and it's defined as smallint there. So using int >> here would be inconsistent. > > OK, no worries then. > > -- > Peter Eisentraut http://www.2ndQuadrant.com/ > PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] WAL Consistency checking for hash indexes
Couple of review comments,, You may also need to update the documentation as now we are also going to support wal consistency check for hash index. The current documentation does not include hash index. +only records originating from those resource managers. Currently, +the supported resource managers are heap, +heap2, btree, gin, +gist, sequence, spgist, +brin, and generic. Only Following comment in hash_mask() may require changes if patch for 'Microvacuum support for Hash Index - [1]' gets committed. + /* +* In hash bucket and overflow pages, it is possible to modify the +* LP_FLAGS without emitting any WAL record. Hence, mask the line +* pointer flags. +* See hashgettuple() for details. +*/ [1] - https://www.postgresql.org/message-id/CAE9k0PmXyQpHX8%3DL_hFV7HfPV8qrit19xoUB86waQ87rKYzmYQ%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Wed, Mar 8, 2017 at 2:32 PM, Kuntal Ghosh wrote: > On Fri, Mar 3, 2017 at 9:44 AM, Amit Kapila wrote: >> On Tue, Feb 28, 2017 at 11:06 AM, Kuntal Ghosh >> wrote: >>> Hello everyone, >>> >>> I've attached a patch which implements WAL consistency checking for >>> hash indexes. This feature is going to be useful for developing and >>> testing of WAL logging for hash index. >>> >> >> 2. >> + else if ((opaque->hasho_flag & LH_BUCKET_PAGE) || >> + (opaque->hasho_flag & LH_OVERFLOW_PAGE)) >> + { >> + /* >> + * In btree bucket and overflow pages, it is possible to modify the >> + * LP_FLAGS without emitting any WAL record. Hence, mask the line >> + * pointer flags. >> + * See hashgettuple() for details. >> + */ >> + mask_lp_flags(page); >> + } >> >> Again, this mechanism is also modified by patch "Microvacuum support >> for hash index", so above changes needs to be adjusted accordingly. >> Comment referring to btree is wrong, you need to refer hash. > I've corrected the text in the comment and re-based the patch on the > latest hash index patch for WAL logging[1]. 
As discussed in the > thread, Microvacuum patch can be re-based on top of this patch. > > > [1] > https://www.postgresql.org/message-id/CAA4eK1%2BmvCucroWQwX3S7aBR%3D0yBJGF%2BjQz4x4Cx9QVsMFTZUw%40mail.gmail.com > -- > Thanks & Regards, > Kuntal Ghosh > EnterpriseDB: http://www.enterprisedb.com > > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers > -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: pageinspect / add page_checksum and bt_page_items(bytea)
Hi, > >> 2) It seems like you have chosen the wrong datatype for page_checksum. I >> am getting a negative checksum value when trying to run the query below. I >> think the return type for the SQL function page_checksum should be >> 'integer' instead of 'smallint'. >> >> postgres=# SELECT page_checksum(get_raw_page('pg_class', 0), 100); >> page_checksum >> --- >> -19532 >> (1 row) >> >> The above problem also exists in the case of page_header. I am facing a similar >> problem with page_header as well. >> >> postgres=# SELECT page_header(get_raw_page('pg_class', 0)); >> page_header >> - >> (0/2891538,-28949,1,220,7216,8192,8192,4,0) >> (1 row) >> > > No. This is consistent with page_header() which is also using smallint for > the checksum value. > Yes. But, as I said earlier, I am getting a negative checksum value for page_header as well. Isn't that wrong? For example, when I debug the following query, I can see the pd_checksum value as '40074' in gdb, whereas page_header shows it as '-25462'. SELECT page_header(get_raw_page('pg_class', 0)); (gdb) p page->pd_checksum $2 = 40074 postgres=# SELECT page_header(get_raw_page('pg_class', 0)); page_header - (0/304EDE0,-25462,1,220,7432,8192,8192,4,0) (1 row) I think pd_checksum in PageHeaderData is defined as uint16 (0 to 65,535) whereas in the SQL function for page_header it is defined as smallint (-32768 to +32767). -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com
Re: [HACKERS] Microvacuum support for Hash Index
Hi, Attached is the v6 patch for microvacuum in hash index rebased on top of 'v10 patch for WAL in hash index - [1]' and 'v1 patch for WAL consistency check for hash index - [2]'. [1] - https://www.postgresql.org/message-id/CAA4eK1%2Bk5wR4-kAjPqLoKemuHayQd6RkQQT9gheTfpn%2B72o1UA%40mail.gmail.com [2] - https://www.postgresql.org/message-id/flat/cagz5qcjlerun_zoo0edv6_y_d0o4tntmper7ivtlbg4rurj...@mail.gmail.com#cagz5qcjlerun_zoo0edv6_y_d0o4tntmper7ivtlbg4rurj...@mail.gmail.com Also, the patch (mask_hint_bit_LH_PAGE_HAS_DEAD_TUPLES.patch) to mask 'LH_PAGE_HAS_DEAD_TUPLES' flag which got added as a part of Microvacuum patch is attached with this mail. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com On Wed, Feb 1, 2017 at 10:30 AM, Michael Paquier wrote: > On Sat, Jan 28, 2017 at 8:02 PM, Amit Kapila wrote: >> On Fri, Jan 27, 2017 at 5:15 PM, Ashutosh Sharma >> wrote: >>>> >>>> Don't you think we should try to identify the reason of the deadlock >>>> error reported by you up thread [1]? I know that you and Ashutosh are >>>> not able to reproduce it, but still I feel some investigation is >>>> required to find the reason. It is quite possible that the test case >>>> is such that the deadlock is expected in rare cases, if that is the >>>> case then it is okay. I have not spent enough time on that to comment >>>> whether it is a test or code issue. >>> >>> I am finally able to reproduce the issue using the attached script >>> file (deadlock_report). Basically, once I was able to reproduce the >>> issue with hash index I also thought of checking it with a non unique >>> B-Tree index and was able to reproduce it with B-Tree index as well. >>> This certainly tells us that there is nothing wrong at the code level >>> rather there is something wrong with the test script which is causing >>> this deadlock issue. Well, I have ran pgbench with two different >>> configurations and my observations are as follows: >>> >>> 1) With Primary keys i.e. 
unique values: I have never encountered >>> a deadlock issue with this configuration. >>> >>> 2) With non-unique indexes (be it hash or B-Tree): I have seen >>> deadlock many times with this configuration. Basically when we have >>> non-unique values associated with a column then there is a high >>> probability that multiple records will get updated with a single >>> 'UPDATE' statement and when there are a huge number of backends trying >>> to do so there is a high chance of getting a deadlock, which I assume is >>> expected behaviour in a database. >>> >> >> I agree with your analysis, surely trying to update multiple rows with >> same values from different backends can lead to deadlock. > > Moved that to CF 2017-03. > -- > Michael

Attachments: microvacuum_hash_index_v6.patch, mask_hint_bit_LH_PAGE_HAS_DEAD_TUPLES.patch
Re: [HACKERS] Page Scan Mode in Hash Index
Hi, >> I've assigned to review this patch. >> At first, I'd like to notice that I like idea and general design. >> Secondly, patch set don't apply cleanly to master. Please, rebase it. > > > Thanks for showing your interest towards this patch. I would like to > inform that this patch has got dependency on patch for 'Write Ahead > Logging in hash index - [1]' and 'Microvacuum support in hash index > [1]'. Hence, until above two patches becomes stable I may have to keep > on rebasing this patch. However, I will try to share you the updated > patch asap. > >> >> >> On Tue, Feb 14, 2017 at 8:27 AM, Ashutosh Sharma >> wrote: >>> >>> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this >>> patch rewrites the hash index scan module to work in page-at-a-time >>> mode. It basically introduces two new functions-- _hash_readpage() and >>> _hash_saveitem(). The former is used to load all the qualifying tuples >>> from a target bucket or overflow page into an items array. The latter >>> one is used by _hash_readpage to save all the qualifying tuples found >>> in a page into an items array. Apart from that, this patch bascially >>> cleans _hash_first(), _hash_next and hashgettuple(). >> >> >> I see that forward and backward scan cases of function _hash_readpage >> contain a lot of code duplication >> Could you please refactor this function to have less code duplication? > > Sure, I will try to avoid the code duplication as much as possible. I had close look into hash_readpage() function and could see that there are only few if-else conditions which are similar for both forward and backward scan cases and can't be optimised further. However, If you have a cursory look into this function definition it looks like the code for forward and backward scan are exactly the same but that's not the case. Attached is the diff report (hash_readpage.html) for forward and backward scan code used in hash_readpage(). 
This shows which lines in hash_readpage() are the same and which differ. Please note that before applying the patches for page scan mode in hash index, you may need to first apply the following patches on HEAD: 1) v10 patch for WAL in hash index - [1] 2) v1 patch for WAL consistency check for hash index - [2] 3) v6 patch for microvacuum support in hash index - [3] [1] - https://www.postgresql.org/message-id/CAA4eK1%2Bk5wR4-kAjPqLoKemuHayQd6RkQQT9gheTfpn%2B72o1UA%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAGz5QCKPU2qX75B1bB_LuEC88xWZa5L55J0TLvYMVD8noSH3pA%40mail.gmail.com [3] - https://www.postgresql.org/message-id/CAE9k0PkYpAPDJBfgia08o7XhO8nypH9WoO9M8%3DdqLrwwObXKcw%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com

[Attachment: hash_readpage.html — a side-by-side comparison of the forward and backward scan code in hash_readpage()]
Re: [HACKERS] PATCH: pageinspect / add page_checksum and bt_page_items(bytea)
On Mar 14, 2017 5:37 PM, "Alvaro Herrera" wrote: Ashutosh Sharma wrote: > Yes. But, as I said earlier, I am getting a negative checksum value for > page_header as well. Isn't that wrong? For example, when I debug the > following query, I can see the pd_checksum value as '40074' in gdb whereas > page_header shows it as '-25462'. Yes; the point is that this is a pre-existing problem, not one being introduced with this patch. In SQL we don't have unsigned types, and we have historically chosen a type with the same width instead of one with enough width to represent all possible unsigned values. IMO this patch is doing the right thing; if in the future we want to change that decision, that's fine, but it's a separate patch. Okay, understood. Thank you. With Regards, Ashutosh Sharma
Re: [HACKERS] Microvacuum support for Hash Index
> Generally, this patch looks like a pretty straightforward adaptation > of the similar btree mechanism to hash indexes, so if it works for > btree it ought to work here, too. But I noticed a few things while > reading through it. > > +/* Get RelfileNode from relation OID */ > +rel = relation_open(htup->t_tableOid, NoLock); > +rnode = rel->rd_node; > +relation_close(rel, NoLock); > itup->t_tid = htup->t_self; > -_hash_doinsert(index, itup); > +_hash_doinsert(index, itup, rnode); > > This is an awfully low-level place to be doing something like this. > I'm not sure exactly where this should be happening, but not in the > per-tuple callback. Okay, Now I have done this inside _hash_doinsert() instead of callback function. Please have a look into the attached v7 patch. > > +/* > + * If there's nothing running on the standby we don't need to derive a > + * full latestRemovedXid value, so use a fast path out of here. This > + * returns InvalidTransactionId, and so will conflict with all HS > + * transactions; but since we just worked out that that's zero people, > + * it's OK. > + */ > +if (CountDBBackends(InvalidOid) == 0) > +return latestRemovedXid; > > I see that this comment (and most of what surrounds it) was just > copied from the existing btree example, but isn't there a discrepancy > between the comment and the code? It says it returns > InvalidTransactionId, but it doesn't. Also, you dropped the XXX from > the btree original, and the following reachedConsistency check. It does return InvalidTransactionId if there are no backends running across any database in the standby. As shown below 'latestRemovedXid' is initialised with InvalidTransactionId, TransactionId latestRemovedXid = InvalidTransactionId; So, If there are no backend processes running across any databases in standby latestRemovedXid will be returned as it is. I have also added the note in XXX in above comment. Please check v7 patch attached with this mail. 
> > + * Hash Index delete records can conflict with standby queries.You might > + * think that vacuum records would conflict as well, but we've handled > > But they're not called delete records in a hash index. The function > is called hash_xlog_vacuum_one_page. The corresponding btree function > is btree_xlog_delete. So this comment needs a little more updating. Okay, I have tried to rephrase it to avoid the confusion. > > +if (IsBufferCleanupOK(bucket_buf)) > +{ > +_hash_vacuum_one_page(rel, metabuf, buf, bucket_buf, > + (buf == bucket_buf) ? true : false, > + hnode); > +if (bucket_buf != buf) > +LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK); > + > +if (PageGetFreeSpace(page) >= itemsz) > +break; /* OK, now we have enough space */ > +} > > I might be missing something, but I don't quite see why this needs a > cleanup lock on the primary bucket page. I would think a cleanup lock > on the page we're actually modifying would suffice, and I think if > that's correct it would be a better way to go. If that's not correct, > then I think the comments needs some work. Thanks for that suggestion. I spent a lot of time thinking about this and also had a small discussion with Amit, but could not find any issue with taking the cleanup lock on the modified page instead of the primary bucket page. I had to make some decent code changes for this. The attached v7 patch has the changes. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com

Attachment: microvacuum_hash_index_v7.patch
Re: [HACKERS] Microvacuum support for Hash Index
> > I think one possibility is to get it using > indexrel->rd_index->indrelid in _hash_doinsert(). > Thanks. I have tried the same in the v7 patch shared upthread. > >> >> But they're not called delete records in a hash index. The function >> is called hash_xlog_vacuum_one_page. The corresponding btree function >> is btree_xlog_delete. So this comment needs a little more updating. >> >> +if (IsBufferCleanupOK(bucket_buf)) >> +{ >> +_hash_vacuum_one_page(rel, metabuf, buf, bucket_buf, >> + (buf == bucket_buf) ? true : false, >> + hnode); >> +if (bucket_buf != buf) >> +LockBuffer(bucket_buf, BUFFER_LOCK_UNLOCK); >> + >> +if (PageGetFreeSpace(page) >= itemsz) >> +break; /* OK, now we have enough space */ >> +} >> >> I might be missing something, but I don't quite see why this needs a >> cleanup lock on the primary bucket page. I would think a cleanup lock >> on the page we're actually modifying would suffice, and I think if >> that's correct it would be a better way to go. >> > > Offhand, I also don't see any problem with it. I too found no problem with that... > > Few other comments: > 1. > + if (ndeletable > 0) > + { > + /* No ereport(ERROR) until changes are logged */ > + START_CRIT_SECTION(); > + > + PageIndexMultiDelete(page, deletable, ndeletable); > + > + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); > + pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES; > > You clearing this flag while logging the action, but same is not taken > care during replay. Any reasons? That's because we conditionally WAL Log this flag status and when we do so, we take a it's FPI. > > 2. > + /* No ereport(ERROR) until changes are logged */ > + START_CRIT_SECTION > (); > + > + PageIndexMultiDelete(page, deletable, ndeletable); > + > + pageopaque = > (HashPageOpaque) PageGetSpecialPointer(page); > + pageopaque->hasho_flag &= > ~LH_PAGE_HAS_DEAD_TUPLES; > + > + /* > + * Write-lock the meta page so that we can > decrement > + * tuple count. 
> + */ > + LockBuffer(metabuf, > BUFFER_LOCK_EXCLUSIVE); > > The lock on buffer should be acquired before critical section. Point taken. I have taken care of it in the v7 patch. > > 3. > - > /* > - * Since this can be redone later if needed, mark as a > hint. > - */ > - MarkBufferDirtyHint(buf, true); > + > if (so->numKilled < MaxIndexTuplesPerPage) > + { > + so- >>killedItems[so->numKilled].heapTid = so->hashso_heappos; > + so- >>killedItems[so->numKilled].indexOffset = > + > ItemPointerGetOffsetNumber(&(so->hashso_curpos)); > + so->numKilled++; > + > } > > By looking at above code, the first thing that comes to mind is when > numKilled can become greater than MaxIndexTuplesPerPage and why we are > ignoring the marking of dead tuples when it becomes greater than > MaxIndexTuplesPerPage. After looking at similar btree code, I realize > that it could > happen if user reverses the scan direction. I think you should > mention in comments that see btgettuple to know the reason of > numKilled overun test or something like that. Added comment. Please refer to v7 patch. > > 4. > + * We match items by heap TID before assuming they are the right ones to > + * delete. If an item has > moved off the current page due to a split, we'll > + * fail to find it and do nothing (this is not an > error case --- we assume > + * the item will eventually get marked in a future indexscan). > + */ > +void > +_hash_kill_items(IndexScanDesc scan) > > I think this comment doesn't make much sense for hash index because > item won't move off from the current page due to split, only later > cleanup can remove it. Yes. The reason being, no cleanup will happen when scan in progress. Corrected it . > > 5. > > + > /* > * Maximum size of a hash index item (it's okay to have only one per page) > > Spurious white space change. Fixed. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
On Wed, Mar 15, 2017 at 9:28 PM, Robert Haas wrote: > On Wed, Mar 15, 2017 at 11:37 AM, Ashutosh Sharma > wrote: >>> +/* Get RelfileNode from relation OID */ >>> +rel = relation_open(htup->t_tableOid, NoLock); >>> +rnode = rel->rd_node; >>> +relation_close(rel, NoLock); >>> itup->t_tid = htup->t_self; >>> -_hash_doinsert(index, itup); >>> +_hash_doinsert(index, itup, rnode); >>> >>> This is an awfully low-level place to be doing something like this. >>> I'm not sure exactly where this should be happening, but not in the >>> per-tuple callback. >> >> Okay, Now I have done this inside _hash_doinsert() instead of callback >> function. Please have a look into the attached v7 patch. > > In the btree case, the heap relation isn't re-opened from anywhere in > the btree code. I think we should try to do the same thing here. If > we add an argument for the heap relation to _hash_doinsert(), > hashinsert() can easily it down; it's already got that value > available. There are two other calls to _hash_doinsert: > > 1. _h_indexbuild() calls _hash_doinsert(). It's called only from > hashbuild(), which has the heap relation available. So we can just > add that as an extra argument to _h_indexbuild() and then from there > pass it to _hash_doinsert. > > 2. hashbuildCallback calls _hash_doinsert(). It's sixth argument is a > HashBuildState which is set up by hashbuild(), which has the heap > relation available. So we can just add an extra member to the > HashBuildState and have hashbuild() set it before calling > IndexBuildHeapScan. hashbuildCallback can then fish it out of the > HashBuildState and pass it to _hash_doinsert(). Okay, I have done the changes as suggested by you. Please refer to the attached v8 patch. Thanks. 
-- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com microvacuum_hash_index_v8.patch Description: application/download -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
> +action = XLogReadBufferForRedo(record, 0, &buffer); > + > +if (!IsBufferCleanupOK(buffer)) > +elog(PANIC, "hash_xlog_vacuum_one_page: failed to acquire > cleanup lock"); > > That could fail, I think, because of a pin from a Hot Standby backend. > You want to call XLogReadBufferForRedoExtended() with a third argument > of true. Yes, there is a possibility that a new backend may start in the standby after we kill the conflicting backends. If the new backend has a pin on the buffer which the startup process is trying to read, then 'IsBufferCleanupOK' will fail, thereby causing the startup process to PANIC. > Come to think of it, shouldn't hash_xlog_split_allocate_page be changed > the same way? No, the reason being we are trying to allocate a new bucket page on the standby, so there can't be any backend which could have a pin on a page that is yet to be initialized. > > +/* > + * Let us mark the page as clean if vacuum removes the DEAD > tuples > + * from an index page. We do this by clearing > LH_PAGE_HAS_DEAD_TUPLES > + * flag. > + */ > > Maybe add: Clearing this flag is just a hint; replay won't redo this. Added. Please check the attached v9 patch. > > + * Hash Index records that are marked as LP_DEAD and being removed during > + * hash index tuple insertion can conflict with standby queries.You might > > The word Index shouldn't be capitalized here. There should be a space > before "You". Corrected. > > The formatting of this comment is oddly narrow: > > + * _hash_vacuum_one_page - vacuum just one index page. > + * Try to remove LP_DEAD items from the given page. We > + * must acquire cleanup lock on the page being modified > + * before calling this function. > > I'd add a blank line after the first line and reflow the rest to be > something more like 75 characters. pgindent evidently doesn't think > this needs reformatting, but it's oddly narrow. Corrected.
> > I suggest changing the header comment of > hash_xlog_vacuum_get_latestRemovedXid like this: > > + * Get the latestRemovedXid from the heap pages pointed at by the index > + * tuples being deleted. See also btree_xlog_delete_get_latestRemovedXid, > + * on which this function is based. > > This is looking good. Changed as per suggestions. Attached v9 patch. Thanks. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com microvacuum_hash_index_v9.patch Description: application/download -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Microvacuum support for Hash Index
On Mar 16, 2017 7:49 AM, "Robert Haas" wrote: On Wed, Mar 15, 2017 at 4:31 PM, Robert Haas wrote: > On Wed, Mar 15, 2017 at 3:54 PM, Ashutosh Sharma wrote: >> Changed as per suggestions. Attached v9 patch. Thanks. > > Wow, when do you sleep? Will have a look. Committed with a few corrections. Thanks Robert for the commit. Thank you Amit and Jesper for reviewing this patch. With Regards, Ashutosh Sharma
Re: [HACKERS] Microvacuum support for Hash Index
On Thu, Mar 16, 2017 at 11:11 AM, Amit Kapila wrote: > On Wed, Mar 15, 2017 at 9:23 PM, Ashutosh Sharma > wrote: >> >>> >>> Few other comments: >>> 1. >>> + if (ndeletable > 0) >>> + { >>> + /* No ereport(ERROR) until changes are logged */ >>> + START_CRIT_SECTION(); >>> + >>> + PageIndexMultiDelete(page, deletable, ndeletable); >>> + >>> + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); >>> + pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES; >>> >>> You clearing this flag while logging the action, but same is not taken >>> care during replay. Any reasons? >> >> That's because we conditionally WAL-log this flag status and when we >> do so, we take its FPI. >> > > Sure, but we are not clearing it conditionally. I am not sure how, > after recovery, it will be cleared if it gets set during normal operation. > Moreover, btree already clears the similar flag during replay (refer > btree_xlog_delete). You were right. If data checksums are enabled or wal_log_hints is set to true, 'LH_PAGE_HAS_DEAD_TUPLES' will get WAL-logged and therefore needs to be cleared on the standby as well. Attached is the patch that clears this flag on the standby during replay. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com

Attachment: 0001-Reset-LH_PAGE_HAS_DEAD_TUPLES-flag-on-standby-when.patch
Re: [HACKERS] WAL Consistency checking for hash indexes
Hi, Attached is the patch that allows the WAL consistency tool to mask the 'LH_PAGE_HAS_DEAD_TUPLES' flag in hash indexes. The flag was added as part of the 'Microvacuum support for Hash index' patch. I have already tested it using Kuntal's WAL consistency tool and it works fine. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com On Wed, Mar 15, 2017 at 11:27 AM, Kuntal Ghosh wrote: > On Wed, Mar 15, 2017 at 12:32 AM, Robert Haas wrote: >> On Mon, Mar 13, 2017 at 10:36 AM, Ashutosh Sharma >> wrote: >>> A couple of review comments: >>> >>> You may also need to update the documentation as now we are also going >>> to support WAL consistency checking for hash indexes. The current >>> documentation does not include hash index. >>> >>> +only records originating from those resource managers. Currently, >>> +the supported resource managers are heap, >>> +heap2, btree, gin, >>> +gist, sequence, spgist, >>> +brin, and generic. Only >> >> Did that, committed this. Also ran pgindent over hash_mask() and >> fixed an instance of dubious capitalization. > Thanks Robert for the commit. > > > -- > Thanks & Regards, > Kuntal Ghosh > EnterpriseDB: http://www.enterprisedb.com 0001-Allow-WAL-consistency-tool-to-mask-LH_PAGE_HAS_DEAD_.patch
Re: [HACKERS] Microvacuum support for Hash Index
>>> Sure, but we are not clearing it conditionally. I am not sure how it >>> will be cleared after recovery if it gets set during normal operation. >>> Moreover, btree already clears a similar flag during replay (refer >>> btree_xlog_delete). >> >> You were right. In case data checksums are enabled or wal_log_hints is set >> to true, 'LH_PAGE_HAS_DEAD_TUPLES' will get WAL-logged and therefore >> needs to be cleared on the standby as well. >> > > I was thinking what bad can happen if we don't clear this flag during > replay; the main thing that comes to mind is that after crash > recovery, if the flag is set, the inserts on that page might need to > traverse all the tuples on that page once the page is full even if > there are no dead tuples in that page. It can be cleared later when > there are dead tuples in that page and we actually delete them, but I > don't think it is worth the price to pay for not clearing the flag > during replay. Yes, you are absolutely correct. If we do not clear this flag during replay, there is a possibility of _hash_doinsert() unnecessarily scanning a page with no free space on the assumption that the page contains some dead tuples, when in fact it does not. > >> Attached is the patch that >> clears this flag on standby during replay. >> > > Don't you think we should also clear it during the replay of > XLOG_HASH_DELETE? We might want to log the clearing of the flag along with > the WAL record for XLOG_HASH_DELETE. > Yes, it should be cleared. I completely missed this part in a hurry. Thanks for pointing it out. I have taken care of it in the attached v2 patch. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com 0001-Reset-LH_PAGE_HAS_DEAD_TUPLES-flag-on-standby-during.patch
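The bit manipulation under discussion can be shown with a small stand-alone C sketch. The flag name matches the one in the thread, but its numeric value and the struct here are illustrative stand-ins, not the actual PostgreSQL definitions from hash.h:

```c
#include <stdint.h>

/* Illustrative stand-in for the hash page flag bit; the real value is
 * defined in PostgreSQL's hash.h. */
#define LH_PAGE_HAS_DEAD_TUPLES ((uint16_t) (1 << 7))

/* Simplified stand-in for the page's special-area opaque data. */
typedef struct HashPageOpaqueSketch
{
    uint16_t    hasho_flag;     /* page type and status flags */
} HashPageOpaqueSketch;

/* What replay must do: clear the hint bit so that, after recovery, an
 * insert does not scan a full page looking for dead tuples that are
 * not there. */
static void
redo_clear_dead_tuples_flag(HashPageOpaqueSketch *opaque)
{
    opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
}

static int
page_has_dead_tuples(const HashPageOpaqueSketch *opaque)
{
    return (opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES) != 0;
}
```

Clearing with `&= ~flag` leaves all other page-type bits intact, which is why the redo routine can clear this one hint without disturbing the rest of the page's state.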
Re: [HACKERS] Microvacuum support for Hash Index
On Fri, Mar 17, 2017 at 8:20 AM, Amit Kapila wrote: > On Thu, Mar 16, 2017 at 9:39 PM, Ashutosh Sharma > wrote: >>>> >>> >>> Don't you think we should also clear it during the replay of >>> XLOG_HASH_DELETE? We might want to log the clearing of the flag along with >>> the WAL record for XLOG_HASH_DELETE. >>> >> >> Yes, it should be cleared. I completely missed this part in a hurry. >> Thanks for pointing it out. I have taken care of it in the attached v2 >> patch. >> > > + /* > + * Mark the page as not containing any LP_DEAD items. See comments > + * in hashbucketcleanup() for details. > + */ > + pageopaque = (HashPageOpaque) PageGetSpecialPointer(page); > + pageopaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES; > > Your comment here says refer to hashbucketcleanup, and in that function > the comment says "Clearing this flag is just a hint; replay won't redo > this.". Both seem contradictory. You need to change the comment in > hashbucketcleanup. Done. Please check the attached v3 patch. > As I said in my previous e-mail, I think you need > to record clearing of this flag in the WAL record XLOG_HASH_DELETE, as you > are not doing this unconditionally, and then during replay clear it > only when the WAL record indicates the same. Thank you for raising that point. I too think that we should record the flag status in the WAL record and clear it only when required during replay. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com 0001-Reset-LH_PAGE_HAS_DEAD_TUPLES-flag-during-replay-v3.patch
Re: [HACKERS] WAL Consistency checking for hash indexes
On Fri, Mar 17, 2017 at 9:03 AM, Amit Kapila wrote: > On Thu, Mar 16, 2017 at 1:15 PM, Ashutosh Sharma > wrote: >> Hi, >> >> Attached is the patch that allows the WAL consistency tool to mask the >> 'LH_PAGE_HAS_DEAD_TUPLES' flag in hash indexes. The flag was added as >> part of the 'Microvacuum support for Hash index' patch. I have already >> tested it using Kuntal's WAL consistency tool and it works fine. >> > > + * unlogged. So, mask it. See _hash_kill_items(), MarkBufferDirtyHint() > + * for details. > + */ > > I think in the above comment, a reference to _hash_kill_items is > sufficient. Other than that, the patch looks okay. Okay, I have removed the reference to MarkBufferDirtyHint() from the above comment. Attached is the v2 version of the patch. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com 0001-Allow-WAL-consistency-tool-to-mask-LH_PAGE_HAS_DEAD_.patch
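To make the masking idea concrete, here is a simplified stand-alone model (not the real hash_mask() code): before the consistency checker compares the primary and standby page images, it clears any bit that may legitimately differ between the two, such as an unlogged hint bit. The flag value below is illustrative:

```c
#include <stdint.h>

#define LH_PAGE_HAS_DEAD_TUPLES ((uint16_t) (1 << 7))   /* illustrative value */

/* Clear the unlogged hint bit before comparison; flag_word stands in
 * for the hasho_flag field in a page's special area. */
static void
mask_hint_bit(uint16_t *flag_word)
{
    *flag_word &= ~LH_PAGE_HAS_DEAD_TUPLES;
}

/* Returns 1 when the two images agree once masked differences are
 * ignored; a difference in any other bit is still reported. */
static int
flags_consistent(uint16_t primary_flags, uint16_t standby_flags)
{
    mask_hint_bit(&primary_flags);
    mask_hint_bit(&standby_flags);
    return primary_flags == standby_flags;
}
```

The point of masking is exactly this asymmetry: the hint bit may be set on one side and not the other without indicating corruption, so it must not participate in the comparison.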
Re: [HACKERS] Microvacuum support for Hash Index
On Fri, Mar 17, 2017 at 6:13 PM, Amit Kapila wrote: > On Fri, Mar 17, 2017 at 12:27 PM, Ashutosh Sharma > wrote: >> On Fri, Mar 17, 2017 at 8:20 AM, Amit Kapila wrote: >> >> As I said in my previous e-mail, I think you need >>> to record clearing of this flag in WAL record XLOG_HASH_DELETE as you >>> are not doing this unconditionally and then during replay clear it >>> only when the WAL record indicates the same. >> >> Thank you so much for putting that point. I too think that we should >> record the flag status in the WAL record and clear it only when >> required during replay. >> > > I think hashdesc.c needs an update (refer case XLOG_HASH_DELETE:). Done. Thanks! > > - * flag. Clearing this flag is just a hint; replay won't redo this. > + * flag. Clearing this flag is just a hint; replay will check the > + * status of clear_dead_marking flag before redo it. > */ > if (tuples_removed && *tuples_removed > 0 && > opaque->hasho_flag & LH_PAGE_HAS_DEAD_TUPLES) > + { > opaque->hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES; > + clear_dead_marking = true; > + } > > > I feel the above comment is not required as you are logging this > action explicitly. That's right. I have removed it in the attached v4 patch. > > + bool clear_dead_marking; /* TRUE if VACUUM clears > > No need to write VACUUM explicitly, you can simply say "TRUE if this > operation clears ...". Corrected. Please find the attached v4 patch. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com 0001-Reset-LH_PAGE_HAS_DEAD_TUPLES-flag-during-replay-v4.patch Description: binary/octet-stream -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
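The design settled on above — record in the XLOG_HASH_DELETE record whether the flag was actually cleared, and replay the clear only when the record says so — can be sketched stand-alone like this. The clear_dead_marking field mirrors the one added by the patch; everything else is a simplified stand-in for the real WAL machinery:

```c
#include <stdbool.h>
#include <stdint.h>

#define LH_PAGE_HAS_DEAD_TUPLES ((uint16_t) (1 << 7))   /* illustrative value */

/* Stand-in for the xl_hash_delete WAL record payload. */
typedef struct XlHashDeleteSketch
{
    bool        clear_dead_marking;     /* TRUE if this operation clears
                                         * LH_PAGE_HAS_DEAD_TUPLES */
} XlHashDeleteSketch;

/* Primary side: clear the flag only when it was actually set, and
 * remember that fact in the WAL record. */
static void
do_delete(uint16_t *hasho_flag, XlHashDeleteSketch *xlrec)
{
    xlrec->clear_dead_marking = false;
    if (*hasho_flag & LH_PAGE_HAS_DEAD_TUPLES)
    {
        *hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
        xlrec->clear_dead_marking = true;
    }
}

/* Standby side: clear the flag only if the record says the primary did,
 * keeping both copies of the page in step. */
static void
redo_delete(uint16_t *hasho_flag, const XlHashDeleteSketch *xlrec)
{
    if (xlrec->clear_dead_marking)
        *hasho_flag &= ~LH_PAGE_HAS_DEAD_TUPLES;
}
```

Because the clear is conditional on the primary, logging the decision (rather than clearing unconditionally during replay) is what keeps the primary and standby page images byte-identical for the consistency checker.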
Re: [HACKERS] pageinspect and hash indexes
On Fri, Mar 17, 2017 at 10:54 PM, Jeff Janes wrote: > While trying to figure out some bloating in the newly logged hash indexes, > I'm looking into the type of each page in the index. But I get an error: > > psql -p 9876 -c "select hash_page_type(get_raw_page('foo_index_idx',x)) from > generate_series(1650,1650) f(x)" > > ERROR: page is not a hash page > DETAIL: Expected ff80, got . > > The contents of the page are: > > \xa400d8f203bf65c91800f01ff01f0420... > > (where the elided characters at the end are all zero) > > What kind of page is that actually? It is basically either a newly allocated bucket page or a freed overflow page. You are seeing this error because of the following change made by the WAL patch for hash indexes. When we started working on WAL for hash indexes, we found that the WAL routine 'XLogReadBufferExtended' does not expect a page to be completely zero, else it returns an invalid buffer. To fix this, we started initializing freed overflow pages and new bucket pages using _hash_pageinit(), which initialises the page header portion but not its special area, where the page type information is present. That's why you are seeing an ERROR saying 'page is not a hash page'. Actually, the pageinspect module needs to handle this type of page. Currently it handles zero pages but not empty pages. I will submit a patch for this. > And isn't it unhelpful to have the > pageinspect module throw errors, rather than returning a dummy value to > indicate there was an error? Well, this is something not specific to hash indexes. So, I have no answer :) -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
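The distinction pageinspect has to make can be sketched stand-alone: a never-written page is all zero bytes, while a page initialized by _hash_pageinit() has a valid header but no hash data in its special area. The header values used here (pd_lower = 24, pd_upper = pd_special) follow the page dump quoted later in this thread; the struct is a simplified stand-in for the real PageHeaderData, so treat this as an illustration only:

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLCKSZ 8192
#define SKETCH_HEADER_SIZE 24   /* offset where line pointers would start,
                                 * per the pd_lower = 24 seen in the dump */

/* Minimal stand-in for a page header; the real PageHeaderData is in
 * PostgreSQL's bufpage.h. */
typedef struct PageHeaderSketch
{
    uint16_t    pd_lower;
    uint16_t    pd_upper;
    uint16_t    pd_special;
} PageHeaderSketch;

/* A page that is all zero bytes was never initialized at all. */
static int
page_is_zero(const unsigned char *page)
{
    for (size_t i = 0; i < SKETCH_BLCKSZ; i++)
        if (page[i] != 0)
            return 0;
    return 1;
}

/* A header-initialized page with no line pointers and nothing stored
 * in the would-be special area: the "empty page" case pageinspect
 * would also need to recognize instead of erroring out. */
static int
page_is_empty(const PageHeaderSketch *hdr)
{
    return hdr->pd_lower == SKETCH_HEADER_SIZE &&
           hdr->pd_special == hdr->pd_upper;
}
```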
Re: [HACKERS] pageinspect and hash indexes
On Sat, Mar 18, 2017 at 1:34 PM, Amit Kapila wrote: > On Sat, Mar 18, 2017 at 12:12 AM, Ashutosh Sharma > wrote: >> On Fri, Mar 17, 2017 at 10:54 PM, Jeff Janes wrote: >>> While trying to figure out some bloating in the newly logged hash indexes, >>> I'm looking into the type of each page in the index. But I get an error: >>> >>> psql -p 9876 -c "select hash_page_type(get_raw_page('foo_index_idx',x)) from >>> generate_series(1650,1650) f(x)" >>> >>> ERROR: page is not a hash page >>> DETAIL: Expected ff80, got . >>> >>> The contents of the page are: >>> >>> \xa400d8f203bf65c91800f01ff01f0420... >>> >>> (where the elided characters at the end are all zero) >>> >>> What kind of page is that actually? >> >> it is basically either a newly allocated bucket page or a freed overflow >> page. >> > > What makes you think that it can be a newly allocated page? > Basically, we always initialize the special space of a newly allocated > page, so I am not sure what makes you deduce that it can be a newly > allocated page. I came to know this from the following experiment. I created a hash index and kept on inserting data into it till the split happened. When the split happened, I could see the following values for firstblock and lastblock in _hash_alloc_buckets(): Breakpoint 1, _hash_alloc_buckets (rel=0x7f6ac951ee30, firstblock=34, nblocks=32) at hashpage.c:993 (gdb) n (gdb) p firstblock $15 = 34 (gdb) p nblocks $16 = 32 (gdb) n (gdb) p lastblock $17 = 65 AFAIU, this bucket split resulted in the creation of new bucket pages from block numbers 34 to 65.
The contents of metap are as follows: (gdb) p *metap $18 = {hashm_magic = 105121344, hashm_version = 2, hashm_ntuples = 2593, hashm_ffactor = 81, hashm_bsize = 8152, hashm_bmsize = 4096, hashm_bmshift = 15, hashm_maxbucket = 32, hashm_highmask = 63, hashm_lowmask = 31, hashm_ovflpoint = 6, hashm_firstfree = 0, hashm_nmaps = 1, hashm_procid = 450, hashm_spares = {0, 0, 0, 0, 0, 1, 1, 0 <repeats 25 times>}, hashm_mapp = {33, 0 ...}} Now, if I try to check the page type for block number 65, this is what I see: test=# select * from hash_page_type(get_raw_page('con_hash_index', 65)); ERROR: page is not a hash page DETAIL: Expected ff80, got . test=# -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] ANALYZE command progress checker
Hi Vinayak, Here are a couple of review comments that may need your attention. 1) Firstly, I am seeing some trailing whitespace errors when trying to apply your v3 patch using the git apply command. [ashu@localhost postgresql]$ git apply pg_stat_progress_analyze_v3.patch pg_stat_progress_analyze_v3.patch:155: trailing whitespace. CREATE VIEW pg_stat_progress_analyze AS pg_stat_progress_analyze_v3.patch:156: trailing whitespace. SELECT pg_stat_progress_analyze_v3.patch:157: trailing whitespace. S.pid AS pid, S.datid AS datid, D.datname AS datname, pg_stat_progress_analyze_v3.patch:158: trailing whitespace. S.relid AS relid, pg_stat_progress_analyze_v3.patch:159: trailing whitespace. CASE S.param1 WHEN 0 THEN 'initializing' error: patch failed: doc/src/sgml/monitoring.sgml:521 2) The test-case 'rules' is failing. I think it's failing because 'ORDER BY viewname' is used in rules.sql, which will put 'pg_stat_progress_analyze' ahead of 'pg_stat_progress_vacuum' in the list of catalog views. You may need to correct the rules.out file accordingly. This is what I could see in rules.sql: SELECT viewname, definition FROM pg_views WHERE schemaname <> 'information_schema' ORDER BY viewname; I am still reviewing your patch and doing some testing. Will update if I find any issues. Thanks. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com On Fri, Mar 17, 2017 at 3:16 PM, vinayak wrote: > > On 2017/03/17 10:38, Robert Haas wrote: >> >> On Fri, Mar 10, 2017 at 2:46 AM, vinayak >> wrote: >>> >>> Thank you for reviewing the patch. >>> >>> The attached patch incorporated Michael and Amit comments also. >> >> I reviewed this tonight. > > Thank you for reviewing the patch. >> >> +/* Report compute index stats phase */ >> +pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE, >> + >> PROGRESS_ANALYZE_PHASE_COMPUTE_INDEX_STATS); >> >> Hmm, is there really any point to this? And is it even accurate?
It >> doesn't look to me like we are computing any index statistics here; >> we're only allocating a few in-memory data structures here, I think, >> which is not a statistics computation and probably doesn't take any >> significant time. >> >> +/* Report compute heap stats phase */ >> +pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE, >> + >> PROGRESS_ANALYZE_PHASE_COMPUTE_HEAP_STATS); >> >> The phase you label as computing heap statistics also includes the >> computation of index statistics. I wouldn't bother separating the >> computation of heap and index statistics; I'd just have a "compute >> statistics" phase that begins right after the comment that starts >> "Compute the statistics." > > Understood. Fixed in the attached patch. >> >> >> +/* Report that we are now doing index cleanup */ >> +pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE, >> +PROGRESS_ANALYZE_PHASE_INDEX_CLEANUP); >> >> OK, this doesn't make any sense either. We are not cleaning up the >> indexes here. We are just closing them and releasing resources. I >> don't see why you need this phase at all; it can't last more than some >> tiny fraction of a second. It seems like you're trying to copy the >> exact same phases that exist for vacuum instead of thinking about what >> makes sense for ANALYZE. > > Understood. I have removed this phase. >> >> >> +/* Report number of heap blocks scaned so far*/ >> +pgstat_progress_update_param(PROGRESS_ANALYZE_HEAP_BLKS_SCANNED, >> targblock); >> >> While I don't think this is necessarily a bad thing to report, I find >> it very surprising that you're not reporting what seems to me like the >> most useful thing here - namely, the number of blocks that will be >> sampled in total and the number of those that you've selected. >> Instead, you're just reporting the length of the relation and the >> last-sampled block, which is a less-accurate guide to total progress. >> There are enough counters to report both things, so maybe we should. 
>> >> +/* Report total number of sample rows*/ >> +pgstat_progress_update_param(PROGRESS_ANALYZE_TOTAL_SAMPLE_ROWS, >> numrows); >> >> As an alternative to the previous paragraph, yet another thing you >> could report is number of rows sampled out of total rows to be >> sampled. But this isn't the way to do it: you are reporting the >> number of rows you sampled
Re: [HACKERS] ANALYZE command progress checker
Hi, I didn't find any major issues with the patch. It works as expected. However, I have a few minor comments that I would like to share. + + Total number of sample rows. The sample it reads is taken randomly. + Its size depends on the default_statistics_target parameter value. + 1) I think it would be better if you could add a reference link to the GUC variable 'default_statistics_target'. If you go through monitoring.sgml, you will find that there are reference links to all the GUC variables being used. 2) I feel that the 'computing_statistics' phase could have been split into 'computing_column_stats' and 'computing_index_stats'. Please let me know your thoughts on this. + certain commands during command execution. Currently, the command + which supports progress reporting is VACUUM and ANALYZE. This may be expanded in the future. 3) I think the above needs to be rephrased. Something like: "Currently, the supported progress reporting commands are VACUUM and ANALYZE." Moreover, I also ran your patch on the Windows platform and didn't find any issues. Thanks. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com On Sat, Mar 18, 2017 at 5:30 PM, Ashutosh Sharma wrote: > Hi Vinayak, > > Here are a couple of review comments that may need your attention. > > 1) Firstly, I am seeing some trailing whitespace errors when trying to > apply your v3 patch using the git apply command. > > [ashu@localhost postgresql]$ git apply pg_stat_progress_analyze_v3.patch > pg_stat_progress_analyze_v3.patch:155: trailing whitespace. > CREATE VIEW pg_stat_progress_analyze AS > pg_stat_progress_analyze_v3.patch:156: trailing whitespace. > SELECT > pg_stat_progress_analyze_v3.patch:157: trailing whitespace. > S.pid AS pid, S.datid AS datid, D.datname AS datname, > pg_stat_progress_analyze_v3.patch:158: trailing whitespace. > S.relid AS relid, > pg_stat_progress_analyze_v3.patch:159: trailing whitespace.
> CASE S.param1 WHEN 0 THEN 'initializing' > error: patch failed: doc/src/sgml/monitoring.sgml:521 > > 2) The test-case 'rules' is failing. I think it's failing because in > rules.sql 'ORDERBY viewname' is used which will put > 'pg_stat_progress_analyze' ahead of 'pg_stat_progress_vacuum' in the > list of catalog views. You may need to correct rules.out file > accordingly. This is what i could see in rules.sql, > > SELECT viewname, definition FROM pg_views WHERE schemaname <> > 'information_schema' ORDER BY viewname; > > I am still reviewing your patch and doing some testing. Will update if > i find any issues. Thanks. > > -- > With Regards, > Ashutosh Sharma > EnterpriseDB:http://www.enterprisedb.com > > On Fri, Mar 17, 2017 at 3:16 PM, vinayak > wrote: >> >> On 2017/03/17 10:38, Robert Haas wrote: >>> >>> On Fri, Mar 10, 2017 at 2:46 AM, vinayak >>> wrote: >>>> >>>> Thank you for reviewing the patch. >>>> >>>> The attached patch incorporated Michael and Amit comments also. >>> >>> I reviewed this tonight. >> >> Thank you for reviewing the patch. >>> >>> +/* Report compute index stats phase */ >>> +pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE, >>> + >>> PROGRESS_ANALYZE_PHASE_COMPUTE_INDEX_STATS); >>> >>> Hmm, is there really any point to this? And is it even accurate? It >>> doesn't look to me like we are computing any index statistics here; >>> we're only allocating a few in-memory data structures here, I think, >>> which is not a statistics computation and probably doesn't take any >>> significant time. >>> >>> +/* Report compute heap stats phase */ >>> +pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE, >>> + >>> PROGRESS_ANALYZE_PHASE_COMPUTE_HEAP_STATS); >>> >>> The phase you label as computing heap statistics also includes the >>> computation of index statistics. 
I wouldn't bother separating the >>> computation of heap and index statistics; I'd just have a "compute >>> statistics" phase that begins right after the comment that starts >>> "Compute the statistics." >> >> Understood. Fixed in the attached patch. >>> >>> >>> +/* Report that we are now doing index cleanup */ >>> +pgstat_progress_update_param(PROGRESS_ANALYZE_PHASE, >>> +PROGRESS_ANALYZE_PHASE_INDEX_CLEANUP); >
Re: [HACKERS] pageinspect and hash indexes
On Mon, Mar 20, 2017 at 9:31 AM, Amit Kapila wrote: > On Sat, Mar 18, 2017 at 5:13 PM, Ashutosh Sharma > wrote: >> On Sat, Mar 18, 2017 at 1:34 PM, Amit Kapila wrote: >>> On Sat, Mar 18, 2017 at 12:12 AM, Ashutosh Sharma >>> wrote: >>>> On Fri, Mar 17, 2017 at 10:54 PM, Jeff Janes wrote: >>>>> While trying to figure out some bloating in the newly logged hash indexes, >>>>> I'm looking into the type of each page in the index. But I get an error: >>>>> >>>>> psql -p 9876 -c "select hash_page_type(get_raw_page('foo_index_idx',x)) >>>>> from >>>>> generate_series(1650,1650) f(x)" >>>>> >>>>> ERROR: page is not a hash page >>>>> DETAIL: Expected ff80, got . >>>>> >>>>> The contents of the page are: >>>>> >>>>> \xa400d8f203bf65c91800f01ff01f0420... >>>>> >>>>> (where the elided characters at the end are all zero) >>>>> >>>>> What kind of page is that actually? >>>> >>>> it is basically either a newly allocated bucket page or a freed overflow >>>> page. >>>> >>> >>> What makes you think that it can be a newly allocated page? >>> Basically, we always initialize the special space of newly allocated >>> page, so not sure what makes you deduce that it can be newly allocated >>> page. >> >> I came to know this from the following experiment. >> >> I created a hash index and kept on inserting data in it till the split >> happens. >> >> When split happened, I could see following values for firstblock and >> lastblock in _hash_alloc_buckets() >> >> Breakpoint 1, _hash_alloc_buckets (rel=0x7f6ac951ee30, firstblock=34, >> nblocks=32) at hashpage.c:993 >> (gdb) n >> (gdb) pfirstblock >> $15 = 34 >> (gdb) pnblocks >> $16 = 32 >> (gdb) n >> (gdb) plastblock >> $17 = 65 >> >> AFAIU, this bucket split resulted in creation of new bucket pages from >> block number 34 to 65. 
>> >> The contents of metap are as follows: >> >> (gdb) p *metap >> $18 = {hashm_magic = 105121344, hashm_version = 2, hashm_ntuples = >> 2593, hashm_ffactor = 81, hashm_bsize = 8152, hashm_bmsize = 4096, >> hashm_bmshift = 15, >> hashm_maxbucket = 32, hashm_highmask = 63, hashm_lowmask = 31, >> hashm_ovflpoint = 6, hashm_firstfree = 0, hashm_nmaps = 1, >> hashm_procid = 450, >> hashm_spares = {0, 0, 0, 0, 0, 1, 1, 0 <repeats 25 times>}, >> hashm_mapp = {33, 0 ...}} >> >> Now, if I try to check the page type for block number 65, this is what I see: >> >> test=# select * from hash_page_type(get_raw_page('con_hash_index', 65)); >> ERROR: page is not a hash page >> DETAIL: Expected ff80, got . >> test=# >> > > The contents of such a page should be zero, and Jeff has reported some > valid-looking contents of the page. If you see this page's contents as > zero, then we can conclude that what Jeff is seeing was a freed overflow > page. As shown in the mail thread [1], the contents of the metapage are: (gdb) p *metap $18 = {hashm_magic = 105121344, hashm_version = 2, hashm_ntuples = 2593, hashm_ffactor = 81, hashm_bsize = 8152, hashm_bmsize = 4096, hashm_bmshift = 15, hashm_maxbucket = 32, hashm_highmask = 63, hashm_lowmask = 31, hashm_ovflpoint = 6, hashm_firstfree = 0, hashm_nmaps = 1, hashm_procid = 450, hashm_spares = {0, 0, 0, 0, 0, 1, 1, 0 <repeats 25 times>}, hashm_mapp = {33, 0 ...}} postgres=# select spares from hash_metapage_info(get_raw_page('con_hash_index', 0)); spares --- {0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0} (1 row) Here, if you see, the spare page count is just 1, which corresponds to the bitmap page. So that means there is no overflow page in the hash index, and nor have I run any DELETE statements in my experiment that would result in freeing an overflow page.
Also, the page header content for such a page is: $9 = {pd_lsn = {xlogid = 0, xrecoff = 23638120}, pd_checksum = 0, pd_flags = 0, pd_lower = 24, pd_upper = 8176, pd_special = 8176, pd_pagesize_version = 8196, pd_prune_xid = 0, pd_linp = 0x1f3aa88} From the values of pd_lower and pd_upper it is clear that it is an empty page. The content of this page is: \xb0308a011800f01ff01f042000. I think it is not just happening for freed overflow pages but also for newly allocated
[HACKERS] comments in hash_alloc_buckets
Hi, While working on [1], I realised that the following comment in _hash_alloc_buckets() needs to be corrected: /* * Initialize the freed overflow page. Just zeroing the page won't work, * See _hash_freeovflpage for similar usage. */ _hash_pageinit(page, BLCKSZ); Attached is the patch that corrects the above comment. Thanks. [1] - https://www.postgresql.org/message-id/CAMkU%3D1y6NjKmqbJ8wLMhr%3DF74WzcMALYWcVFhEpm7i%3DmV%3DXsOg%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com corrected_comments_hash_alloc_buckets.patch
Re: [HACKERS] pageinspect and hash indexes
On Mon, Mar 20, 2017 at 6:53 PM, Ashutosh Sharma wrote: > On Mon, Mar 20, 2017 at 9:31 AM, Amit Kapila wrote: >> On Sat, Mar 18, 2017 at 5:13 PM, Ashutosh Sharma >> wrote: >>> On Sat, Mar 18, 2017 at 1:34 PM, Amit Kapila >>> wrote: >>>> On Sat, Mar 18, 2017 at 12:12 AM, Ashutosh Sharma >>>> wrote: >>>>> On Fri, Mar 17, 2017 at 10:54 PM, Jeff Janes wrote: >>>>>> While trying to figure out some bloating in the newly logged hash >>>>>> indexes, >>>>>> I'm looking into the type of each page in the index. But I get an error: >>>>>> >>>>>> psql -p 9876 -c "select hash_page_type(get_raw_page('foo_index_idx',x)) >>>>>> from >>>>>> generate_series(1650,1650) f(x)" >>>>>> >>>>>> ERROR: page is not a hash page >>>>>> DETAIL: Expected ff80, got . >>>>>> >>>>>> The contents of the page are: >>>>>> >>>>>> \xa400d8f203bf65c91800f01ff01f0420... >>>>>> >>>>>> (where the elided characters at the end are all zero) >>>>>> >>>>>> What kind of page is that actually? >>>>> >>>>> it is basically either a newly allocated bucket page or a freed overflow >>>>> page. >>>>> >>>> >>>> What makes you think that it can be a newly allocated page? >>>> Basically, we always initialize the special space of newly allocated >>>> page, so not sure what makes you deduce that it can be newly allocated >>>> page. >>> >>> I came to know this from the following experiment. >>> >>> I created a hash index and kept on inserting data in it till the split >>> happens. >>> >>> When split happened, I could see following values for firstblock and >>> lastblock in _hash_alloc_buckets() >>> >>> Breakpoint 1, _hash_alloc_buckets (rel=0x7f6ac951ee30, firstblock=34, >>> nblocks=32) at hashpage.c:993 >>> (gdb) n >>> (gdb) pfirstblock >>> $15 = 34 >>> (gdb) pnblocks >>> $16 = 32 >>> (gdb) n >>> (gdb) plastblock >>> $17 = 65 >>> >>> AFAIU, this bucket split resulted in creation of new bucket pages from >>> block number 34 to 65. 
>>> >>> The contents of metap are as follows: >>> >>> (gdb) p *metap >>> $18 = {hashm_magic = 105121344, hashm_version = 2, hashm_ntuples = >>> 2593, hashm_ffactor = 81, hashm_bsize = 8152, hashm_bmsize = 4096, >>> hashm_bmshift = 15, >>> hashm_maxbucket = 32, hashm_highmask = 63, hashm_lowmask = 31, >>> hashm_ovflpoint = 6, hashm_firstfree = 0, hashm_nmaps = 1, >>> hashm_procid = 450, >>> hashm_spares = {0, 0, 0, 0, 0, 1, 1, 0 <repeats 25 times>}, >>> hashm_mapp = {33, 0 ...}} >>> >>> Now, if I try to check the page type for block number 65, this is what I >>> see: >>> >>> test=# select * from hash_page_type(get_raw_page('con_hash_index', 65)); >>> ERROR: page is not a hash page >>> DETAIL: Expected ff80, got . >>> test=# >>> >> >> The contents of such a page should be zero, and Jeff has reported some >> valid-looking contents of the page. If you see this page's contents as >> zero, then we can conclude that what Jeff is seeing was a freed overflow >> page. > > As shown in the mail thread [1], the contents of the metapage are: > > (gdb) p *metap > $18 = {hashm_magic = 105121344, hashm_version = 2, hashm_ntuples > = 2593, hashm_ffactor = 81, hashm_bsize = 8152, hashm_bmsize = 4096, > hashm_bmshift = 15, hashm_maxbucket = 32, hashm_highmask = 63, > hashm_lowmask = 31, hashm_ovflpoint = 6, hashm_firstfree = 0, > hashm_nmaps = 1, > hashm_procid = 450, hashm_spares = {0, 0, 0, 0, 0, 1, 1, 0 <repeats > 25 times>}, hashm_mapp = {33, 0 ...}} > > postgres=# select spares from > hash_metapage_info(get_raw_page('con_hash_index', 0)); > spares > --- > {0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0} > (1 row) > > Here, if you see, the spare page count is just 1, which corresponds to > the bitmap page. So that means there is no overflow page in the hash > index, and nor have I run any DELETE statements in my experiment > that would result in freeing an overflow
Re: [HACKERS] segfault in hot standby for hash indexes
Hi Jeff, On Tue, Mar 21, 2017 at 1:54 PM, Amit Kapila wrote: > On Tue, Mar 21, 2017 at 1:28 PM, Jeff Janes wrote: >> Against an unmodified HEAD (17fa3e8), I got a segfault in the hot standby. >> > > I think I see the problem in hash_xlog_vacuum_get_latestRemovedXid(). > It seems to me that we are using one block_id for registering > the deleted items in the XLOG_HASH_VACUUM_ONE_PAGE record and a > different block_id for fetching those items in > hash_xlog_vacuum_get_latestRemovedXid(). So probably matching those > will fix this issue (instead of fetching block number and items from > block_id 1, we should use block_id 0). > Thanks for reporting this issue. As Amit said, it is happening due to a block_id mismatch. Attached is the patch that fixes the same. I apologise for such a silly mistake. Please note that I was not able to reproduce the issue on my local machine using the test script you shared. Could you please check whether you are still seeing the issue with the attached patch? Thanks in advance. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com corrected_block_id_reference_in_hash_vacuum_get_latestRemovedXid.patch
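The bug class is easy to model with a toy registry (this is not the real XLogRegisterBufData()/XLogRecGetBlockData() API, just an illustration of why the block_id used at registration time must match the one used at replay):

```c
#include <stddef.h>

#define MAX_BLOCK_IDS 4

/* Toy model of per-block data attached to a single WAL record;
 * file-scope statics are zero-initialized, so unused slots are NULL. */
static const void *reg_data[MAX_BLOCK_IDS];

static void
register_buf_data(int block_id, const void *data)
{
    if (block_id >= 0 && block_id < MAX_BLOCK_IDS)
        reg_data[block_id] = data;
}

/* Fetching with the wrong block_id yields NULL. In the real bug, the
 * deleted-items array was registered under block_id 0 but read back
 * under block_id 1, and dereferencing the bogus result crashed the
 * standby. */
static const void *
get_block_data(int block_id)
{
    if (block_id < 0 || block_id >= MAX_BLOCK_IDS)
        return NULL;
    return reg_data[block_id];
}
```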
Re: [HACKERS] segfault in hot standby for hash indexes
Hi Jeff, > > I can confirm that that fixes the seg faults for me. Thanks for the confirmation. > > Did you mean you couldn't reproduce the problem in the first place, or that > you could reproduce it and now the patch fixes it? If the first of those, I > forgot to say you do have to wait for hot standby to reach consistency and > open for connections, and then connect to the standby ("psql -p 9874"), > before the seg fault will be triggered. I meant that I was not able to reproduce the issue on HEAD. > > But, there are places where hash_xlog_vacuum_get_latestRemovedXid diverges > from btree_xlog_delete_get_latestRemovedXid, which I don't understand the > reason for the divergence. Is there a reason we dropped the PANIC if we > have not reached consistency? Well, I'm not quite sure how the standby would allow any backend to connect to it until it has reached a consistent state. If you see the definition of btree_xlog_delete_get_latestRemovedXid(), just before the consistency check there is an if-condition 'if (CountDBBackends(InvalidOid) == 0)', which means we are checking for a consistent state only after knowing that there are some backends connected to the standby. So, is there a possibility of having some backend connected to the standby server without it being in a consistent state? > That is a case which should never happen, but it seems worth preserving > the PANIC. And why does this code get 'unused' from > XLogRecGetBlockData(record, 0, &len), while the btree code gets it from > xlrec? Is that because the record being replayed is structured differently > between btree and hash, or is there some other reason? > Yes, that's right. In the case of a btree index, we have used XLogRegisterData() to add data to the WAL record, and to fetch it we use XLogRecGetData(). In the case of a hash index, we have associated the same data with a registered buffer using XLogRegisterBufData(0, ...), and to fetch it we use XLogRecGetBlockData(0, ...).
Now, if you see XLogRecordAssemble, it first adds all the data assciated with registered buffers into the WAL record followed by the main data (). Hence, the WAL record in btree and hash are organised differently. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] segfault in hot standby for hash indexes
Hi, On Wed, Mar 22, 2017 at 8:41 AM, Amit Kapila wrote: > On Tue, Mar 21, 2017 at 11:49 PM, Ashutosh Sharma > wrote: >>> >>> I can confirm that that fixes the seg faults for me. >> >> Thanks for confirmation. >> >>> >>> Did you mean you couldn't reproduce the problem in the first place, or that >>> you could reproduce it and now the patch fixes it? If the first of those, I >>> forget to say you do have to wait for hot standby to reach a consistency and >>> open for connections, and then connect to the standby ("psql -p 9874"), >>> before the seg fault will be triggered. >> >> I meant that I was not able to reproduce the issue on HEAD. >> >>> >>> But, there are places where hash_xlog_vacuum_get_latestRemovedXid diverges >>> from btree_xlog_delete_get_latestRemovedXid, which I don't understand the >>> reason for the divergence. Is there a reason we dropped the PANIC if we >>> have not reached consistency? >> >> Well, I'm not quite sure how would standby allow any backend to >> connect to it until it has reached to a consistent state. If you see >> the definition of btree_xlog_delete_get_latestRemovedXid(), just >> before consistency check there is a if-condition 'if >> (CountDBBackends(InvalidOid) == 0)' which means >> we are checking for consistent state only after knowing that there are >> some backends connected to the standby. So, Is there a possibility of >> having some backend connected to standby server without having it in >> consistent state. >> > > I don't think so, but I think we should have reachedConsistency check > and elog(PANIC,..) similar to btree. If you see other conditions > where we PANIC in btree or hash xlog code, you will notice that those > are also theoretically not possible cases. It seems this is to save > database from getting corrupt or behaving insanely if due to some > reason (like a coding error or others) the check fails. okay, agreed. I have included it in the attached patch. 
However, I am still able to see the crash reported by Jeff upthread [1]. I think there are a couple of things that need to be corrected inside hash_xlog_vacuum_get_latestRemovedXid():

1) As Amit mentioned in his earlier mail [2], the block id used for registering deleted items in XLOG_HASH_VACUUM_ONE_PAGE is different from the block id used for fetching those items. I had fixed this and shared the patch earlier. With that patch I still see the crash Jeff reported yesterday [1].

2) When a full page image of the registered block is taken, the modified data associated with that block is not included in the WAL record. In such a case, we won't be able to fetch the items array that we tried to include during xlog record (XLOG_HASH_VACUUM_ONE_PAGE) creation using the following call:

XLogRegisterBufData(0, (char *) deletable, ndeletable * sizeof(OffsetNumber));

If the above is not included in the WAL record, then fetching it using 'XLogRecGetBlockData(record, 0, &len);' will return a NULL pointer:

ptr = XLogRecGetBlockData(record, 0, &len);
unused = (OffsetNumber *) ptr;
iitemid = PageGetItemId(ipage, unused[i]);

Here, if ptr is NULL, dereferencing unused will cause a crash. To fix this, I think we should pass 'REGBUF_KEEP_DATA' while registering the buffer. Something like this:

- XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
+ XLogRegisterBuffer(0, buf, REGBUF_STANDARD | REGBUF_KEEP_DATA);

Attached is the patch that fixes this issue. Please have a look and let me know if you still have any concerns. Thank you.
[1] - https://www.postgresql.org/message-id/CAMkU%3D1w-9Qe%3DFf1o6bSaXpNO9wqpo7_9GL8_CVhw4BoVVHasqg%40mail.gmail.com [2] - https://www.postgresql.org/message-id/CAA4eK1%2BYUedok0%2Bmeynnf8K3fqFsfdMpEFz1JiKLyyNv46hjaA%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c index de7522e..75c7c43 100644 --- a/src/backend/access/hash/hash_xlog.c +++ b/src/backend/access/hash/hash_xlog.c @@ -977,12 +977,20 @@ hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record) return latestRemovedXid; /* + * Check if WAL replay has reached a consistent database state. If not, + * we must PANIC. See the definition of btree_xlog_delete_get_latestRemovedXid + * on which this function is based. + */ + if (!reachedConsistency) + elog(PANIC, "hash_xlog_vacuum_get_latestRemovedXid: cannot operate with inconsistent data"); + + /* * Get index page. If the DB is consistent, this should not fail, nor * should any of the heap page fetches below. If one does, we return * I
Re: [HACKERS] Page Scan Mode in Hash Index
Hi,

>> Attached patch modifies hash index scan code for page-at-a-time mode.
>> For better readability, I have split it into 3 parts,
>
> Due to the commits on master these patches apply with hunks.
>
> The README should be updated to mention the use of page scan.

Done. Please refer to the attached v2 version of the patch.

> hash.h needs pg_indent.

Fixed.

>> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this
>> patch rewrites the hash index scan module to work in page-at-a-time
>> mode. It basically introduces two new functions -- _hash_readpage() and
>> _hash_saveitem(). The former is used to load all the qualifying tuples
>> from a target bucket or overflow page into an items array. The latter
>> one is used by _hash_readpage to save all the qualifying tuples found
>> in a page into an items array. Apart from that, this patch basically
>> cleans up _hash_first(), _hash_next() and hashgettuple().
>
> For _hash_next I don't see this - can you explain?

Sorry, it was wrongly copied from the btree code. I have corrected it now. Please check the attached v2 version of the patch.

> + *
> + * On failure exit (no more tuples), we release pin and set
> + * so->currPos.buf to InvalidBuffer.
>
> + * Returns true if any matching items are found else returns false.
>
> s/Returns/Return/g

Done.

>> 2) 0002-Remove-redundant-function-_hash_step-and-some-of-the.patch:
>> this patch basically removes the redundant function _hash_step() and
>> some of the unused members of the HashScanOpaqueData structure.
>
> Looks good.

>> 3) 0003-Improve-locking-startegy-during-VACUUM-in-Hash-Index.patch:
>> this patch basically improves the locking strategy for VACUUM in hash
>> index. As the new hash index scan works page-at-a-time, vacuum can
>> release the lock on the previous page before acquiring a lock on the
>> next page, hence improving hash index concurrency.
>
> +* As the new hash index scan work in page at a time mode,
>
> Remove 'new'.

Done.
> >> I have also done the benchmarking of this patch and would like to >> share the results for the same, >> >> Firstly, I have done the benchmarking with non-unique values and i >> could see a performance improvement of 4-7%. For the detailed results >> please find the attached file 'results-non-unique values-70ff', and >> ddl.sql, test.sql are test scripts used in this experimentation. The >> detail of non-default GUC params and pgbench command are mentioned in >> the result sheet. I also did the benchmarking with unique values at >> 300 and 1000 scale factor and its results are provided in >> 'results-unique-values-default-ff'. >> > > I'm seeing similar results, and especially with write heavy scenarios. Great..!! -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From 78445d11db7157908d5558eedd253b9b28fb1e3e Mon Sep 17 00:00:00 2001 From: ashu Date: Wed, 22 Mar 2017 18:43:40 +0530 Subject: [PATCH] Rewrite hash index scans to work a page at a timev2 Patch by Ashutosh Sharma --- src/backend/access/hash/README | 9 +- src/backend/access/hash/hash.c | 120 ++--- src/backend/access/hash/hashpage.c | 14 +- src/backend/access/hash/hashsearch.c | 330 ++- src/backend/access/hash/hashutil.c | 23 ++- src/include/access/hash.h| 44 + 6 files changed, 384 insertions(+), 156 deletions(-) diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README index 1541438..f0a7bdf 100644 --- a/src/backend/access/hash/README +++ b/src/backend/access/hash/README @@ -243,10 +243,11 @@ The reader algorithm is: -- then, per read request: reacquire content lock on current page step to next page if necessary (no chaining of content locks, but keep - the pin on the primary bucket throughout the scan; we also maintain - a pin on the page currently being scanned) - get tuple - release content lock + the pin on the primary bucket throughout the scan) +save all the matching tuples from current index page into an items array +release pin and content lock (but if 
it is primary bucket page retain +it's pin till the end of scan) +get tuple from an item array -- at scan shutdown: release all pins still held diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index 34cc08f..2450ee1 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@ -268,65 +268,24 @@ bool hashgettuple(IndexScanDesc scan, ScanDirection dir) { HashScanOpaque so = (HashScanOpaque) scan->opaque; - Relation rel = scan->ind
Re: [HACKERS] pageinspect and hash indexes
>> I think it is not just happening for freed overflow pages but also for
>> newly allocated bucket pages. It would be good if we could mark a freed
>> overflow page as an UNUSED page rather than just initialising its header
>> portion and leaving the page type in the special area as it is. Attached
>> is the patch '0001-mark_freed_ovflpage_as_UNUSED_pagetype.patch' that
>> marks a freed overflow page as an unused page.
>>
>
> _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf));
> +
> + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage);
> +
> + ovflopaque->hasho_prevblkno = InvalidBlockNumber;
> + ovflopaque->hasho_nextblkno = InvalidBlockNumber;
> + ovflopaque->hasho_bucket = -1;
> + ovflopaque->hasho_flag = LH_UNUSED_PAGE;
> + ovflopaque->hasho_page_id = HASHO_PAGE_ID;
> +
>
> You need similar initialization in replay routine as well.

I will do that and share the patch shortly. Thanks.

>> Also, I have now changed the pageinspect module to handle unused and
>> empty hash index pages. Attached is the patch
>> (0002-allow_pageinspect_handle_UNUSED_OR_EMPTY_hash_pages.patch) for
>> the same.
>>
>
> 1.
> @@ -70,6 +70,17 @@ verify_hash_page(bytea *raw_page, int flags)
> pageopaque = (HashPageOpaque) PageGetSpecialPointer(page);
> +
> + /* Check if it is an unused hash page. */
> + if (pageopaque->hasho_flag == LH_UNUSED_PAGE)
> + return page;
>
> I don't see we need to have a separate check for unused page, why
> can't it be checked with other page types in below check.
>
> if (pagetype != LH_OVERFLOW_PAGE && pagetype != LH_BUCKET_PAGE &&
> pagetype != LH_BITMAP_PAGE && pagetype != LH_META_PAGE)

That is because an UNUSED page is as good as an empty page, except that an empty page doesn't have any page type. If we add a condition for the UNUSED page to the above check, an unused page will never be reported as unused; it will always be shown as an empty page. To avoid this, we return immediately at the start of the verify_hash_page function itself if we recognise the page as an UNUSED page.

> 2.
> + /* Check if it is an empty hash page. */
> + if (PageIsEmpty(page))
> + ereport(ERROR,
> + (errcode(ERRCODE_INDEX_CORRUPTED),
> + errmsg("index table contains empty page")));
>
> Do we want to give a separate message for EMPTY and NEW pages? Isn't
> it better that the same error message can be given for both of them as
> from user perspective there is not much difference between both the
> messages?

I think we should show separate messages because they are two different types of pages: one is initialised whereas the other is completely zeroed.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] Speed up Clog Access by increasing CLOG buffers
Hi All,

I have tried to test 'group_update_clog_v11.1.patch', shared upthread by Amit, on a high-end machine. I have tested the patch with various numbers of savepoints in my test script. The machine details along with the test scripts and the test results are shown below.

Machine details:
24 sockets, 192 CPU(s)
RAM - 500GB

test script:
\set aid random (1,3000)
\set tid random (1,3000)
BEGIN;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s1;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
SAVEPOINT s2;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s3;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
SAVEPOINT s4;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid for UPDATE;
SAVEPOINT s5;
SELECT tbalance FROM pgbench_tellers WHERE tid = :tid for UPDATE;
END;

Non-default parameters:
max_connections = 200
shared_buffers=8GB
min_wal_size=10GB
max_wal_size=15GB
maintenance_work_mem = 1GB
checkpoint_completion_target = 0.9
checkpoint_timeout=900
synchronous_commit=off

pgbench -M prepared -c $thread -j $thread -T $time_for_reading postgres -f ~/test_script.sql
where time_for_reading = 10 mins

Test Results:

With 3 savepoints:
CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
128            50275        53704         6.82048732
64             62860        66561         5.887686923
8              18464        18752         1.559792028

With 5 savepoints:
CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
128            46559        47715         2.482871196
64             52306        52082         -0.4282491492
8              12289        12852         4.581332899

With 7 savepoints:
CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
128            41367        41500         0.3215123166
64             42996        41473         -3.542189971
8              9665         9657          -0.0827728919

With 10 savepoints:
CLIENT COUNT   TPS (HEAD)   TPS (PATCH)   % IMPROVEMENT
128            34513        34597         0.24338655
64             32581        32035         -1.67582
8              7293         7622          4.511175099

Conclusion: As seen from the test results above, there is some performance improvement with 3 savepoints; with 5 savepoints the results with the patch are slightly better than HEAD; and with 7 and 10 savepoints we do see a regression with the patch. Therefore, I think the threshold value of 4 for the number of subtransactions considered in the patch looks fine to me.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

On Tue, Mar 21, 2017 at 6:19 PM, Amit Kapila wrote:
> On Mon, Mar 20, 2017 at 8:27 AM, Robert Haas wrote:
> > On Fri, Mar 17, 2017 at 2:30 AM, Amit Kapila wrote:
> >>> I was wondering about doing an explicit test: if the XID being
> >>> committed matches the one in the PGPROC, and nsubxids matches, and the
> >>> actual list of XIDs matches, then apply the optimization. That could
> >>> replace the logic that you've proposed to exclude non-commit cases,
> >>> gxact cases, etc. and it seems fundamentally safer. But it might be a
> >>> more expensive test, too, so I'm not sure.
> >>
> >> I think if the number of subxids is very small let us say under 5 or
> >> so, then such a check might not matter, but otherwise it could be
> >> expensive.
> >
> > We could find out by testing it. We could also restrict the
> > optimization to cases with just a few subxids, because if you've got a
> > large number of subxids this optimization probably isn't buying much
> > anyway.
> >
>
> Yes, and I have modified the patch to compare xids and subxids for
> group update. In the initial short tests (with few client counts), it
> seems like till 3 savepoints we can win and 10 savepoints onwards
> there is some regression or at the very least there doesn't appear to
> be any benefit. We need more tests to identify what is the safe
> number, but I thought it is better to share the patch to see if we
> agree on the changes because if not, then the whole testing needs to
> be repeated. Let me know what do you think about attached?
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] segfault in hot standby for hash indexes
On Thu, Mar 23, 2017 at 9:17 AM, Amit Kapila wrote:
>
> On Thu, Mar 23, 2017 at 8:43 AM, Amit Kapila wrote:
> > On Wed, Mar 22, 2017 at 3:39 PM, Ashutosh Sharma
> > wrote:
> >> Hi,
> >>
> >> On Wed, Mar 22, 2017 at 8:41 AM, Amit Kapila
> >> wrote:
> >>
> >> To fix this, I think we should pass 'REGBUF_KEEP_DATA' while
> >> registering the buffer. Something like this,
> >>
> >> - XLogRegisterBuffer(0, buf, REGBUF_STANDARD);
> >> + XLogRegisterBuffer(0, buf, REGBUF_STANDARD |
> >> REGBUF_KEEP_DATA);
> >>
> >> Attached is the patch that fixes this issue.
> >>
> >
> > I think this will work, but not sure if there is a merit to deviate
> > from what btree does to handle this case. One thing I find slightly
> > awkward in hash_xlog_vacuum_get_latestRemovedXid() is that you are
> > using a number of tuples registered as part of fixed data
> > (xl_hash_vacuum_one_page) to traverse the data registered as buf data.
> > I think it will be better if we register offsets also in fixed part of
> > data as we are doing btree case.

Agreed. I have made the changes accordingly. Please check the attached v2 patch.

> Also another small point in this regard, do we need two separate
> variables to track number of deleted items in below code? I think one
> variable is sufficient.
>
> _hash_vacuum_one_page()
> {
> ..
> deletable[ndeletable++] = offnum;
> tuples_removed += 1;
> ..
> }

Yes, I think 'ndeletable' alone should be fine.
-- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c index de7522e..d9ac42c 100644 --- a/src/backend/access/hash/hash_xlog.c +++ b/src/backend/access/hash/hash_xlog.c @@ -957,8 +957,6 @@ hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record) OffsetNumber hoffnum; TransactionId latestRemovedXid = InvalidTransactionId; int i; - char *ptr; - Size len; xlrec = (xl_hash_vacuum_one_page *) XLogRecGetData(record); @@ -977,12 +975,20 @@ hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record) return latestRemovedXid; /* + * Check if WAL replay has reached a consistent database state. If not, + * we must PANIC. See the definition of btree_xlog_delete_get_latestRemovedXid + * for more details. + */ + if (!reachedConsistency) + elog(PANIC, "hash_xlog_vacuum_get_latestRemovedXid: cannot operate with inconsistent data"); + + /* * Get index page. If the DB is consistent, this should not fail, nor * should any of the heap page fetches below. If one does, we return * InvalidTransactionId to cancel all HS transactions. That's probably * overkill, but it's safe, and certainly better than panicking here. */ - XLogRecGetBlockTag(record, 1, &rnode, NULL, &blkno); + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL); if (!BufferIsValid(ibuffer)) @@ -994,9 +1000,7 @@ hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record) * Loop through the deleted index items to obtain the TransactionId from * the heap items they point to. 
*/ - ptr = XLogRecGetBlockData(record, 1, &len); - - unused = (OffsetNumber *) ptr; + unused = (OffsetNumber *) ((char *) xlrec + SizeOfHashVacuumOnePage); for (i = 0; i < xlrec->ntuples; i++) { @@ -1121,23 +1125,15 @@ hash_xlog_vacuum_one_page(XLogReaderState *record) if (action == BLK_NEEDS_REDO) { - char *ptr; - Size len; - - ptr = XLogRecGetBlockData(record, 0, &len); - page = (Page) BufferGetPage(buffer); - if (len > 0) + if (XLogRecGetDataLen(record) > SizeOfHashVacuumOnePage) { OffsetNumber *unused; - OffsetNumber *unend; - unused = (OffsetNumber *) ptr; - unend = (OffsetNumber *) ((char *) ptr + len); + unused = (OffsetNumber *) ((char *) xldata + SizeOfHashVacuumOnePage); - if ((unend - unused) > 0) -PageIndexMultiDelete(page, unused, unend - unused); + PageIndexMultiDelete(page, unused, xldata->ntuples); } /* diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c index 8640e85..8699b5b 100644 --- a/src/backend/access/hash/hashinsert.c +++ b/src/backend/access/hash/hashinsert.c @@ -344,7 +344,6 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf, Page page = BufferGetPage(buf); HashPageOpaque pageopaque; HashMetaPage metap; - double tuples_removed = 0; /* Scan each tuple in page to see if it is marked as LP_DEAD */ maxoff = PageGetMaxOffsetNumber(page); @@ -355
Re: [HACKERS] pageinspect and hash indexes
Hi Amit,

>>> +
>>> + /* Check if it is an unused hash page. */
>>> + if (pageopaque->hasho_flag == LH_UNUSED_PAGE)
>>> + return page;
>>>
>>> I don't see we need to have a separate check for unused page, why
>>> can't it be checked with other page types in below check.
>>>
>>> if (pagetype != LH_OVERFLOW_PAGE && pagetype != LH_BUCKET_PAGE &&
>>> pagetype != LH_BITMAP_PAGE && pagetype != LH_META_PAGE)
>>
>> That is because an UNUSED page is as good as an empty page except that
>> an empty page doesn't have any page type. If we add a condition for
>> checking the UNUSED page in the above check, it will never show an
>> unused page as unused; it will always show it as an empty page.
>>
>
> Oh, okay, but my main objection was that we should not check hash page
> type (hasho_flag) without ensuring whether it is a hash page. Can you
> try to adjust the above code so that this check can be moved after
> hasho_page_id check?

Yes, I got your point. I have done that, but then I had to remove the check for PageIsEmpty(). Anyway, I think the PageIsEmpty() condition will only be true for one page in the entire hash index and can be ignored. If you wish, I could mention it in the documentation.

>
>> To avoid
>> this, at the start of the verify_hash_page function itself, if we
>> recognise the page as an UNUSED page we return immediately.
>>
>>>
>>> 2.
>>> + /* Check if it is an empty hash page. */
>>> + if (PageIsEmpty(page))
>>> + ereport(ERROR,
>>> + (errcode(ERRCODE_INDEX_CORRUPTED),
>>> + errmsg("index table contains empty page")));
>>>
>>>
>>> Do we want to give a separate message for EMPTY and NEW pages? Isn't
>>> it better that the same error message can be given for both of them as
>>> from user perspective there is not much difference between both the
>>> messages?
>>
>> I think we should show separate messages because they are two different
>> types of pages. In the sense that one is initialised whereas the other
>> is completely zeroed.
>> > > I understand your point, but not sure if it makes any difference to user. > okay, I have now anyways removed the check for PageIsEmpty. Please refer to the attached '0002 allow_pageinspect_handle_UNUSED_hash_pages.patch' Also, I have attached '0001-Mark-freed-overflow-page-as-UNUSED-pagetype-v2.patch' that handles your comment mentioned in [1]. [1] - https://www.postgresql.org/message-id/CAA4eK1%2BVE_TDRLWpyeOf%2B7%2B6if68kgPNwO4guKo060rm_t3O5w%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From 86eb711ea80b495c9d2f5228558682d0e95c41eb Mon Sep 17 00:00:00 2001 From: ashu Date: Thu, 23 Mar 2017 16:47:17 +0530 Subject: [PATCH] Mark freed overflow page as UNUSED pagetype v2 --- src/backend/access/hash/hash_xlog.c | 9 + src/backend/access/hash/hashovfl.c | 13 - 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c index d9ac42c..2ccaf46 100644 --- a/src/backend/access/hash/hash_xlog.c +++ b/src/backend/access/hash/hash_xlog.c @@ -697,11 +697,20 @@ hash_xlog_squeeze_page(XLogReaderState *record) if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO) { Page ovflpage; + HashPageOpaque ovflopaque; ovflpage = BufferGetPage(ovflbuf); _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf)); + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage); + + ovflopaque->hasho_prevblkno = InvalidBlockNumber; + ovflopaque->hasho_nextblkno = InvalidBlockNumber; + ovflopaque->hasho_bucket = -1; + ovflopaque->hasho_flag = LH_UNUSED_PAGE; + ovflopaque->hasho_page_id = HASHO_PAGE_ID; + PageSetLSN(ovflpage, lsn); MarkBufferDirty(ovflbuf); } diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c index a3cae21..d3eaee0 100644 --- a/src/backend/access/hash/hashovfl.c +++ b/src/backend/access/hash/hashovfl.c @@ -591,9 +591,20 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf, /* * Initialize the freed 
overflow page. Just zeroing the page won't work, * because WAL replay routines expect pages to be initialized. See - * explanation of RBM_NORMAL mode atop XLogReadBufferExtended. + * explanation of RBM_NORMAL mode atop XLogReadBufferExtended. Also, mark + * the page as UNUSED type and retain it's page id. This allows the tools + * like pageinspect to consider it as a hash page. */ _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf)); + + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage); + + ovflo
Re: [HACKERS] Add pgstathashindex() to get hash index table statistics.
Hi, On Tue, Feb 7, 2017 at 9:23 AM, Robert Haas wrote: > On Mon, Feb 6, 2017 at 10:40 PM, Amit Kapila wrote: >>> Maybe we should call them "unused pages". >> >> +1. If we consider some more names for that column then probably one >> alternative could be "empty pages". > > Yeah, but I think "unused" might be better. Because a page could be > in use (as an overflow page or primary bucket page) and still be > empty. > Based on the earlier discussions, I have prepared a patch that would allow pgstathashindex() to show the number of unused pages in hash index. Please find the attached patch for the same. Thanks. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From e3b59fa85f16d6d15be5360e85b7faf63e8683a9 Mon Sep 17 00:00:00 2001 From: ashu Date: Thu, 23 Mar 2017 23:02:26 +0530 Subject: [PATCH] Allow pgstathashindex to show unused pages v1 --- contrib/pgstattuple/expected/pgstattuple.out | 12 ++-- contrib/pgstattuple/pgstatindex.c | 19 --- contrib/pgstattuple/pgstattuple--1.4--1.5.sql | 1 + doc/src/sgml/pgstattuple.sgml | 17 - 4 files changed, 31 insertions(+), 18 deletions(-) diff --git a/contrib/pgstattuple/expected/pgstattuple.out b/contrib/pgstattuple/expected/pgstattuple.out index 2c3515b..1f1ff46 100644 --- a/contrib/pgstattuple/expected/pgstattuple.out +++ b/contrib/pgstattuple/expected/pgstattuple.out @@ -132,9 +132,9 @@ select * from pgstatginindex('test_ginidx'); create index test_hashidx on test using hash (b); select * from pgstathashindex('test_hashidx'); - version | bucket_pages | overflow_pages | bitmap_pages | zero_pages | live_items | dead_items | free_percent --+--++--++++-- - 2 |4 | 0 |1 | 0 | 0 | 0 | 100 + version | bucket_pages | overflow_pages | bitmap_pages | zero_pages | unused_pages | live_items | dead_items | free_percent +-+--++--++--+++-- + 2 |4 | 0 |1 | 0 |0 | 0 | 0 | 100 (1 row) -- these should error with the wrong type @@ -233,9 +233,9 @@ select pgstatindex('test_partition_idx'); (1 row) select 
pgstathashindex('test_partition_hash_idx'); - pgstathashindex -- - (2,8,0,1,0,0,0,100) +pgstathashindex +--- + (2,8,0,1,0,0,0,0,100) (1 row) drop table test_partitioned; diff --git a/contrib/pgstattuple/pgstatindex.c b/contrib/pgstattuple/pgstatindex.c index d448e9e..6fc41d6 100644 --- a/contrib/pgstattuple/pgstatindex.c +++ b/contrib/pgstattuple/pgstatindex.c @@ -120,6 +120,7 @@ typedef struct HashIndexStat BlockNumber overflow_pages; BlockNumber bitmap_pages; BlockNumber zero_pages; + BlockNumber unused_pages; int64 live_items; int64 dead_items; @@ -588,8 +589,8 @@ pgstathashindex(PG_FUNCTION_ARGS) BufferAccessStrategy bstrategy; HeapTuple tuple; TupleDesc tupleDesc; - Datum values[8]; - bool nulls[8]; + Datum values[9]; + bool nulls[9]; Buffer metabuf; HashMetaPage metap; float8 free_percent; @@ -667,6 +668,8 @@ pgstathashindex(PG_FUNCTION_ARGS) } else if (opaque->hasho_flag & LH_BITMAP_PAGE) stats.bitmap_pages++; + else if (PageIsEmpty(page)) +stats.unused_pages++; else ereport(ERROR, (errcode(ERRCODE_INDEX_CORRUPTED), @@ -680,8 +683,9 @@ pgstathashindex(PG_FUNCTION_ARGS) /* Done accessing the index */ index_close(rel, AccessShareLock); - /* Count zero pages as free space. */ - stats.free_space += stats.zero_pages * stats.space_per_page; + /* Count zero and unused pages as free space. 
*/ + stats.free_space += (stats.zero_pages + stats.unused_pages) * + stats.space_per_page; /* * Total space available for tuples excludes the metapage and the bitmap @@ -711,9 +715,10 @@ pgstathashindex(PG_FUNCTION_ARGS) values[2] = Int64GetDatum((int64) stats.overflow_pages); values[3] = Int64GetDatum((int64) stats.bitmap_pages); values[4] = Int64GetDatum((int64) stats.zero_pages); - values[5] = Int64GetDatum(stats.live_items); - values[6] = Int64GetDatum(stats.dead_items); - values[7] = Float8GetDatum(free_percent); + values[5] = Int64GetDatum((int64) stats.unused_pages); + values[6] = Int64GetDatum(stats.live_items); + values[7] = Int64GetDatum(stats.dead_items); + values[8] = Float8GetDatum(free_percent); tuple = heap_form_tuple(tupleDesc, values, nulls); PG_RETURN_DATUM(HeapTupleGetDatum(tuple)); diff --git a/contrib/pgstattuple/pgstattuple--1.4--1.5.sql b/contrib/pgstattuple/pgstattuple--1.4--1.5.sql index 84e112e..eda719a 100644 --- a/contr
Re: [HACKERS] Page Scan Mode in Hash Index
On Thu, Mar 23, 2017 at 8:29 PM, Jesper Pedersen wrote: > Hi, > > On 03/22/2017 09:32 AM, Ashutosh Sharma wrote: >> >> Done. Please refer to the attached v2 version of patch. >> > > Thanks. > >>>> 1) 0001-Rewrite-hash-index-scans-to-work-a-page-at-a-time.patch: this >>>> patch rewrites the hash index scan module to work in page-at-a-time >>>> mode. It basically introduces two new functions-- _hash_readpage() and >>>> _hash_saveitem(). The former is used to load all the qualifying tuples >>>> from a target bucket or overflow page into an items array. The latter >>>> one is used by _hash_readpage to save all the qualifying tuples found >>>> in a page into an items array. Apart from that, this patch bascially >>>> cleans _hash_first(), _hash_next and hashgettuple(). >>>> > > 0001v2: > > In hashgettuple() you can remove the 'currItem' and 'offnum' from the 'else' > part, and do the assignment inside > > if (so->numKilled < MaxIndexTuplesPerPage) > > instead. > Done. Please have a look into the attached v3 patch. > > No new comments for 0002 and 0003. okay. Thanks. 
Re: [HACKERS] Page Scan Mode in Hash Index
> Hi, > > On 03/23/2017 02:11 PM, Ashutosh Sharma wrote: >> >> On Thu, Mar 23, 2017 at 8:29 PM, Jesper Pedersen >> wrote: >>> >>> 0001v2: >>> >>> In hashgettuple() you can remove the 'currItem' and 'offnum' from the >>> 'else' >>> part, and do the assignment inside >>> >>> if (so->numKilled < MaxIndexTuplesPerPage) >>> >>> instead. >>> >> >> Done. Please have a look into the attached v3 patch. >> >>> >>> No new comments for 0002 and 0003. >> >> >> okay. Thanks. >> > > I'll keep the entry in 'Needs Review' if Alexander, or others, want to add > their feedback. okay. Thanks. > > (Best to post the entire patch series each time) I take your suggestion. Please find the attachments. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From 4e953c35da2274165b00d763500b83e0f3f9e2a9 Mon Sep 17 00:00:00 2001 From: ashu Date: Thu, 23 Mar 2017 23:36:05 +0530 Subject: [PATCH] Rewrite hash index scans to work a page at a timev3 Patch by Ashutosh Sharma --- src/backend/access/hash/README | 9 +- src/backend/access/hash/hash.c | 121 +++-- src/backend/access/hash/hashpage.c | 14 +- src/backend/access/hash/hashsearch.c | 330 ++- src/backend/access/hash/hashutil.c | 23 ++- src/include/access/hash.h| 44 + 6 files changed, 385 insertions(+), 156 deletions(-) diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README index 1541438..f0a7bdf 100644 --- a/src/backend/access/hash/README +++ b/src/backend/access/hash/README @@ -243,10 +243,11 @@ The reader algorithm is: -- then, per read request: reacquire content lock on current page step to next page if necessary (no chaining of content locks, but keep - the pin on the primary bucket throughout the scan; we also maintain - a pin on the page currently being scanned) - get tuple - release content lock + the pin on the primary bucket throughout the scan) +save all the matching tuples from current index page into an items array +release pin and content lock (but if it is primary bucket page retain +it's 
pin till the end of scan) +get tuple from an item array -- at scan shutdown: release all pins still held diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index 34cc08f..8c28fbd 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@ -268,66 +268,23 @@ bool hashgettuple(IndexScanDesc scan, ScanDirection dir) { HashScanOpaque so = (HashScanOpaque) scan->opaque; - Relation rel = scan->indexRelation; - Buffer buf; - Page page; OffsetNumber offnum; - ItemPointer current; bool res; + HashScanPosItem *currItem; /* Hash indexes are always lossy since we store only the hash code */ scan->xs_recheck = true; /* - * We hold pin but not lock on current buffer while outside the hash AM. - * Reacquire the read lock here. - */ - if (BufferIsValid(so->hashso_curbuf)) - LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE); - - /* * If we've already initialized this scan, we can just advance it in the * appropriate direction. If we haven't done so yet, we call a routine to * get the first item in the scan. */ - current = &(so->hashso_curpos); - if (ItemPointerIsValid(current)) + if (!HashScanPosIsValid(so->currPos)) + res = _hash_first(scan, dir); + else { /* - * An insertion into the current index page could have happened while - * we didn't have read lock on it. Re-find our position by looking - * for the TID we previously returned. (Because we hold a pin on the - * primary bucket page, no deletions or splits could have occurred; - * therefore we can expect that the TID still exists in the current - * index page, at an offset >= where we were.) - */ - OffsetNumber maxoffnum; - - buf = so->hashso_curbuf; - Assert(BufferIsValid(buf)); - page = BufferGetPage(buf); - - /* - * We don't need test for old snapshot here as the current buffer is - * pinned, so vacuum can't clean the page. 
- */ - maxoffnum = PageGetMaxOffsetNumber(page); - for (offnum = ItemPointerGetOffsetNumber(current); - offnum <= maxoffnum; - offnum = OffsetNumberNext(offnum)) - { - IndexTuple itup; - - itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum)); - if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid))) -break; - } - if (offnum > maxoffnum) - elog(ERROR, "failed to re-find scan position within index \"%s\"", - RelationGetRelationName(rel)); - ItemPointerSetOffsetNumber(current, offnum); - - /* * Check to see if we should kill the previousl
Re: [HACKERS] pageinspect and hash indexes
On Fri, Mar 24, 2017 at 9:21 AM, Amit Kapila wrote: > On Thu, Mar 23, 2017 at 6:46 PM, Ashutosh Sharma > wrote: >>> >>> Oh, okay, but my main objection was that we should not check hash page >>> type (hasho_flag) without ensuring whether it is a hash page. Can you >>> try to adjust the above code so that this check can be moved after >>> hasho_page_id check? >> >> Yes, I got your point. I have done that but then i had to remove the >> check for PageIsEmpty(). Anyways, I think PageIsEmpty() condition will >> only be true for one page in entire hash index table and can be >> ignored. If you wish, I could mention about it in the documentation. >> > > Yeah, I think it is worth adding a Note in docs, but we can do that > separately if required. > >>> >>>> To avoid >>>> this, at the start of verify_hash_page function itself if we recognise >>>> page as UNUSED page we return immediately. >>>> >>>>> >>>>> 2. >>>>> + /* Check if it is an empty hash page. */ >>>>> + if (PageIsEmpty(page)) >>>>> + ereport(ERROR, >>>>> + (errcode(ERRCODE_INDEX_CORRUPTED), >>>>> + errmsg("index table contains empty page"))); >>>>> >>>>> >>>>> Do we want to give a separate message for EMPTY and NEW pages? Isn't >>>>> it better that the same error message can be given for both of them as >>>>> from user perspective there is not much difference between both the >>>>> messages? >>>> >>>> I think we should show separate message because they are two different >>>> type of pages. In the sense like, one is initialised whereas other is >>>> completely zero. >>>> >>> >>> I understand your point, but not sure if it makes any difference to user. >>> >> >> okay, I have now anyways removed the check for PageIsEmpty. Please >> refer to the attached '0002 >> allow_pageinspect_handle_UNUSED_hash_pages.patch' >> > > + > if (pageopaque->hasho_page_id != HASHO_PAGE_ID) > ereport(ERROR, > > spurious white space. 
> >> >> Also, I have attached >> '0001-Mark-freed-overflow-page-as-UNUSED-pagetype-v2.patch' that >> handles your comment mentioned in [1]. >> > > In general, we have to initialize prevblock with max_bucket, but here > it is okay, because we anyway initialize it after this page is > allocated as overflow page. > > Your patches look good to me except for small white space change. Thanks for reviewing my patch. I have removed the extra white space. Attached are both the patches. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From 86eb711ea80b495c9d2f5228558682d0e95c41eb Mon Sep 17 00:00:00 2001 From: ashu Date: Thu, 23 Mar 2017 16:47:17 +0530 Subject: [PATCH] Mark freed overflow page as UNUSED pagetype v2 --- src/backend/access/hash/hash_xlog.c | 9 + src/backend/access/hash/hashovfl.c | 13 - 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c index d9ac42c..2ccaf46 100644 --- a/src/backend/access/hash/hash_xlog.c +++ b/src/backend/access/hash/hash_xlog.c @@ -697,11 +697,20 @@ hash_xlog_squeeze_page(XLogReaderState *record) if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO) { Page ovflpage; + HashPageOpaque ovflopaque; ovflpage = BufferGetPage(ovflbuf); _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf)); + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage); + + ovflopaque->hasho_prevblkno = InvalidBlockNumber; + ovflopaque->hasho_nextblkno = InvalidBlockNumber; + ovflopaque->hasho_bucket = -1; + ovflopaque->hasho_flag = LH_UNUSED_PAGE; + ovflopaque->hasho_page_id = HASHO_PAGE_ID; + PageSetLSN(ovflpage, lsn); MarkBufferDirty(ovflbuf); } diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c index a3cae21..d3eaee0 100644 --- a/src/backend/access/hash/hashovfl.c +++ b/src/backend/access/hash/hashovfl.c @@ -591,9 +591,20 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf, /* * Initialize 
the freed overflow page. Just zeroing the page won't work, * because WAL replay routines expect pages to be initialized. See - * explanation of RBM_NORMAL mode atop XLogReadBufferExtended. + * explanation of RBM_NORMAL mode atop XLogReadBufferExtended. Also, mark + * the page as UNUSED type and retain it's page id. This allows the tools + * like pageinspect to consider it as a hash page.
Re: [HACKERS] pageinspect and hash indexes
On Fri, Mar 24, 2017 at 9:46 AM, Ashutosh Sharma wrote: > On Fri, Mar 24, 2017 at 9:21 AM, Amit Kapila wrote: >> On Thu, Mar 23, 2017 at 6:46 PM, Ashutosh Sharma >> wrote: >>>> >>>> Oh, okay, but my main objection was that we should not check hash page >>>> type (hasho_flag) without ensuring whether it is a hash page. Can you >>>> try to adjust the above code so that this check can be moved after >>>> hasho_page_id check? >>> >>> Yes, I got your point. I have done that but then i had to remove the >>> check for PageIsEmpty(). Anyways, I think PageIsEmpty() condition will >>> only be true for one page in entire hash index table and can be >>> ignored. If you wish, I could mention about it in the documentation. >>> >> >> Yeah, I think it is worth adding a Note in docs, but we can do that >> separately if required. >> >>>> >>>>> To avoid >>>>> this, at the start of verify_hash_page function itself if we recognise >>>>> page as UNUSED page we return immediately. >>>>> >>>>>> >>>>>> 2. >>>>>> + /* Check if it is an empty hash page. */ >>>>>> + if (PageIsEmpty(page)) >>>>>> + ereport(ERROR, >>>>>> + (errcode(ERRCODE_INDEX_CORRUPTED), >>>>>> + errmsg("index table contains empty page"))); >>>>>> >>>>>> >>>>>> Do we want to give a separate message for EMPTY and NEW pages? Isn't >>>>>> it better that the same error message can be given for both of them as >>>>>> from user perspective there is not much difference between both the >>>>>> messages? >>>>> >>>>> I think we should show separate message because they are two different >>>>> type of pages. In the sense like, one is initialised whereas other is >>>>> completely zero. >>>>> >>>> >>>> I understand your point, but not sure if it makes any difference to user. >>>> >>> >>> okay, I have now anyways removed the check for PageIsEmpty. 
Please >>> refer to the attached '0002 >>> allow_pageinspect_handle_UNUSED_hash_pages.patch' >>> >> >> + >> if (pageopaque->hasho_page_id != HASHO_PAGE_ID) >> ereport(ERROR, >> >> spurious white space. >> >>> >>> Also, I have attached >>> '0001-Mark-freed-overflow-page-as-UNUSED-pagetype-v2.patch' that >>> handles your comment mentioned in [1]. >>> >> >> In general, we have to initialize prevblock with max_bucket, but here >> it is okay, because we anyway initialize it after this page is >> allocated as overflow page. >> >> Your patches look good to me except for small white space change. > > Thanks for reviewing my patch. I have removed the extra white space. > Attached are both the patches. Sorry, I have mistakenly attached wrong patch. Here are the correct set of patches. -- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From 86eb711ea80b495c9d2f5228558682d0e95c41eb Mon Sep 17 00:00:00 2001 From: ashu Date: Thu, 23 Mar 2017 16:47:17 +0530 Subject: [PATCH] Mark freed overflow page as UNUSED pagetype v2 --- src/backend/access/hash/hash_xlog.c | 9 + src/backend/access/hash/hashovfl.c | 13 - 2 files changed, 21 insertions(+), 1 deletion(-) diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c index d9ac42c..2ccaf46 100644 --- a/src/backend/access/hash/hash_xlog.c +++ b/src/backend/access/hash/hash_xlog.c @@ -697,11 +697,20 @@ hash_xlog_squeeze_page(XLogReaderState *record) if (XLogReadBufferForRedo(record, 2, &ovflbuf) == BLK_NEEDS_REDO) { Page ovflpage; + HashPageOpaque ovflopaque; ovflpage = BufferGetPage(ovflbuf); _hash_pageinit(ovflpage, BufferGetPageSize(ovflbuf)); + ovflopaque = (HashPageOpaque) PageGetSpecialPointer(ovflpage); + + ovflopaque->hasho_prevblkno = InvalidBlockNumber; + ovflopaque->hasho_nextblkno = InvalidBlockNumber; + ovflopaque->hasho_bucket = -1; + ovflopaque->hasho_flag = LH_UNUSED_PAGE; + ovflopaque->hasho_page_id = HASHO_PAGE_ID; + PageSetLSN(ovflpage, lsn); MarkBufferDirty(ovflbuf); 
} diff --git a/src/backend/access/hash/hashovfl.c b/src/backend/access/hash/hashovfl.c index a3cae21..d3eaee0 100644 --- a/src/backend/access/hash/hashovfl.c +++ b/src/backend/access/hash/hashovfl.c @@ -591,9 +591,20 @@ _hash_freeovflpage(Relation rel, Buffer bucketbuf, Buffer ovflbuf,
Re: [HACKERS] segfault in hot standby for hash indexes
>>> > I think this will work, but not sure if there is a merit to deviate >>> > from what btree does to handle this case. One thing I find slightly >>> > awkward in hash_xlog_vacuum_get_latestRemovedXid() is that you are >>> > using a number of tuples registered as part of fixed data >>> > (xl_hash_vacuum_one_page) to traverse the data registered as buf data. >>> > I think it will be better if we register offsets also in fixed part of >>> > data as we are doing btree case. >> >> Agreed. I have made the changes accordingly. Please check attached v2 patch. >> > > Changes look good to me. I think you can modify the comments in > structure xl_hash_vacuum_one_page to mention "TARGET OFFSET NUMBERS > FOLLOW AT THE END" > Added the comment in xl_hash_vacuum_one_page structure. >>> >>> > >>> > >>> >>> Also another small point in this regard, do we need two separate >>> variables to track number of deleted items in below code? I think one >>> variable is sufficient. >>> >>> _hash_vacuum_one_page() >>> { >>> .. >>> deletable[ndeletable++] = offnum; >>> tuples_removed += 1;-- >>> >> >> Yes, I think 'ndeletable' alone should be fine. >> > > I think it would have been probably okay to use *int* for ntuples as > that matches with what you are actually assigning in the function. okay, corrected it. Attached is newer version of patch. 
-- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com diff --git a/src/backend/access/hash/hash_xlog.c b/src/backend/access/hash/hash_xlog.c index da4c2c5..2ccaf46 100644 --- a/src/backend/access/hash/hash_xlog.c +++ b/src/backend/access/hash/hash_xlog.c @@ -966,8 +966,6 @@ hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record) OffsetNumber hoffnum; TransactionId latestRemovedXid = InvalidTransactionId; int i; - char *ptr; - Size len; xlrec = (xl_hash_vacuum_one_page *) XLogRecGetData(record); @@ -986,12 +984,20 @@ hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record) return latestRemovedXid; /* + * Check if WAL replay has reached a consistent database state. If not, + * we must PANIC. See the definition of btree_xlog_delete_get_latestRemovedXid + * for more details. + */ + if (!reachedConsistency) + elog(PANIC, "hash_xlog_vacuum_get_latestRemovedXid: cannot operate with inconsistent data"); + + /* * Get index page. If the DB is consistent, this should not fail, nor * should any of the heap page fetches below. If one does, we return * InvalidTransactionId to cancel all HS transactions. That's probably * overkill, but it's safe, and certainly better than panicking here. */ - XLogRecGetBlockTag(record, 1, &rnode, NULL, &blkno); + XLogRecGetBlockTag(record, 0, &rnode, NULL, &blkno); ibuffer = XLogReadBufferExtended(rnode, MAIN_FORKNUM, blkno, RBM_NORMAL); if (!BufferIsValid(ibuffer)) @@ -1003,9 +1009,7 @@ hash_xlog_vacuum_get_latestRemovedXid(XLogReaderState *record) * Loop through the deleted index items to obtain the TransactionId from * the heap items they point to. 
*/ - ptr = XLogRecGetBlockData(record, 1, &len); - - unused = (OffsetNumber *) ptr; + unused = (OffsetNumber *) ((char *) xlrec + SizeOfHashVacuumOnePage); for (i = 0; i < xlrec->ntuples; i++) { @@ -1130,23 +1134,15 @@ hash_xlog_vacuum_one_page(XLogReaderState *record) if (action == BLK_NEEDS_REDO) { - char *ptr; - Size len; - - ptr = XLogRecGetBlockData(record, 0, &len); - page = (Page) BufferGetPage(buffer); - if (len > 0) + if (XLogRecGetDataLen(record) > SizeOfHashVacuumOnePage) { OffsetNumber *unused; - OffsetNumber *unend; - unused = (OffsetNumber *) ptr; - unend = (OffsetNumber *) ((char *) ptr + len); + unused = (OffsetNumber *) ((char *) xldata + SizeOfHashVacuumOnePage); - if ((unend - unused) > 0) -PageIndexMultiDelete(page, unused, unend - unused); + PageIndexMultiDelete(page, unused, xldata->ntuples); } /* diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c index 8640e85..8699b5b 100644 --- a/src/backend/access/hash/hashinsert.c +++ b/src/backend/access/hash/hashinsert.c @@ -344,7 +344,6 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf, Page page = BufferGetPage(buf); HashPageOpaque pageopaque; HashMetaPage metap; - double tuples_removed = 0; /* Scan each tuple in page to see if it is marked as LP_DEAD */ maxoff = PageGetMaxOffsetNumber(page); @@ -355,10 +354,7 @@ _hash_vacuum_one_page(Relation rel, Buffer metabuf, Buffer buf, ItemId itemId = PageGetItemId(page, offnum); if (ItemIdIsDead(itemId)) - { deletable[ndeletable++] = off
Re: [HACKERS] Supporting huge pages on Windows
Hi, On Fri, Mar 24, 2017 at 9:00 PM, David Steele wrote: > Hi Ashutosh, > > On 3/22/17 8:52 AM, Amit Kapila wrote: >> >> On Wed, Mar 22, 2017 at 12:07 AM, David Steele >> wrote: >>> >>> >>> Amit, Magnus, you are signed up as reviewers for this patch. Do you know >>> when you'll have a chance to take a look? >>> >> >> I have provided my feedback and I could not test it on my machine. I >> think Ashutosh Sharma has tested it. I can give it another look, but >> not sure if it helps. > > > I know you are not signed up as a reviewer, but have you taken a look at this > patch? Well, I had a quick look into the patch, but only because I wanted to test it, as I have a Windows setup. I never looked into the patch from a reviewer's perspective; the intention was to reverify the test results shared by Tsunakawa. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] pageinspect and hash indexes
Thanks for reviewing my patch. I have removed the extra white space. Attached are both the patches. >>> >>> Sorry, I have mistakenly attached wrong patch. Here are the correct >>> set of patches. >> >> The latest version of patches looks fine to me. > > I don't really like 0002. What about this, instead? > > --- a/contrib/pageinspect/hashfuncs.c > +++ b/contrib/pageinspect/hashfuncs.c > @@ -80,7 +80,8 @@ verify_hash_page(bytea *raw_page, int flags) > /* Check that page type is sane. */ > pagetype = pageopaque->hasho_flag & LH_PAGE_TYPE; > if (pagetype != LH_OVERFLOW_PAGE && pagetype != LH_BUCKET_PAGE && > -pagetype != LH_BITMAP_PAGE && pagetype != LH_META_PAGE) > +pagetype != LH_BITMAP_PAGE && pagetype != LH_META_PAGE && > +pagetype != LH_UNUSED_PAGE) > ereport(ERROR, > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > errmsg("invalid hash page type %08x", pagetype))); > > The advantage of that is (1) it won't get confused if in the future we > have an unused page that has some flag bit not in LH_PAGE_TYPE set, > and (2) if in the future we want to add any other checks to this > function which should apply to unused pages also, they won't get > bypassed by an early return statement. Agreed. Moreover, previous approach might even change the current behaviour of functions like hash_page_items() and hash_page_stats() basically when we pass UNUSED pages to these functions. Attached is the newer version of patch. From 5c54fa58895cd279ff44424a3eb1feafdaca69d5 Mon Sep 17 00:00:00 2001 From: ashu Date: Sat, 25 Mar 2017 01:02:35 +0530 Subject: [PATCH] Allow pageinspect to handle UNUSED hash pages v3 --- contrib/pageinspect/hashfuncs.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/contrib/pageinspect/hashfuncs.c b/contrib/pageinspect/hashfuncs.c index 812a03f..2632287 100644 --- a/contrib/pageinspect/hashfuncs.c +++ b/contrib/pageinspect/hashfuncs.c @@ -80,7 +80,8 @@ verify_hash_page(bytea *raw_page, int flags) /* Check that page type is sane. 
*/ pagetype = pageopaque->hasho_flag & LH_PAGE_TYPE; if (pagetype != LH_OVERFLOW_PAGE && pagetype != LH_BUCKET_PAGE && - pagetype != LH_BITMAP_PAGE && pagetype != LH_META_PAGE) + pagetype != LH_BITMAP_PAGE && pagetype != LH_META_PAGE && + pagetype != LH_UNUSED_PAGE) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("invalid hash page type %08x", pagetype))); -- 1.8.3.1
Re: [HACKERS] Add pgstathashindex() to get hash index table statistics.
On Sat, Mar 25, 2017 at 11:02 AM, Amit Kapila wrote: > On Thu, Mar 23, 2017 at 11:24 PM, Ashutosh Sharma > wrote: >> Hi, >> >> On Tue, Feb 7, 2017 at 9:23 AM, Robert Haas wrote: >>> On Mon, Feb 6, 2017 at 10:40 PM, Amit Kapila >>> wrote: >>>>> Maybe we should call them "unused pages". >>>> >>>> +1. If we consider some more names for that column then probably one >>>> alternative could be "empty pages". >>> >>> Yeah, but I think "unused" might be better. Because a page could be >>> in use (as an overflow page or primary bucket page) and still be >>> empty. >>> >> >> Based on the earlier discussions, I have prepared a patch that would >> allow pgstathashindex() to show the number of unused pages in hash >> index. Please find the attached patch for the same. Thanks. >> > > else if (opaque->hasho_flag & LH_BITMAP_PAGE) > stats.bitmap_pages++; > + else if (PageIsEmpty(page)) > + stats.unused_pages++; > > I think having this check after PageIsNew() makes more sense then > having at the place where you currently have, I don't think having it just below PageIsNew() check would be good. The reason being, there can be a bucket page which is in use but still be empty. So, if we add a check just below PageIsNew check then even though a page is marked as bucket page and is empty it will be shown as unused page which i feel is not correct. Here is one simple example that illustrates this point, Query: == select hash_page_type(get_raw_page('con_hash_index', 17)); gdb shows following info for this block number 17, (gdb) p *(PageHeader)page $1 = {pd_lsn = {xlogid = 0, xrecoff = 122297104}, pd_checksum = 0, pd_flags = 0, pd_lower = 24,pd_upper = 8176, pd_special = 8176, pd_pagesize_version = 8196, pd_prune_xid = 0,pd_linp = 0x22a82d8} (gdb) p((HashPageOpaque) PageGetSpecialPointer(page))->hasho_flag $2 = 66 (gdb) p(((HashPageOpaque) PageGetSpecialPointer(page))->hasho_flag & LH_BUCKET_PAGE) $3 = 2 >From above information it is clear that this page is a bucket page and is empty. 
I think it should be shown as a bucket page rather than an unused page. Also, this has already been discussed in [1]. > other than that code-wise your patch looks okay, although I haven't > tested it. I have tested it properly and didn't find any issues. > I think this should also be tracked under PostgreSQL 10 open items, > but as this is not directly a bug, so not sure, what do others think? Well, even I am not sure whether this has to be added to the open items list or not. Thanks. [1] - https://www.postgresql.org/message-id/CA%2BTgmobwLKpKe99qnTJCBzFB4UaVGKrLNX3hwp8wcxObMDx7pA%40mail.gmail.com -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] comments in hash_alloc_buckets
>> While working on - [1], I realised that the following comments in >> _hash_alloc_buckets() need to be corrected. >> >> /* >> * Initialize the freed overflow page. Just zeroing the page won't work, >> * See _hash_freeovflpage for similar usage. >> */ >> _hash_pageinit(page, BLCKSZ); >> >> Attached is the patch that corrects the above comment. Thanks. >> > > - * Initialize the freed overflow page. Just zeroing the page won't work, > + * Initialize the last page in hash index. > > I think saying ".. last page in hash index" sounds slightly awkward as > this is the last page for current split point, how about just > "Initialize the page. ..." Yes, just saying "Initialize the page. ..." is simpler and correct. Attached is the patch with that comment. diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c index 622cc4b..4fd9cbc 100644 --- a/src/backend/access/hash/hashpage.c +++ b/src/backend/access/hash/hashpage.c @@ -1002,7 +1002,7 @@ _hash_alloc_buckets(Relation rel, BlockNumber firstblock, uint32 nblocks) page = (Page) zerobuf; /* - * Initialize the freed overflow page. Just zeroing the page won't work, + * Initialize the page. Just zeroing the page won't work, * See _hash_freeovflpage for similar usage. */ _hash_pageinit(page, BLCKSZ);
Re: [HACKERS] Add pgstathashindex() to get hash index table statistics.
>>>>>> >>>>>> +1. If we consider some more names for that column then probably one >>>>>> alternative could be "empty pages". >>>>> >>>>> Yeah, but I think "unused" might be better. Because a page could be >>>>> in use (as an overflow page or primary bucket page) and still be >>>>> empty. >>>>> >>>> >>>> Based on the earlier discussions, I have prepared a patch that would >>>> allow pgstathashindex() to show the number of unused pages in hash >>>> index. Please find the attached patch for the same. Thanks. >>>> >>> >>> else if (opaque->hasho_flag & LH_BITMAP_PAGE) >>> stats.bitmap_pages++; >>> + else if (PageIsEmpty(page)) >>> + stats.unused_pages++; >>> >>> I think having this check after PageIsNew() makes more sense than >>> having it at the place where you currently have it, >> >> I don't think having it just below the PageIsNew() check would be good. >> The reason being, there can be a bucket page which is in use but still >> be empty. So, if we add a check just below the PageIsNew check then even >> though a page is marked as a bucket page and is empty it will be shown >> as an unused page, which I feel is not correct. Here is one simple example >> that illustrates this point, >> > > oh, is it a page where all the items have been deleted and no new > items have been inserted? Yes, it is a page from which items have been removed and no new insertion has happened. > The reason why I have told that place is not appropriate is because > all the other checks in nearby code are on the page type and this > check looks odd at that place, but we might need to do this way if > there is no other clean solution. I got your point, but then I think that is the only solution we have as of now. -- With Regards, Ashutosh Sharma EnterpriseDB: http://www.enterprisedb.com
Re: [HACKERS] [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint
Hi, > testing with master as of cf366e97ff, sqlsmith occasionally triggers the > following assertion: > > TRAP: FailedAssertion("!(LWLockHeldByMe(((LWLock*) > (&(bufHdr)->content_lock", File: "bufmgr.c", Line: 3397) > > Backtraces always look like the one below. It is reproducible on a > cluster once it happens. I could provide a tarball if needed. > > regards, > Andreas > > #2 0x008324b1 in ExceptionalCondition > (conditionName=conditionName@entry=0x9e4e28 "!(LWLockHeldByMe(((LWLock*) > (&(bufHdr)->content_lock", errorType=errorType@entry=0x87b03d > "FailedAssertion", fileName=fileName@entry=0x9e5856 "bufmgr.c", > lineNumber=lineNumber@entry=3397) at assert.c:54 > #3 0x00706971 in MarkBufferDirtyHint (buffer=2844, > buffer_std=buffer_std@entry=1 '\001') at bufmgr.c:3397 > #4 0x004b3ecd in _hash_kill_items (scan=scan@entry=0x66dcf70) at > hashutil.c:514 > #5 0x004a9c1b in hashendscan (scan=0x66dcf70) at hash.c:512 > #6 0x004cf17a in index_endscan (scan=0x66dcf70) at indexam.c:353 > #7 0x0061fa51 in ExecEndIndexScan (node=0x3093f30) at > nodeIndexscan.c:852 > #8 0x00608e59 in ExecEndNode (node=) at > execProcnode.c:715 > #9 0x006045b8 in ExecEndPlan (estate=0x3064000, planstate= out>) at execMain.c:1540 > #10 standard_ExecutorEnd (queryDesc=0x30cb880) at execMain.c:487 > #11 0x005c87b0 in PortalCleanup (portal=0x1a60060) at portalcmds.c:302 > #12 0x0085cbb3 in PortalDrop (portal=0x1a60060, > isTopCommit=) at portalmem.c:489 > #13 0x00736ed2 in exec_simple_query (query_string=0x315b7a0 "...") at > postgres.c: > #14 0x00738b51 in PostgresMain (argc=, > argv=argv@entry=0x1a6c6c8, dbname=, username=) > at postgres.c:4071 > #15 0x00475fef in BackendRun (port=0x1a65b90) at postmaster.c:4317 > #16 BackendStartup (port=0x1a65b90) at postmaster.c:3989 > #17 ServerLoop () at postmaster.c:1729 > #18 0x006c8662 in PostmasterMain (argc=argc@entry=4, > argv=argv@entry=0x1a3f540) at postmaster.c:1337 > #19 0x0047729d in main (argc=4, argv=0x1a3f540) at main.c:228 > Hi, Thanks for 
reporting this problem. Could you please let me know how long you had to run sqlsmith to get this crash? I have found the reason for it: it happens when retrieving tuples through a cursor. The current hash index scan works tuple-at-a-time, which means that once it finds a tuple on a page, it releases the lock on that page but keeps the pin, and returns the tuple. Once the requested number of tuples has been processed, there is no lock on the page being scanned, but there is still a pin on it. Finally, when the cursor is closed at the end of the scan, if any killed tuples have been identified we first try to mark those items as dead with the help of _hash_kill_items(). But since we only hold a pin on the page, the assertion check 'LWLockHeldByMe()' fails. When scanning tuples with a normal SELECT * statement, before moving to the next page in a bucket we deal with all the killed items without releasing the lock and pin on the current page; hence, with plain SELECT queries this crash is not visible. The attached patch fixes this. But please note that all these changes will be removed by the patch for page scan mode - [1].
[1] - https://www.postgresql.org/message-id/CA%2BTgmobYTvexcjqMhXoNCyEUHChzmdC_2xVGgj7eqaYVgoJA%2Bg%40mail.gmail.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 34cc08f..0bacef8 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -478,7 +478,7 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 
 	/* Before leaving current page, deal with any killed items */
 	if (so->numKilled > 0)
-		_hash_kill_items(scan);
+		_hash_kill_items(scan, false);
 
 	_hash_dropscanbuf(rel, so);
 
@@ -509,7 +509,7 @@ hashendscan(IndexScanDesc scan)
 
 	/* Before leaving current page, deal with any killed items */
 	if (so->numKilled > 0)
-		_hash_kill_items(scan);
+		_hash_kill_items(scan, false);
 
 	_hash_dropscanbuf(rel, so);
 
diff --git a/src/backend/access/hash/hashsearch.c b/src/backend/access/hash/hashsearch.c
index 2d92049..414cc6a 100644
--- a/src/backend/access/hash/hashsearch.c
+++ b/src/backend/access/hash/hashsearch.c
@@ -467,7 +467,7 @@ _hash_step(IndexScanDesc scan, Buffer *bufP, ScanDirection dir)
 
 	/* Before leaving current page, deal with any killed items */
 	if (so->numKilled > 0)
-		_hash_kill_items(scan);
+		_hash_kill_items(scan, true);
 
 	/*
 	 * ran off the end of this page, try
Re: [HACKERS] Page Scan Mode in Hash Index
Hi,

> I think you should consider refactoring this so that it doesn't need
> to use goto. Maybe move the while (offnum <= maxoff) logic into a
> helper function and have it return itemIndex. If itemIndex == 0, you
> can call it again.

Okay, I have added a helper function for _hash_readpage(). Please check the v4 patch attached with this mail.

> Notice that the code in the "else" branch of the if (itemIndex == 0)
> stuff could actually be removed from the else block without changing
> anything, because the "if" block always either returns or does a goto.
> That's worthy of a little more work to try to make things simple and
> clear.

Corrected.

> + * We find the first item(or, if backward scan, the last item) in
>
> Missing space.

Corrected.

> In _hash_dropscanbuf(), the handling of hashso_bucket_buf is now
> inconsistent with the handling of hashso_split_bucket_buf, which seems
> suspicious.

I have corrected it.

> Suppose we enter this function with so->hashso_bucket_buf
> and so->currPos.buf both being valid buffers, but not the same one.
> It looks to me as if we'll release the first pin but not the second
> one.

Yes, that is because we do not need to release the pin on currPos.buf if it is an overflow page. In page scan mode, once we have saved all the qualified tuples from the current page (which could be an overflow page) into the items array, we release both the pin and the lock on that overflow page. This was not true earlier: consider a case where we fetch only a fixed number of tuples from a page using a cursor. Once the requested number of tuples had been fetched, we simply returned without releasing the pin on the page being scanned. If that page was an overflow page, then by the end of the scan we would end up holding pins on two buffers, i.e. the bucket buf and the current buf (the overflow buf).
> My guess (which could be wrong) is that so->hashso_bucket_buf =
> InvalidBuffer should be moved back up higher in the function where it
> was before, just after the first if statement, and that the new
> condition so->hashso_bucket_buf == so->currPos.buf on the last
> _hash_dropbuf() should be removed. If that's not right, then I think
> I need somebody to explain why not.

Okay. As I mentioned above, in page scan mode we only keep a pin on the bucket buf; there is no case where we would still be holding a pin on an overflow buf at the end of the scan. So _hash_dropscanbuf() should only attempt to release the pin on the bucket buf, and an attempt to release the pin on so->currPos.buf should only happen when 'so->hashso_bucket_buf == so->currPos.buf' or when 'so->hashso_split_bucket_buf == so->currPos.buf'.

> I am suspicious that _hash_kill_items() is going to have problems if
> the overflow page is freed before it reacquires the lock.
> _bt_killitems() contains safeguards against similar cases.

I have added a similar check to _hash_kill_items() as well.

> This is not a full review, but I'm out of time for the moment.

No worries. I will be ready for your further review comments any time. Thanks for the review.
-- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com From 3d0273f503d1645d6289bda78946a0af4b9e9f3a Mon Sep 17 00:00:00 2001 From: ashu Date: Mon, 27 Mar 2017 18:22:15 +0530 Subject: [PATCH] Rewrite hash index scans to work a page at a timev4 Patch by Ashutosh Sharma --- src/backend/access/hash/README | 9 +- src/backend/access/hash/hash.c | 124 ++ src/backend/access/hash/hashpage.c | 17 +- src/backend/access/hash/hashsearch.c | 445 +++ src/backend/access/hash/hashutil.c | 42 +++- src/include/access/hash.h| 46 +++- 6 files changed, 515 insertions(+), 168 deletions(-) diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README index 1541438..f0a7bdf 100644 --- a/src/backend/access/hash/README +++ b/src/backend/access/hash/README @@ -243,10 +243,11 @@ The reader algorithm is: -- then, per read request: reacquire content lock on current page step to next page if necessary (no chaining of content locks, but keep - the pin on the primary bucket throughout the scan; we also maintain - a pin on the page currently being scanned) - get tuple - release content lock + the pin on the primary bucket throughout the scan) +save all the matching tuples from current index page into an items array +release pin and content lock (but if it is primary bucket page retain +it's pin till the end of scan) +get tuple from an item array -- at scan shutdown: release all pins still held diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index 0bacef8..bd2827a 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@
Re: [HACKERS] segfault in hot standby for hash indexes
>>>> I think it would have been probably okay to use *int* for ntuples as
>>>> that matches with what you are actually assigning in the function.
>>>
>>> okay, corrected it. Attached is newer version of patch.
>>
>> Thanks, this version looks good to me.
>
> It solves the problem for me.

Great, thanks for confirming.

> I'd like to test that I get the right answer on the standby, not just
> the absence of a crash, but I don't know how to do that effectively.
> Has anyone used the new wal replay block consistency tool on hash
> indexes since this microvacuum code was committed?

Yes, I have used it for hash indexes, but only after the below commit:

commit 9abbf4727de746222ad8fc15b17348065389ae43
Author: Robert Haas
Date: Mon Mar 20 15:55:27 2017 -0400

    Another fix for single-page hash index vacuum.

    The WAL consistency checking code needed to be updated for the new
    page status bit, but that didn't get done previously.

All I did was set wal_consistency_checking = 'hash' in postgresql.conf and run the test cases on the primary server. If there is any inconsistent block on the standby, the tool terminates the recovery process and you will see the following message in the standby's server logfile:

"inconsistent page found, rel %u/%u/%u, forknum %u, blkno %u"

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
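For anyone wanting to reproduce this check, the setup described above boils down to a single developer GUC on the primary (a postgresql.conf sketch; 'all' would check every resource manager at the cost of extra WAL volume):

```
# postgresql.conf on the primary
wal_consistency_checking = 'hash'   # or 'all' for every resource manager
```

Each WAL record for the listed resource managers then carries a full-page image that the standby compares against the page it reconstructed during replay.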
Re: [HACKERS] segfault in hot standby for hash indexes
On Mar 27, 2017 22:25, "Robert Haas" wrote: On Fri, Mar 24, 2017 at 3:49 AM, Amit Kapila wrote: > On Fri, Mar 24, 2017 at 12:25 PM, Ashutosh Sharma wrote: >>> >>> I think it would have been probably okay to use *int* for ntuples as >>> that matches with what you are actually assigning in the function. >> >> okay, corrected it. Attached is newer version of patch. > > Thanks, this version looks good to me. Committed. Thank you.
Re: [HACKERS] [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint
On Mar 28, 2017 00:00, "Andreas Seltenreich" wrote:

Ashutosh Sharma writes:
>>> TRAP: FailedAssertion("!(LWLockHeldByMe(((LWLock*) (&(bufHdr)->content_lock", File: "bufmgr.c", Line: 3397)
>> Thanks for reporting this problem. Could you please let me know how
>> long you had to run sqlsmith to get this crash.
>
> I got about one TRAP per hour when testing on 20 nodes with one
> postgres and 5 sqlsmithes on each. Nodes are tiny consumer machines
> with low-power 4-core sandy bridges.

Okay, thanks for sharing this information. I tried running sqlsmith on a single node for an hour but didn't find any crash. Let me also try running multiple sqlsmith instances at a time - maybe 5 on a single node, like you are doing.

> [2. reacquire_lock_hashkillitems_if_required.patch]
>
> I'll test with your patch applied as soon as time permits and report
> back.

Sure, thanks a lot.

With Regards, Ashutosh Sharma
Re: [HACKERS] Page Scan Mode in Hash Index
>> I think you should consider refactoring this so that it doesn't need
>> to use goto. Maybe move the while (offnum <= maxoff) logic into a
>> helper function and have it return itemIndex. If itemIndex == 0, you
>> can call it again.
>
> Okay, I have added a helper function for _hash_readpage(). Please
> check the v4 patch attached with this mail.
>
>> This is not a full review, but I'm out of time for the moment.
>
> No worries. I will be ready for your further review comments any time.
> Thanks for the review.

> This patch needs a rebase.

Please try applying these patches on top of [1]; I think you should be able to apply them cleanly. Sorry, I think I forgot to mention this point in my earlier mail.

[1] - https://www.postgresql.org/message-id/CAE9k0P%3DV2LhtyeMXd295fhisp%3DNWUhRVJ9EZQCDowWiY9rSohQ%40mail.gmail.com
[HACKERS] inconsistent page found on STANDBY server
Hi All,

While running make installcheck against a master with WAL consistency checking enabled, an inconsistent page is detected on the standby. I could see the following FATAL message in the standby server's logfile:

2017-03-30 07:31:10.101 BST [27994] LOG: entering standby mode
2017-03-30 07:31:10.106 BST [27994] LOG: redo starts at 0/224
2017-03-30 07:31:10.108 BST [27994] LOG: consistent recovery state reached at 0/2E4
2017-03-30 07:31:10.108 BST [27992] LOG: database system is ready to accept read only connections
2017-03-30 07:31:10.113 BST [27998] LOG: started streaming WAL from primary at 0/300 on timeline 1
2017-03-30 07:33:19.040 BST [27994] FATAL: inconsistent page found, rel 1663/13157/16391, forknum 0, blkno 0
2017-03-30 07:33:19.040 BST [27994] CONTEXT: WAL redo at 0/351CF03C for Hash/UPDATE_META_PAGE: ntuples -nan
2017-03-30 07:33:19.041 BST [27992] LOG: startup process (PID 27994) exited with exit code 1

Steps to reproduce:
===
1) PG v10 sources
2) Set up master/slave replication
3) Run make installcheck on the master
4) Check the database logs generated in the slave's data directory

Please note that the above issue is observed only on a 32-bit Linux machine. It was reported to me offlist by Tushar Ahuja, who also allowed me to use his 32-bit Linux machine to analyse the problem. I also had a small offlist discussion with Amit (included in this mail) while analysing it.

RCA:
After debugging the hash index deletion code, I found that when registering data for the xlog record 'XLOG_HASH_UPDATE_META_PAGE' we are not passing the correct length of the data being registered, and therefore the data (xl_hash_update_meta_page) is not completely recorded into the WAL record.

Fix:
===
The attached patch fixes this issue.
--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 34cc08f..a5798bd 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -691,7 +691,7 @@ loop_top:
 	xlrec.ntuples = metap->hashm_ntuples;
 
 	XLogBeginInsert();
-	XLogRegisterData((char *) &xlrec, sizeof(SizeOfHashUpdateMetaPage));
+	XLogRegisterData((char *) &xlrec, SizeOfHashUpdateMetaPage);
 
 	XLogRegisterBuffer(0, metabuf, REGBUF_STANDARD);

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] PATCH: pageinspect / add page_checksum and bt_page_items(bytea)
Hi,

On Wed, Mar 29, 2017 at 8:38 PM, Tomas Vondra wrote:
> On 03/24/2017 04:27 AM, Peter Eisentraut wrote:
>> On 3/17/17 18:35, Tomas Vondra wrote:
>>> On 03/17/2017 05:23 PM, Peter Eisentraut wrote:
>>>> I'm struggling to find a good way to share code between
>>>> bt_page_items(text, int4) and bt_page_items(bytea).
>>>>
>>>> If we do it via the SQL route, as I had suggested, it makes the
>>>> extension non-relocatable, and it will also create a bit of a mess
>>>> during upgrades.
>>>>
>>>> If doing it in C, it will be a bit tricky to pass the SRF context
>>>> around. There is no "DirectFunctionCall within SRF context", AFAICT.
>>>
>>> Not sure what it has to do with DirectFunctionCall? You want to call
>>> the bytea variant from the existing one? Wouldn't it be easier to
>>> simply define a static function with the shared parts, and pass
>>> around the fctx/fcinfo? Not quite pretty, but should work.
>>
>> Perhaps what was added in
>> <http://git.postgresql.org/pg/commitdiff/29bf5016835a2c2c23786f7cda347716c083d95f>
>> would actually work here.
>
> I've tried to refactor the code using this, but the result was rather
> ugly because (a) it really is quite tricky to pass around the contexts
> and (b) the sanity checks are quite different for the two input types,
> so mixing those parts (essentially the SRF_IS_FIRSTCALL bits) does not
> make much sense IMHO.
>
> The attached patch is the best I came up with - it essentially shares
> just the tuple-forming part, which is exactly the same in both cases.
>
> It also adds the P_ISMETA(opaque) check to the original function, which
> seems like a useful defense against a page written to a different place
> (which is essentially the issue I was originally investigating).

It seems that the latest patch (v4) shared by Tomas upthread is empty. If I am not mistaken, please share the correct patch. Thanks.
-- With Regards, Ashutosh Sharma EnterpriseDB:http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint
On Sat, Apr 1, 2017 at 1:08 AM, Robert Haas wrote:
> On Mon, Mar 27, 2017 at 5:39 AM, Ashutosh Sharma wrote:
>> Thanks for reporting this problem. Could you please let me know how
>> long you had to run sqlsmith to get this crash. Meanwhile, I have
>> found the reason for it. This is basically happening when trying to
>> retrieve the tuples using a cursor. The current hash index scan
>> works tuple-at-a-time, which means once it finds a tuple on a page,
>> it releases the lock on the page but keeps the pin, and finally
>> returns the tuple. When the requested number of tuples has been
>> processed, there is no lock on the page that was being scanned, but
>> there is still a pin on it. Finally, when trying to close the cursor
>> at the end of the scan, if any killed tuples have been identified we
>> first try to mark those items as dead with the help of
>> _hash_kill_items(). But since we only have a pin on this page, the
>> assert check 'LWLockHeldByMe()' fails.
>
> Instead of adding bool haveLock to _hash_kill_items, how about just
> having the caller acquire the lock before calling the function and
> release it afterwards? Since the acquire is at the beginning of the
> function and the release at the end, there seems to be no point in
> putting the acquire/release inside the function rather than in the
> caller.

Well, that is another option, but the main idea was to stay in line with the btree code. Moreover, I am not sure if acquiring an lwlock inside hashendscan (basically the place where we are closing down the scan) would look good.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint
On Sat, Apr 1, 2017 at 6:00 AM, Robert Haas wrote:
> On Fri, Mar 31, 2017 at 6:09 PM, Ashutosh Sharma wrote:
>> Well, that is another option, but the main idea was to stay in line
>> with the btree code.
>
> That's not a bad goal in principle, but _bt_killitems() doesn't have
> any similar argument.

It was there, but it later got removed by some patch (maybe the patch for reducing pinning and buffer content locks for btree scans). The following commit has these changes:

commit 09cb5c0e7d6fbc9dee26dc429e4fc0f2a88e5272
Author: Tom Lane
Date: Sun May 7 01:21:30 2006 +

> (Also, speaking of consistency, why did we end up with
> _hash_kill_items, with an underscore between kill and items, and
> _bt_killitems, without one?)

That is just to follow the naming convention in hash.h. Please check the review comments for this at [1].

>> Moreover, I am not sure if acquiring an lwlock inside
>> hashendscan (basically the place where we are closing down the scan)
>> would look good.
>
> Well, if that's not a good thing to do, hiding it inside some other
> function doesn't make it better.

Okay, agreed. I will submit the patch very shortly.

[1] - https://www.postgresql.org/message-id/cf6abd8a-77b5-f1c7-8e50-5bef461e0522%40redhat.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [sqlsmith] Failed assertion in _hash_kill_items/MarkBufferDirtyHint
Hi,

On Sat, Apr 1, 2017 at 7:06 AM, Ashutosh Sharma wrote:
> On Sat, Apr 1, 2017 at 6:00 AM, Robert Haas wrote:
>> On Fri, Mar 31, 2017 at 6:09 PM, Ashutosh Sharma wrote:
>>> Well, that is another option, but the main idea was to stay in line
>>> with the btree code.
>>
>> That's not a bad goal in principle, but _bt_killitems() doesn't have
>> any similar argument.
>
> It was there, but it later got removed by some patch (maybe the patch
> for reducing pinning and buffer content locks for btree scans). The
> following commit has these changes:
>
> commit 09cb5c0e7d6fbc9dee26dc429e4fc0f2a88e5272
> Author: Tom Lane
> Date: Sun May 7 01:21:30 2006 +
>
>> (Also, speaking of consistency, why did we end up with
>> _hash_kill_items, with an underscore between kill and items, and
>> _bt_killitems, without one?)
>
> That is just to follow the naming convention in hash.h. Please check
> the review comments for this at [1].
>
>>> Moreover, I am not sure if acquiring an lwlock inside
>>> hashendscan (basically the place where we are closing down the scan)
>>> would look good.
>>
>> Well, if that's not a good thing to do, hiding it inside some other
>> function doesn't make it better.
>
> Okay, agreed. I will submit the patch very shortly.

As suggested, I am now acquiring the lock inside the caller function. Attached is the patch with these changes. Thanks.
> [1] - https://www.postgresql.org/message-id/cf6abd8a-77b5-f1c7-8e50-5bef461e0522%40redhat.com

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com

diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c
index 34cc08f..b835f77 100644
--- a/src/backend/access/hash/hash.c
+++ b/src/backend/access/hash/hash.c
@@ -476,9 +476,17 @@ hashrescan(IndexScanDesc scan, ScanKey scankey, int nscankeys,
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* Before leaving current page, deal with any killed items */
+	/*
+	 * Before leaving current page, deal with any killed items.
+	 * Also, ensure that we acquire lock on current page before
+	 * calling _hash_kill_items.
+	 */
 	if (so->numKilled > 0)
+	{
+		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
 		_hash_kill_items(scan);
+		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+	}
 
 	_hash_dropscanbuf(rel, so);
 
@@ -507,9 +515,17 @@ hashendscan(IndexScanDesc scan)
 	HashScanOpaque so = (HashScanOpaque) scan->opaque;
 	Relation	rel = scan->indexRelation;
 
-	/* Before leaving current page, deal with any killed items */
+	/*
+	 * Before leaving current page, deal with any killed items.
+	 * Also, ensure that we acquire lock on current page before
+	 * calling _hash_kill_items.
+	 */
 	if (so->numKilled > 0)
+	{
+		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE);
 		_hash_kill_items(scan);
+		LockBuffer(so->hashso_curbuf, BUFFER_LOCK_UNLOCK);
+	}
 
 	_hash_dropscanbuf(rel, so);

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Page Scan Mode in Hash Index
Hi, > > On 03/29/2017 09:16 PM, Ashutosh Sharma wrote: >>> >>> This patch needs a rebase. >> >> >> Please try applying these patches on top of [1]. I think you should be >> able >> to apply it cleanly. Sorry, I think I forgot to mention this point in my >> earlier mail. >> >> [1] - >> >> https://www.postgresql.org/message-id/CAE9k0P%3DV2LhtyeMXd295fhisp%3DNWUhRVJ9EZQCDowWiY9rSohQ%40mail.gmail.com >> > > Thanks, that works ! > > As you have provided a patch for Robert's comments, and no other review have > been posted I'm moving this patch to "Ready for Committer" for additional > committer feedback. Please find the attached new version of patches for page scan mode in hash index rebased on top of - [1]. [1] - https://www.postgresql.org/message-id/CAE9k0P%3D3rtgUtxopG%2BXQVxgASFzAnGd9sNmYUaj_%3DKeVsKGUdA%40mail.gmail.com From 498723199f4b14ff9917aca13abf30f9ea261ca7 Mon Sep 17 00:00:00 2001 From: ashu Date: Sat, 1 Apr 2017 12:09:46 +0530 Subject: [PATCH] Rewrite hash index scans to work a page at a timev5 Patch by Ashutosh Sharma --- src/backend/access/hash/README | 9 +- src/backend/access/hash/hash.c | 140 ++- src/backend/access/hash/hashpage.c | 17 +- src/backend/access/hash/hashsearch.c | 441 +++ src/backend/access/hash/hashutil.c | 29 ++- src/include/access/hash.h| 44 6 files changed, 509 insertions(+), 171 deletions(-) diff --git a/src/backend/access/hash/README b/src/backend/access/hash/README index 1541438..f0a7bdf 100644 --- a/src/backend/access/hash/README +++ b/src/backend/access/hash/README @@ -243,10 +243,11 @@ The reader algorithm is: -- then, per read request: reacquire content lock on current page step to next page if necessary (no chaining of content locks, but keep - the pin on the primary bucket throughout the scan; we also maintain - a pin on the page currently being scanned) - get tuple - release content lock + the pin on the primary bucket throughout the scan) +save all the matching tuples from current index page into an items array +release pin 
and content lock (but if it is primary bucket page retain +it's pin till the end of scan) +get tuple from an item array -- at scan shutdown: release all pins still held diff --git a/src/backend/access/hash/hash.c b/src/backend/access/hash/hash.c index b835f77..bd2827a 100644 --- a/src/backend/access/hash/hash.c +++ b/src/backend/access/hash/hash.c @@ -268,66 +268,23 @@ bool hashgettuple(IndexScanDesc scan, ScanDirection dir) { HashScanOpaque so = (HashScanOpaque) scan->opaque; - Relation rel = scan->indexRelation; - Buffer buf; - Page page; OffsetNumber offnum; - ItemPointer current; bool res; + HashScanPosItem *currItem; /* Hash indexes are always lossy since we store only the hash code */ scan->xs_recheck = true; /* - * We hold pin but not lock on current buffer while outside the hash AM. - * Reacquire the read lock here. - */ - if (BufferIsValid(so->hashso_curbuf)) - LockBuffer(so->hashso_curbuf, BUFFER_LOCK_SHARE); - - /* * If we've already initialized this scan, we can just advance it in the * appropriate direction. If we haven't done so yet, we call a routine to * get the first item in the scan. */ - current = &(so->hashso_curpos); - if (ItemPointerIsValid(current)) + if (!HashScanPosIsValid(so->currPos)) + res = _hash_first(scan, dir); + else { /* - * An insertion into the current index page could have happened while - * we didn't have read lock on it. Re-find our position by looking - * for the TID we previously returned. (Because we hold a pin on the - * primary bucket page, no deletions or splits could have occurred; - * therefore we can expect that the TID still exists in the current - * index page, at an offset >= where we were.) - */ - OffsetNumber maxoffnum; - - buf = so->hashso_curbuf; - Assert(BufferIsValid(buf)); - page = BufferGetPage(buf); - - /* - * We don't need test for old snapshot here as the current buffer is - * pinned, so vacuum can't clean the page. 
- */ - maxoffnum = PageGetMaxOffsetNumber(page); - for (offnum = ItemPointerGetOffsetNumber(current); - offnum <= maxoffnum; - offnum = OffsetNumberNext(offnum)) - { - IndexTuple itup; - - itup = (IndexTuple) PageGetItem(page, PageGetItemId(page, offnum)); - if (ItemPointerEquals(&(so->hashso_heappos), &(itup->t_tid))) -break; - } - if (offnum > maxoffnum) - elog(ERROR, "failed to re-find scan position within index \"%s\"", - RelationGetRelationName(rel)); - ItemPointerSetOffsetNumber(current, offnum); - - /* * Check to see if we should kill the previously-fetched tuple. */ if (scan->kill_prior_tup
Re: [HACKERS] PATCH: pageinspect / add page_checksum and bt_page_items(bytea)
Hi Tomas,

On Fri, Mar 31, 2017 at 11:05 PM, Tomas Vondra wrote:
> On 03/31/2017 06:01 PM, Ashutosh Sharma wrote:
>> It seems that the latest patch (v4) shared by Tomas upthread is empty.
>> If I am not mistaken, please share the correct patch. Thanks.
>
> OMG, this is the second time I managed to generate an empty patch. I
> really need to learn not to do that ..
>
> Attached is v5 of the patch, hopefully correct this time ;-)

Yes, it is correct. I have spent some time reviewing your patch today, and the review comments are as follows.

1. It would be good to have a blank line in the following comment describing bt_page_print_tuples:

+/*
+ * bt_page_print_tuples
+ * form a tuple describing index tuple at a given offset
+ */

Something like this,

+/* -
+ * bt_page_print_tuples()
+ *
+ * Form a tuple describing index tuple at a given offset
+ * -
+ */

Please note that, if you do not agree with my suggestion, you may ignore it :)

2. I think the following comments for bt_page_items need to be moved to the right place in the code.

/*-
 * bt_page_items()
 *
 * Get IndexTupleData set in a btree page
 *
 * Usage: SELECT * FROM bt_page_items('t1_pkey', 1);
 *-
 */

/*
 * cross-call data structure for SRF
 */
struct user_args
{
	Page		page;
	OffsetNumber offset;
};

With your patch, the above structure definition is followed by the definition of bt_page_print_tuples(), not bt_page_items(); hence you may need to put the comments for bt_page_items just above its definition.

3. I am seeing a server crash when running the SQL statement 'SELECT * FROM bt_page_items('bt_i4_index', 1);'. Here is the backtrace. I think this crash is happening because in bt_page_items, without reading the opaque region of the page, you have added this if statement:

if (P_ISMETA(opaque))
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("block is a meta page")));

Please pull the above if statement down below the following line to get rid of this crash.
opaque = (BTPageOpaque) PageGetSpecialPointer(uargs->page);

OR, the best thing to do would be to retain the earlier if statement for checking the meta page, because that way you can avoid reading a buffer unnecessarily.

#0 0x7f592975030c in bt_page_items (fcinfo=0x7fffe5fdb860) at btreefuncs.c:352
352 if (P_ISMETA(opaque))
Missing separate debuginfos, use: debuginfo-install glibc-2.17-78.el7.x86_64
(gdb) bt
#0 0x7f592975030c in bt_page_items (fcinfo=0x7fffe5fdb860) at btreefuncs.c:352
#1 0x006797e0 in ExecMakeTableFunctionResult (setexpr=0x21269f8, econtext=0x2126738, argContext=0x2136c98, expectedDesc=0x2155178, randomAccess=0 '\000') at execSRF.c:230
#2 0x00689dda in FunctionNext (node=0x2126628) at nodeFunctionscan.c:94
#3 0x00678f0e in ExecScanFetch (node=0x2126628, accessMtd=0x689d23 <FunctionNext>, recheckMtd=0x68a108 <FunctionRecheck>) at execScan.c:95
#4 0x00678f7d in ExecScan (node=0x2126628, accessMtd=0x689d23 <FunctionNext>, recheckMtd=0x68a108 <FunctionRecheck>) at execScan.c:143

4. If you agree with the reasoning in comment #3, please remove the following if statement from bt_page_items():

if (P_ISMETA(opaque))
	ereport(ERROR,
			(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
			 errmsg("block is a meta page")));

Apart from the above comments, your patch looks good to me. I have also marked this patch as 'Waiting on Author' in the commitfest. Thanks.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com