Re: [HACKERS] POC: Cache data in GetSnapshotData()

2017-09-20 Thread Mithun Cy
On Thu, Sep 21, 2017 at 7:25 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Wed, Sep 20, 2017 at 9:46 PM, Mithun Cy <mithun...@enterprisedb.com> wrote:
>> I think there is some confusion: the above results are for pgbench simple-update
>> (-N) tests, where the cached snapshot gets invalidated. I ran this to check
>> whether there is any regression due to frequent cache invalidation and did not
>> find any. The target test for this patch is the read-only case [1], where we
>> can see a performance improvement as high as 39% (at 256 clients) on
>> Cthulhu (an 8-socket NUMA machine with 64 CPU cores). At 64 clients (= CPU
>> cores) we have a 5% improvement, and at 128 clients (= 2 * CPU cores, i.e.
>> hyperthreads) we have a 17% improvement.
>>
>> Clients    BASE CODE        With patch        %Imp
>> 64         452475.929144    476195.952736      5.2422730281
>> 128        556207.727932    653256.029012     17.4482115595
>> 256        494336.282804    691614.000463     39.9075941867
>
> Oh, you're right.  I was confused.
>
> But now I'm confused about something else: if you're seeing a clear
> gain at higher-client counts, why is Jesper Pederson not seeing the
> same thing?  Do you have results for a 2-socket machine?  Maybe this
> only helps with >2 sockets.

My current tests on scylla (a 2-socket machine with 28 CPU cores) show
no improvement at all, similar to Jesper's results. But with the same
tests on power2 (4 sockets) and Cthulhu (8 sockets) we can see
improvements.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] POC: Cache data in GetSnapshotData()

2017-09-20 Thread Mithun Cy
On Wed, Sep 20, 2017 at 11:45 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Tue, Aug 29, 2017 at 1:57 AM, Mithun Cy <mithun...@enterprisedb.com>
wrote:
>> All TPS are median of 3 runs
>> Clients   TPS-With Patch 05   TPS-Base        %Diff
>> 1           752.461117          755.186777    -0.3%
>> 64        32171.296537        31202.153576    +3.1%
>> 128       41059.660769        40061.929658    +2.49%
>>
>> I will do some profiling to find out why this case does not cost us
>> any performance despite the caching overhead.
>
> So, this shows only a 2.49% improvement at 128 clients but in the
> earlier message you reported a 39% speedup at 256 clients.  Is that
> really correct?  There's basically no improvement up to threads = 2 x
> CPU cores, and then after that it starts to improve rapidly?  What
> happens at intermediate points, like 160, 192, 224 clients?

I think there is some confusion: the above results are for pgbench simple-update
(-N) tests, where the cached snapshot gets invalidated. I ran this to check
whether there is any regression due to frequent cache invalidation and did not
find any. The target test for this patch is the read-only case [1], where we
can see a performance improvement as high as 39% (at 256 clients) on
Cthulhu (an 8-socket NUMA machine with 64 CPU cores). At 64 clients (= CPU
cores) we have a 5% improvement, and at 128 clients (= 2 * CPU cores, i.e.
hyperthreads) we have a 17% improvement.

Clients    BASE CODE        With patch        %Imp
64         452475.929144    476195.952736      5.2422730281
128        556207.727932    653256.029012     17.4482115595
256        494336.282804    691614.000463     39.9075941867


[1] cache_the_snapshot_performance.ods
<https://www.postgresql.org/message-id/CAD__Oui=q++3er0pu2hanjqe7ngtohwxypxxjqvsjqns91l...@mail.gmail.com>
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] SendRowDescriptionMessage() is slow for queries with a lot of columns

2017-09-20 Thread Mithun Cy
On Mon, Sep 18, 2017 at 1:38 PM, Mithun Cy <mithun...@enterprisedb.com> wrote:
> On Sat, Sep 16, 2017 at 3:03 AM, Andres Freund <and...@anarazel.de> wrote:
> So I think the performance gain is visible. We save a good number of
> execution cycles in SendRowDescriptionMessage (my callgrind report
> confirmed the same) when we project a large number of columns in the
> query with these new patches.

I have tested the patch; it looks good to me, and I can see the
performance improvement grow as the number of columns projected in the
query increases. There are some cosmetic issues (pgindent issues + an
end-of-file CR) in the patch; if those are considered valid issues,
they need changes. The rest looks okay to me.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] POC: Cache data in GetSnapshotData()

2017-09-14 Thread Mithun Cy
On Wed, Sep 13, 2017 at 7:24 PM, Jesper Pedersen wrote:
> I have done a run with this patch on a 2S/28C/56T/256Gb w/ 2 x RAID10 SSD
> machine.
>
> Both for -M prepared, and -M prepared -S I'm not seeing any improvements (1
> to 375 clients); e.g. +-1%.

My test was done on an 8-socket NUMA Intel machine, where I could
clearly see improvements, as posted before.
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz
Stepping:              2
CPU MHz:               1064.000
BogoMIPS:              4266.62
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              24576K
NUMA node0 CPU(s):     0,65-71,96-103
NUMA node1 CPU(s):     72-79,104-111
NUMA node2 CPU(s):     80-87,112-119
NUMA node3 CPU(s):     88-95,120-127
NUMA node4 CPU(s):     1-8,33-40
NUMA node5 CPU(s):     9-16,41-48
NUMA node6 CPU(s):     17-24,49-56
NUMA node7 CPU(s):     25-32,57-64

Let me recheck whether the improvement is due to NUMA or to cache
sizes. Currently the above machine is not available to me; it will take
2 more days to get it again.

> Although the -M prepared -S case should improve, I'm not sure that the extra
> overhead in the -M prepared case is worth the added code complexity.

Each tpcb-like (-M prepared) transaction in pgbench has 3 updates, 1
insert, and 1 select statement, so there are many more GetSnapshotData
calls than transaction ends (which is when the cached snapshot gets
invalidated). So I think whatever we cache has a higher chance of being
reused before it is invalidated in -M prepared. But in the worst case,
quick write transactions invalidate the cached snapshot before it is
ever reused, and the caching becomes pure overhead.
As Andres suggested previously, I need a mechanism to identify and
avoid caching in such scenarios. I do not have the right solution for
this at present, but one thing we can try is to skip caching whenever
we see an exclusive request in the wait queue of ProcArrayLock, as in
the sketch below.
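
A minimal, hedged sketch of that check (not part of the posted patch;
note that LWLockHasWaiters() reports waiters of any mode, not only
exclusive ones, so it is only an approximation of "a writer is queued"):

#include "postgres.h"
#include "storage/lwlock.h"

/*
 * Decide whether publishing a freshly built snapshot into the shared
 * cache is worthwhile.  If ProcArrayLock already has waiters, a queued
 * (likely exclusive) waiter is probably about to end a transaction and
 * invalidate the cache, so skip the extra copy.
 */
static bool
SnapshotCachingWorthwhile(void)
{
	return !LWLockHasWaiters(ProcArrayLock);
}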

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] POC: Cache data in GetSnapshotData()

2017-08-29 Thread Mithun Cy
On Thu, Aug 3, 2017 at 2:16 AM, Andres Freund wrote:
> I think this patch should have a "cover letter" explaining what it's
> trying to achieve, how it does so and why it's safe/correct.

Cache the SnapshotData for reuse:
===
In one of our perf analyses using the perf tool, GetSnapshotData showed
up as taking a very high percentage of CPU cycles on a read-only
workload with a very high number of concurrent connections (>= 64).

Machine : cthulhu 8 node machine.
--
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz
Stepping:              2
CPU MHz:               2128.000
BogoMIPS:              4266.63
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              24576K
NUMA node0 CPU(s):     0,65-71,96-103
NUMA node1 CPU(s):     72-79,104-111
NUMA node2 CPU(s):     80-87,112-119
NUMA node3 CPU(s):     88-95,120-127
NUMA node4 CPU(s):     1-8,33-40
NUMA node5 CPU(s):     9-16,41-48
NUMA node6 CPU(s):     17-24,49-56
NUMA node7 CPU(s):     25-32,57-64

On further analysis, it appears this is mainly due to cache misses
incurred while iterating through the large proc array to compute the
SnapshotData. Also, on a read-only (read-heavy) workload the
SnapshotData mostly remains the same, since transactions rarely commit
or roll back. Taking advantage of this, we can save the previously
computed SnapshotData in shared memory, and in GetSnapshotData we can
use the saved SnapshotData instead of recomputing it while it is still
valid. A saved SnapshotData remains valid until the next transaction
end after it was computed.
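
A rough sketch of the fast path in GetSnapshotData() (hedged: only
is_snapshot_valid comes from the diff hunk quoted at the end of this
mail, the cached_snapshot_* fields are assumed names, and the "store"
side synchronization is glossed over here):

	LWLockAcquire(ProcArrayLock, LW_SHARED);

	if (ProcGlobal->is_snapshot_valid)
	{
		/* Reuse the shared copy instead of scanning the whole proc array. */
		snapshot->xmin = ProcGlobal->cached_snapshot_xmin;	/* assumed field */
		snapshot->xmax = ProcGlobal->cached_snapshot_xmax;	/* assumed field */
		snapshot->xcnt = ProcGlobal->cached_snapshot_xcnt;	/* assumed field */
		memcpy(snapshot->xip, ProcGlobal->cached_snapshot_xip,
			   snapshot->xcnt * sizeof(TransactionId));
	}
	else
	{
		/*
		 * Scan the proc array as before, then copy the result back into
		 * shared memory and set is_snapshot_valid, so other backends can
		 * reuse it.  Any transaction end (see the ProcArrayRemove() hunk
		 * below) clears the flag again.
		 */
	}

	LWLockRelease(ProcArrayLock);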

[Perf cache-misses analysis of Base for 128 pgbench readonly clients]
==
Samples: 421K of event 'cache-misses', Event count (approx.): 160069274
Overhead  Command   Shared Object  Symbol
  18.63%  postgres  postgres       [.] GetSnapshotData
  11.98%  postgres  postgres       [.] _bt_compare
  10.20%  postgres  postgres       [.] PinBuffer
   8.63%  postgres  postgres       [.] LWLockAttemptLock
   6.50%  postgres  postgres       [.] LWLockRelease


[Perf cache-misses analysis with Patch for 128 pgbench readonly clients]

Samples: 357K of event 'cache-misses', Event count (approx.): 102380622
Overhead  Command   Shared Object  Symbol
   0.27%  postgres  postgres       [.] GetSnapshotData

With the patch, the cache-misses events have reduced significantly.

Test Setting:
=
Server configuration:
./postgres -c shared_buffers=8GB -N 300 -c min_wal_size=15GB -c
max_wal_size=20GB -c checkpoint_timeout=900 -c
maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 -c
wal_buffers=256MB &

pgbench configuration:
scale_factor = 300
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared -S  postgres

The machine has 64 cores, and with this patch I can see the server
start to improve after 64 clients. I have tested up to 256 clients,
which shows a performance improvement of nearly 39% at the maximum [1].
This is the best case for the patch, where the once-computed
SnapshotData keeps getting reused.

The worst case seems to be small, quick write transactions, which
frequently invalidate the cached SnapshotData before it is reused by
any subsequent GetSnapshotData call. As of now, I have tested
simple-update runs (pgbench -M prepared -N) on the same machine with
the above server configuration; I do not see much change in the TPS
numbers.

All TPS are median of 3 runs
Clients   TPS-With Patch 05   TPS-Base        %Diff
1           752.461117          755.186777    -0.3%
64        32171.296537        31202.153576    +3.1%
128       41059.660769        40061.929658    +2.49%

I will do some profiling to find out why this case does not cost us any
performance despite the caching overhead.

>> diff --git a/src/backend/storage/ipc/procarray.c 
>> b/src/backend/storage/ipc/procarray.c
>> index a7e8cf2..4d7debe 100644
>> --- a/src/backend/storage/ipc/procarray.c
>> +++ b/src/backend/storage/ipc/procarray.c
>> @@ -366,11 +366,13 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
>>   (arrayP->numProcs - index - 1) * 
>> sizeof(int));
>>   arrayP->pgprocnos[arrayP->numProcs - 1] = -1;   /* for 
>> debugging */
>>   arrayP->numProcs--;
>> + ProcGlobal->is_snapshot_valid = false;
>>   LWLockRelease(ProcArrayLock);
>>   return;
>>

Re: [HACKERS] POC: Cache data in GetSnapshotData()

2017-08-29 Thread Mithun Cy
Hi all, please ignore this mail; it is incomplete and I have to add
more information. Sorry, I accidentally pressed the send button while
replying.


On Tue, Aug 29, 2017 at 11:27 AM, Mithun Cy <mithun...@enterprisedb.com> wrote:
> Cache the SnapshotData for reuse:
> ===
> In one of our perf analyses using the perf tool, GetSnapshotData showed
> up as taking a very high percentage of CPU cycles on a read-only
> workload with a very high number of concurrent connections (>= 64).
>
> Machine : cthulhu 8 node machine.
> --
> Architecture:          x86_64
> CPU op-mode(s):        32-bit, 64-bit
> Byte Order:            Little Endian
> CPU(s):                128
> On-line CPU(s) list:   0-127
> Thread(s) per core:    2
> Core(s) per socket:    8
> Socket(s):             8
> NUMA node(s):          8
> Vendor ID:             GenuineIntel
> CPU family:            6
> Model:                 47
> Model name:            Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz
> Stepping:              2
> CPU MHz:               2128.000
> BogoMIPS:              4266.63
> Virtualization:        VT-x
> L1d cache:             32K
> L1i cache:             32K
> L2 cache:              256K
> L3 cache:              24576K
> NUMA node0 CPU(s):     0,65-71,96-103
> NUMA node1 CPU(s):     72-79,104-111
> NUMA node2 CPU(s):     80-87,112-119
> NUMA node3 CPU(s):     88-95,120-127
> NUMA node4 CPU(s):     1-8,33-40
> NUMA node5 CPU(s):     9-16,41-48
> NUMA node6 CPU(s):     17-24,49-56
> NUMA node7 CPU(s):     25-32,57-64
>
> Perf CPU cycle 128 clients.
>
> On further analysis, it appears this is mainly due to cache misses
> incurred while iterating through the large proc array to compute the
> SnapshotData. Also, on a read-only (read-heavy) workload the
> SnapshotData mostly remains the same, since transactions rarely commit
> or roll back. Taking advantage of this, we can save the previously
> computed SnapshotData in shared memory, and in GetSnapshotData we can
> use the saved SnapshotData instead of recomputing it while it is still
> valid. A saved SnapshotData remains valid until the next transaction
> ends.
>
> [Perf analysis Base]
>
> Samples: 421K of event 'cache-misses', Event count (approx.): 160069274
> Overhead  Command   Shared Object  Symbol
>   18.63%  postgres  postgres       [.] GetSnapshotData
>   11.98%  postgres  postgres       [.] _bt_compare
>   10.20%  postgres  postgres       [.] PinBuffer
>    8.63%  postgres  postgres       [.] LWLockAttemptLock
>    6.50%  postgres  postgres       [.] LWLockRelease
>
>
> [Perf analysis with Patch]
>
>
> Server configuration:
> ./postgres -c shared_buffers=8GB -N 300 -c min_wal_size=15GB -c
> max_wal_size=20GB -c checkpoint_timeout=900 -c
> maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 -c
> wal_buffers=256MB &
>
> pgbench configuration:
> scale_factor = 300
> ./pgbench -c $threads -j $threads -T $time_for_reading -M prepared -S  
> postgres
>
> The machine has 64 cores, and with this patch I can see the server
> start to improve after 64 clients. I have tested up to 256 clients,
> which shows a performance improvement of nearly 39% at the maximum [1].
> This is the best case for the patch, where the once-computed
> SnapshotData keeps getting reused.
>
> The worst case seems to be small, quick write transactions, which
> frequently invalidate the cached SnapshotData before it is reused by
> any subsequent GetSnapshotData call. As of now, I have tested
> simple-update runs (pgbench -M prepared -N) on the same machine with
> the above server configuration; I do not see much change in the TPS
> numbers.
>
> All TPS are median of 3 runs
> Clients   TPS-With Patch 05   TPS-Base        %Diff
> 1           752.461117          755.186777    -0.3%
> 64        32171.296537        31202.153576    +3.1%
> 128       41059.660769        40061.929658    +2.49%
>
> I will do some profiling to find out why this case does not cost us
> any performance despite the caching overhead.
>
>
> []
>
> On Thu, Aug 3, 2017 at 2:16 AM, Andres Freund <and...@anarazel.de> wrote:
>> Hi,
>>
>> I think this patch should have a "cover letter" explaining what it's
>> trying to achieve, how it does so and why it's safe/correct.  I think
>> it'd also be good to try to show some of the worst cases of this patch
>> (i.e. where the caching only adds overhead, but never gets used), and
>> some of the best cases (where the cache gets used quite often, and
>> reduces contention significantly).
>>
>> I wonder if there's a way to avoid copying the snapshot into the 

Re: [HACKERS] POC: Cache data in GetSnapshotData()

2017-08-28 Thread Mithun Cy
Cache the SnapshotData for reuse:
===
In one of our perf analyses using the perf tool, GetSnapshotData showed
up as taking a very high percentage of CPU cycles on a read-only
workload with a very high number of concurrent connections (>= 64).

Machine : cthulhu 8 node machine.
--
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz
Stepping:              2
CPU MHz:               2128.000
BogoMIPS:              4266.63
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              24576K
NUMA node0 CPU(s):     0,65-71,96-103
NUMA node1 CPU(s):     72-79,104-111
NUMA node2 CPU(s):     80-87,112-119
NUMA node3 CPU(s):     88-95,120-127
NUMA node4 CPU(s):     1-8,33-40
NUMA node5 CPU(s):     9-16,41-48
NUMA node6 CPU(s):     17-24,49-56
NUMA node7 CPU(s):     25-32,57-64

Perf CPU cycle 128 clients.

On further analysis, it appears this is mainly due to cache misses
incurred while iterating through the large proc array to compute the
SnapshotData. Also, on a read-only (read-heavy) workload the
SnapshotData mostly remains the same, since transactions rarely commit
or roll back. Taking advantage of this, we can save the previously
computed SnapshotData in shared memory, and in GetSnapshotData we can
use the saved SnapshotData instead of recomputing it while it is still
valid. A saved SnapshotData remains valid until the next transaction
ends.

[Perf analysis Base]

Samples: 421K of event 'cache-misses', Event count (approx.): 160069274
Overhead  Command   Shared Object  Symbol
  18.63%  postgres  postgres       [.] GetSnapshotData
  11.98%  postgres  postgres       [.] _bt_compare
  10.20%  postgres  postgres       [.] PinBuffer
   8.63%  postgres  postgres       [.] LWLockAttemptLock
   6.50%  postgres  postgres       [.] LWLockRelease


[Perf analysis with Patch]


Server configuration:
./postgres -c shared_buffers=8GB -N 300 -c min_wal_size=15GB -c
max_wal_size=20GB -c checkpoint_timeout=900 -c
maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 -c
wal_buffers=256MB &

pgbench configuration:
scale_factor = 300
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared -S  postgres

The machine has 64 cores, and with this patch I can see the server
start to improve after 64 clients. I have tested up to 256 clients,
which shows a performance improvement of nearly 39% at the maximum [1].
This is the best case for the patch, where the once-computed
SnapshotData keeps getting reused.

The worst case seems to be small, quick write transactions, which
frequently invalidate the cached SnapshotData before it is reused by
any subsequent GetSnapshotData call. As of now, I have tested
simple-update runs (pgbench -M prepared -N) on the same machine with
the above server configuration; I do not see much change in the TPS
numbers.

All TPS are median of 3 runs
Clients   TPS-With Patch 05   TPS-Base        %Diff
1           752.461117          755.186777    -0.3%
64        32171.296537        31202.153576    +3.1%
128       41059.660769        40061.929658    +2.49%

I will do some profiling to find out why this case does not cost us any
performance despite the caching overhead.


[]

On Thu, Aug 3, 2017 at 2:16 AM, Andres Freund <and...@anarazel.de> wrote:
> Hi,
>
> I think this patch should have a "cover letter" explaining what it's
> trying to achieve, how it does so and why it's safe/correct.  I think
> it'd also be good to try to show some of the worst cases of this patch
> (i.e. where the caching only adds overhead, but never gets used), and
> some of the best cases (where the cache gets used quite often, and
> reduces contention significantly).
>
> I wonder if there's a way to avoid copying the snapshot into the cache
> in situations where we're barely ever going to need it. But right now
> the only way I can think of is to look at the length of the
> ProcArrayLock wait queue and count the exclusive locks therein - which'd
> be expensive and intrusive...
>
>
> On 2017-08-02 15:56:21 +0530, Mithun Cy wrote:
>> diff --git a/src/backend/storage/ipc/procarray.c 
>> b/src/backend/storage/ipc/procarray.c
>> index a7e8cf2..4d7debe 100644
>> --- a/src/backend/storage/ipc/procarray.c
>> +++ b/src/backend/storage/ipc/procarray.c
>> @@ -366,11 +366,13 @@ ProcArrayRemove(PGPROC *proc, TransactionId latestXid)
>>   (arrayP->numProcs - index

Re: [HACKERS] NoMovementScanDirection in heapgettup() and heapgettup_pagemode()

2017-08-28 Thread Mithun Cy
On Mon, Aug 28, 2017 at 5:50 PM, Tom Lane  wrote:
> I think that's probably dead code given that ExecutorRun short-circuits
> everything for NoMovementScanDirection.  There is some use of
> NoMovementScanDirection for indexscans, to denote an unordered index,
> but likely that could be got rid of too.  That would leave us with
> NoMovementScanDirection having no other use except as a do-nothing
> flag for ExecutorRun.
>
> regards, tom lane

Thanks, Tom, yes it seems to be a dead code.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




[HACKERS] NoMovementScanDirection in heapgettup() and heapgettup_pagemode()

2017-08-28 Thread Mithun Cy
Hi all,
I was trying to study the NoMovementScanDirection part of heapgettup()
and heapgettup_pagemode(). If I am right, there is no test in the test
suite that hits this code; I ran make check-world and could not hit it.
Also, the coverage report at
https://coverage.postgresql.org/src/backend/access/heap/heapam.c.gcov.html
shows this part of the code is not hit. Can somebody please help me
understand this part of the code and how to test it?

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-08-21 Thread Mithun Cy
On Fri, Aug 18, 2017 at 9:43 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Aug 18, 2017 at 2:23 AM, Mithun Cy <mithun...@enterprisedb.com> wrote:
> Ah, good catch.  While I agree that there is probably no great harm
> from skipping the lock here, I think it would be better to just avoid
> throwing an error while we hold the lock.  I think apw_dump_now() is
> the only place where that could happen, and in the attached version,
> I've fixed it so it doesn't do that any more.  Independent of the
> correctness issue, I think the code is easier to read this way.
>
> I also realize that it's not formally sufficient to use
> PG_TRY()/PG_CATCH() here, because a FATAL would leave us in a bad
> state.  Changed to PG_ENSURE_ERROR_CLEANUP().
>
>> 2) I also found one issue, which was my own mistake in my previous patch 19:
>> in "apw_dump_now" I missed calling FreeFile() on the first write error,
>> whereas in the other cases I am already calling it.
>> ret = fprintf(file, "<<" INT64_FORMAT ">>\n", num_blocks);
>> + if (ret < 0)
>> + {
>> + int save_errno = errno;
>> +
>> + unlink(transient_dump_file_path);
>
> Changed in the attached version.

Thanks for the patch. I have tested the above fix, and it now works as
described. From my testing the patch looks good; I did not find any
other issues.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-08-18 Thread Mithun Cy
On Wed, Aug 16, 2017 at 2:08 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Fri, Jul 14, 2017 at 8:17 AM, Mithun Cy <mithun...@enterprisedb.com> wrote:
>> [ new patch ]
> It's quite possible that in making all of these changes I've
> introduced some bugs, so I think this needs some testing and review.
> It's also possible that some of the changes that I made are actually
> not improvements and should be reverted, but it's always hard to tell
> that about your own code.  Anyway, please see the attached version.

Sorry, Robert, I was on vacation so I could not pick this up
immediately. I have been testing and reviewing the patch, and I found
the following issues.

1. Hang in apw_detach_shmem.
+/*
+ * Clear our PID from autoprewarm shared state.
+ */
+static void
+apw_detach_shmem(int code, Datum arg)
+{
+   LWLockAcquire(&apw_state->lock, LW_EXCLUSIVE);
+   if (apw_state->pid_using_dumpfile == MyProcPid)
+   apw_state->pid_using_dumpfile = InvalidPid;
+   if (apw_state->bgworker_pid == MyProcPid)
+   apw_state->bgworker_pid = InvalidPid;
+   LWLockRelease(&apw_state->lock);
+}

The reason is that we might already be holding apw_state->lock when we
error out and jump to apw_detach_shmem(), so we should not try to take
the lock again. For example, in autoprewarm_dump_now(), apw_dump_now()
will error out under the lock if the bgworker is already using the
dump file.

===
+autoprewarm_dump_now(PG_FUNCTION_ARGS)
+{
+int64 num_blocks;
+
+apw_init_shmem();
+
+PG_TRY();
+{
+ num_blocks = apw_dump_now(false, true);
+}
+PG_CATCH();
+{
+ apw_detach_shmem(0, 0);
+ PG_RE_THROW();
+}
+PG_END_TRY();
+
+PG_RETURN_INT64(num_blocks);
+}

===
+ LWLockAcquire(&apw_state->lock, LW_EXCLUSIVE);
+ if (apw_state->pid_using_dumpfile == InvalidPid)
+ apw_state->pid_using_dumpfile = MyProcPid;
+ else
+ {
+ if (!is_bgworker)
+ ereport(ERROR,
+(errmsg("could not perform block dump because
dump file is being used by PID %d",
+ apw_state->pid_using_dumpfile)));

This attempt to take the lock again hangs the autoprewarm module. I
think there is no need to take the lock while we reset those variables,
as we reset them only if we set them ourselves.
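
The presumed fix is then just the quoted callback minus the lock calls;
a hedged sketch:

static void
apw_detach_shmem(int code, Datum arg)
{
	/*
	 * No LWLockAcquire here: this callback can run while apw_state->lock
	 * is still held by the erroring code path, and we only clear values
	 * we set ourselves.
	 */
	if (apw_state->pid_using_dumpfile == MyProcPid)
		apw_state->pid_using_dumpfile = InvalidPid;
	if (apw_state->bgworker_pid == MyProcPid)
		apw_state->bgworker_pid = InvalidPid;
}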

2) I also found one issue, which was my own mistake in my previous patch 19:
in "apw_dump_now" I missed calling FreeFile() on the first write error,
whereas in the other cases I am already calling it.
ret = fprintf(file, "<<" INT64_FORMAT ">>\n", num_blocks);
+ if (ret < 0)
+ {
+ int save_errno = errno;
+
+ unlink(transient_dump_file_path);

Other than this, the patch works as it did previously. If you think my
presumed fix (not taking the lock) for the hang issue is right, I will
produce a patch for it.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] POC: Cache data in GetSnapshotData()

2017-08-03 Thread Mithun Cy
On 3 Aug 2017 2:16 am, "Andres Freund"  wrote:

Hi Andres, thanks for the detailed review. I agree with all of the
comments. I am going on vacation; once I come back, I will address them
and submit a new patch.


Re: [HACKERS] POC: Cache data in GetSnapshotData()

2017-08-02 Thread Mithun Cy
On Wed, Aug 2, 2017 at 3:42 PM, Mithun Cy <mithun...@enterprisedb.com> wrote:
Sorry, there was an unnecessary header included in proc.c which should
be removed; attaching the corrected patch.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Cache_data_in_GetSnapshotData_POC_04.patch
Description: Binary data



Re: [HACKERS] POC: Cache data in GetSnapshotData()

2017-08-02 Thread Mithun Cy
I have made a few more changes in the new patch.

1. Ran pgindent.
2. Instead of an atomic state variable to ensure that only one process
caches the snapshot in shared memory, I have used a conditional try on
an LWLock. With this, the code is smaller and more reliable.
3. Performance benchmarking

Machine - cthulhu
==
[mithun.cy@cthulhu bin]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                128
On-line CPU(s) list:   0-127
Thread(s) per core:    2
Core(s) per socket:    8
Socket(s):             8
NUMA node(s):          8
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 47
Model name:            Intel(R) Xeon(R) CPU E7- 8830  @ 2.13GHz
Stepping:              2
CPU MHz:               1197.000
BogoMIPS:              4266.63
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              24576K
NUMA node0 CPU(s):     0,65-71,96-103
NUMA node1 CPU(s):     72-79,104-111
NUMA node2 CPU(s):     80-87,112-119
NUMA node3 CPU(s):     88-95,120-127
NUMA node4 CPU(s):     1-8,33-40
NUMA node5 CPU(s):     9-16,41-48
NUMA node6 CPU(s):     17-24,49-56
NUMA node7 CPU(s):     25-32,57-64

Server configuration:
./postgres -c shared_buffers=8GB -N 300 -c min_wal_size=15GB -c
max_wal_size=20GB -c checkpoint_timeout=900 -c
maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 -c
wal_buffers=256MB &

pgbench configuration:
scale_factor = 300
./pgbench -c $threads -j $threads -T $time_for_reading -M prepared -S  postgres

The machine has 64 cores, and with this patch I can see the server
start to improve after 64 clients. I have tested up to 256 clients,
which shows a performance improvement of nearly 39% at the maximum.

Alternatively, I thought that instead of storing the snapshot in shared
memory, each backend could hold on to its previously computed snapshot
until the next commit/rollback happens in the system. We can associate
a global counter value with the snapshot whenever it is computed; upon
any transaction end, the global counter is incremented. So when a
process wants a new snapshot, it can compare these two values to check
whether it can reuse the previously computed snapshot. This makes the
code significantly simpler. With the first approach, one process has to
compute and store the snapshot at every transaction end and the others
can reuse the cached snapshot; with the second approach, every process
has to recompute its own snapshot. So I am staying with the same
approach.
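
For reference, a rough sketch of that rejected alternative (hedged: the
xactCompletionCounter field and the cached_local_* variables are
assumed names; pg_atomic_* are the standard atomics from
port/atomics.h):

	/* At every transaction end, e.g. in ProcArrayEndTransaction(): */
	pg_atomic_fetch_add_u64(&ProcGlobal->xactCompletionCounter, 1);

	/* In GetSnapshotData(), per backend: */
	static uint64		last_seen_counter = 0;
	static SnapshotData cached_local_snapshot;
	static bool			cached_local_valid = false;

	uint64		counter = pg_atomic_read_u64(&ProcGlobal->xactCompletionCounter);

	if (cached_local_valid && counter == last_seen_counter)
		return &cached_local_snapshot;	/* nothing committed/aborted since */

	/* ...otherwise scan the proc array as usual, then remember: */
	last_seen_counter = counter;
	cached_local_valid = true;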

On Mon, Jul 10, 2017 at 10:13 AM, Mithun Cy <mithun...@enterprisedb.com> wrote:
> On Fri, Apr 8, 2016 at 12:13 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> I think that we really shouldn't do anything about this patch until
>> after the CLOG stuff is settled, which it isn't yet.  So I'm going to
>> mark this Returned with Feedback; let's reconsider it for 9.7.
>
> I am updating a rebased patch. I have tried to benchmark again and
> could see a good improvement in the pgbench read-only case at very
> high client counts on our cthulhu (8 nodes, 128 hyperthreads) and
> power2 (4 nodes, 192 hyperthreads) machines. There is some issue with
> the base-code benchmarking, which is somehow not consistent; once I
> figure out what the issue is, I will post an update.
>
> --
> Thanks and Regards
> Mithun C Y
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


cache_the_snapshot_performance.ods
Description: application/vnd.oasis.opendocument.spreadsheet


Cache_data_in_GetSnapshotData_POC_03.patch
Description: Binary data



Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-07-14 Thread Mithun Cy
On Thu, Jul 6, 2017 at 10:52 AM, Amit Kapila  wrote:
>>> 3.
>> -- I do not think that is true; pages of an unlogged table are also
>> read into buffers for read-only purposes. So if we miss dumping them
>> while we shut down, then the previous dump should be used.
>>
>
> I am not able to understand what you want to say. Unlogged tables
> should be empty in case of crash recovery.  Also, we never flush
> unlogged buffers except for shutdown checkpoint, refer BufferAlloc and
> in particular below comment:
>
> * Make sure BM_PERMANENT is set for buffers that must be written at every
> * checkpoint.  Unlogged buffers only need to be written at shutdown
> * checkpoints, except for their "init" forks, which need to be treated
> * just like permanent relations.

+ if (buf_state & BM_TAG_VALID &&
+ ((buf_state & BM_PERMANENT) || dump_unlogged))
I have changed it: now only the final call to dump_now() during
shutdown, or a call through autoprewarm_dump_now(), dumps the block
info of unlogged tables (roughly as in the sketch below).
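
For context, a hedged sketch of how that condition sits in the
dump_now() buffer-pool scan (locking shown, bookkeeping trimmed;
dump_unlogged is the new flag):

	for (i = 0; i < NBuffers; i++)
	{
		BufferDesc *bufHdr = GetBufferDescriptor(i);
		uint32		buf_state = LockBufHdr(bufHdr);

		/*
		 * Record buffers of unlogged relations only for the shutdown dump
		 * (or an explicit autoprewarm_dump_now() call); they are truncated
		 * after a crash, so dumping them periodically would be useless.
		 */
		if (buf_state & BM_TAG_VALID &&
			((buf_state & BM_PERMANENT) || dump_unlogged))
		{
			/* copy bufHdr->tag into block_info_array[num_blocks] */
			++num_blocks;
		}

		UnlockBufHdr(bufHdr, buf_state);
	}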

>>
>> -- autoprewarm_dump_now() is called directly from a backend. In that
>> case, we have to handle the signals registered for the backend in
>> dump_now(). For the bgworker, dump_block_info_periodically(), the
>> caller of dump_now(), handles SIGTERM and SIGUSR1, which are the ones
>> we are interested in.
>>
>
> Okay, but what about signal handler for c
> (procsignal_sigusr1_handler).  Have you verified that it will never
> set the InterruptPending flag?

Okay, now CHECK_FOR_INTERRUPTS() is called in both cases.

>
>>>
>>> 7.
>>> +dump_now(bool is_bgworker)
>>> {
>>> ..
>>> + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR);
>>>
>>> ..
>>> }
>>>
>>> How will transient_dump_file_path be unlinked in the case of error in
>>> durable_rename?  I think you need to use PG_TRY..PG_CATCH to ensure
>>> same?
>>
>> -- If durable_rename() is failing, then basic functionality of
>> autoprewarm is failing, so I want it to be an ERROR. I do not want to
>> remove the temp file, as we always truncate it before reusing it
>> again. So I think there is no need to catch every ERROR in dump_now()
>> just to remove the temp file.
>>
> I am not getting your argument here, do you mean to say that if
> writing to a transient file is failed then we should remove the
> transient file but if the rename is failed then there is no need to
> remove it?  It sounds strange to me, but maybe you have reason to do
> it like that.

My intention is to unlink whenever possible and whenever control is
within the function; I thought it is okay if we error inside the called
function. If the temp file is left there, that will not be a problem,
as it will be reused (truncated first) for the next dump. If you think
it is needed, I will add a PG_TRY()/PG_CATCH() around it to catch any
error and then remove the file, as sketched below.
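
If we do add it, a minimal sketch would just wrap the existing
durable_rename() call from the patch:

	PG_TRY();
	{
		(void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR);
	}
	PG_CATCH();
	{
		/* remove the transient file if the rename itself errors out */
		unlink(transient_dump_file_path);
		PG_RE_THROW();
	}
	PG_END_TRY();
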
-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


autoprewarm_19.patch
Description: Binary data



Re: [HACKERS] POC: Cache data in GetSnapshotData()

2017-07-09 Thread Mithun Cy
On Fri, Apr 8, 2016 at 12:13 PM, Robert Haas  wrote:
> I think that we really shouldn't do anything about this patch until
> after the CLOG stuff is settled, which it isn't yet.  So I'm going to
> mark this Returned with Feedback; let's reconsider it for 9.7.

I am updating a rebased patch. I have tried to benchmark again and
could see a good improvement in the pgbench read-only case at very high
client counts on our cthulhu (8 nodes, 128 hyperthreads) and power2
(4 nodes, 192 hyperthreads) machines. There is some issue with the
base-code benchmarking, which is somehow not consistent; once I figure
out what the issue is, I will post an update.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Cache_data_in_GetSnapshotData_POC_02.patch
Description: Binary data



Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-07-06 Thread Mithun Cy
On Thu, Jul 6, 2017 at 10:52 AM, Amit Kapila  wrote:
> I am not able to understand what you want to say. Unlogged tables
> should be empty in case of crash recovery.  Also, we never flush
> unlogged buffers except for shutdown checkpoint, refer BufferAlloc and
> in particular below comment:

-- Sorry, I said that because of my lack of knowledge about unlogged
tables. Yes, what you say is right: "an unlogged table is automatically
truncated after a crash or unclean shutdown". So it will be enough if
we just dump them at shutdown time.


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-07-05 Thread Mithun Cy
On Mon, Jul 3, 2017 at 3:34 PM, Amit Kapila  wrote:
>
> Few comments on the latest patch:
>
> 1.
> + LWLockRelease(&apw_state->lock);
> + if (!is_bgworker)
> + ereport(ERROR,
> + (errmsg("could not perform block dump because dump file is being
> used by PID %d",
> + apw_state->pid_using_dumpfile)));
> + ereport(LOG,
> + (errmsg("skipping block dump because it is already being performed by PID 
> %d",
> + apw_state->pid_using_dumpfile)));
>
>
> The above code looks confusing as both the messages are saying the
> same thing in different words.  I think you keep one message (probably
> the first one) and decide error level based on if this is invoked for
> bgworker.  Also, you can move LWLockRelease after error message,
> because if there is any error, then it will automatically release all
> lwlocks.

ERROR is used for autoprewarm_dump_now(), which is called from a
backend; LOG is used for the bgworker. The wordings differ to match
whether failing to dump is acceptable in that context. In the case of
the bgworker it is acceptable: we are not particular about the start
time of a dump, only about the interval between dumps, so somebody else
already doing it is fine. But whoever calls autoprewarm_dump_now()
might be particular about the start time of the dump, so we throw an
error, making them retry.

The wordings were suggested by Robert (quoted below) in one of his
previous comments, and I agree with them. If you think I should
reconsider this and I am missing something, I am open to suggestions.

On Wed, May 31, 2017 at 10:18 PM, Robert Haas  wrote:
+If we go to perform
+an immediate dump process and finds a non-zero value already just does
+ereport(ERROR, ...), including the PID of the other process in the
+message (e.g. "unable to perform block dump because dump file is being
+used by PID %d").  In a background worker, if we go to dump and find
+the file in use, log a message (e.g. "skipping block dump because it
+is already being performed by PID %d", "skipping prewarm because block
+dump file is being rewritten by PID %d").

Thanks moved the LWLockRelease after ERROR call.

> 2.
> +autoprewarm_dump_now(PG_FUNCTION_ARGS)
> +{
> + uint32 num_blocks = 0;
> +
> ..
> + PG_RETURN_INT64(num_blocks);
> ..
> }
>
> Is there any reason for using PG_RETURN_INT64 instead of PG_RETURN_UINT32?

The return type of autoprewarm_dump_now() is pg_catalog.int8 to
accommodate uint32, so I used PG_RETURN_INT64. I think PG_RETURN_UINT32
can be used as well; I have replaced it now.

> 3.
> +dump_now(bool is_bgworker)
> {
> ..
> + if (buf_state & BM_TAG_VALID)
> + {
> + block_info_array[num_blocks].database = bufHdr->tag.rnode.dbNode;
> + block_info_array[num_blocks].tablespace = bufHdr->tag.rnode.spcNode;
> + block_info_array[num_blocks].filenode = bufHdr->tag.rnode.relNode;
> + block_info_array[num_blocks].forknum = bufHdr->tag.forkNum;
> + block_info_array[num_blocks].blocknum = bufHdr->tag.blockNum;
> + ++num_blocks;
> + }
> ..
> }
>I think there is no use of writing Unlogged buffers unless the dump is
>for the shutdown.  You might want to use BufferIsPermanent to detect the same.

-- I do not think that is true; pages of an unlogged table are also
read into buffers for read-only purposes. So if we miss dumping them
while we shut down, then the previous dump should be used.

> 4.
> +static uint32
> +dump_now(bool is_bgworker)
> {
> ..
> + for (num_blocks = 0, i = 0; i < NBuffers; i++)
> + {
> + uint32 buf_state;
> +
> + if (!is_bgworker)
> + CHECK_FOR_INTERRUPTS();
> ..
> }
>
> Why checking for interrupts is only for non-bgwroker cases?

-- autoprewarm_dump_now() is called directly from a backend. In that
case, we have to handle the signals registered for the backend in
dump_now(). For the bgworker, dump_block_info_periodically(), the
caller of dump_now(), handles SIGTERM and SIGUSR1, which are the ones
we are interested in.

> 5.
> + * Each entry of BlockRecordInfo consists of database, tablespace, filenode,
> + * forknum, blocknum. Note that this is in the text form so that the dump
> + * information is readable and can be edited, if required.
> + */
>
> In the above comment, you are saying that the dump file is in text
> form whereas in the code you are using binary form.  I think code
> should match comments. Is there a reason of preferring binary over
> text or vice versa?

-- Previously I used write() on a file descriptor. Sorry, I should have
changed the open mode to text mode when I moved the code to use
AllocateFile(); fixed now.

> 6.
> +dump_now(bool is_bgworker)
> {
> ..
> + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR);
> + apw_state->pid_using_dumpfile = InvalidPid;
> ..
> }
>
> How will pid_using_dumpfile be set to InvalidPid in the case of error
> for non-bgworker cases?

-- I have a PG_TRY()/PG_CATCH() in autoprewarm_dump_now(); I think that is okay.

>
> 7.
> +dump_now(bool is_bgworker)
> {
> ..
> + (void) durable_rename(transient_dump_file_path, AUTOPREWARM_FILE, ERROR);
>
> ..
> }
>
> How will 

Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-07-05 Thread Mithun Cy
On Mon, Jul 3, 2017 at 11:58 AM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Sun, Jul 2, 2017 at 10:32 PM, Mithun Cy <mithun...@enterprisedb.com> wrote:
>> On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy <mithun...@enterprisedb.com> 
>> wrote:
>>> On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown <t...@linux.com> wrote:
>>>>
>>>> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
>>>> detect an autoprewarm process already running.  I'd want this to
>>>> return NULL or an error if called for a 2nd time.
>>>
>>> We log instead of raising an error because we can only check after
>>> launching the worker, and inside the worker. One solution could be
>>> that, similar to autoprewarm_dump_now(), autoprewarm_start_worker()
>>> could init shared memory and check in the backend itself whether we
>>> can launch the worker. I will try to fix it that way.
>>
>> I have fixed it now as follows
>>
>> +Datum
>> +autoprewarm_start_worker(PG_FUNCTION_ARGS)
>> +{
>> +   pid_t   pid;
>> +
>> +   init_apw_shmem();
>> +   pid = apw_state->bgworker_pid;
>> +   if (pid != InvalidPid)
>> +   ereport(ERROR,
>> +   (errmsg("autoprewarm worker is already running under PID %d",
>> +   pid)));
>> +
>> +   autoprewarm_dump_launcher();
>> +   PG_RETURN_VOID();
>> +}
>>
>> In the backend itself, we check whether an autoprewarm worker is
>> already running, and start one only if not. There is a possibility
>> that, if this function is executed concurrently when no worker is
>> running yet (which I think is not normal usage), both calls will claim
>> to have launched the worker successfully even though only one of them
>> could have actually done so (the other will log and silently die).
>
> Why can't we close this remaining race condition?  Basically, if we
> just perform all of the actions in this function under the lock and
> autoprewarm_dump_launcher waits till the autoprewarm worker has
> initialized the bgworker_pid, then there won't be any remaining race
> condition.  I think if we don't close this race condition, it will be
> unpredictable whether the user will get the error or there will be
> only a server log for the same.

Yes, I can make autoprewarm_dump_launcher() wait until the launched
bgworker sets its PID, but this requires one more synchronization
variable between the launcher and the worker. On top of that,
ShmemInitStruct() and LWLockAcquire(), which need to be called before
setting the PID, can throw an ERROR (which restarts the worker). So I
thought it would not be harmful to let two concurrent calls launch
workers and just log the failures. Please let me know if I need to
rethink this.
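
For what it's worth, the dynamic-bgworker handle API already lets the
launcher wait for the worker and obtain its PID, which might avoid a
new hand-rolled synchronization variable. A rough, hedged sketch
(assuming a filled-in BackgroundWorker struct named "worker"; this is
not what autoprewarm_dump_launcher() currently does):

	BackgroundWorkerHandle *handle;
	pid_t		pid;

	if (!RegisterDynamicBackgroundWorker(&worker, &handle))
		ereport(ERROR,
				(errmsg("could not register autoprewarm worker")));

	if (WaitForBackgroundWorkerStartup(handle, &pid) != BGWH_STARTED)
		ereport(ERROR,
				(errmsg("could not start autoprewarm worker")));

	/* the launcher now knows the worker's PID without racing on shared memory */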

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-07-05 Thread Mithun Cy
On Mon, Jul 3, 2017 at 3:55 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Fri, Jun 23, 2017 at 3:22 AM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Thu, Jun 15, 2017 at 12:35 AM, Mithun Cy <mithun...@enterprisedb.com> 
>> wrote:
>>
>> * Instead of creating our own buffering system via buffer_file_write()
>> and buffer_file_flush(), why not just use the facilities provided by
>> the operating system?  fopen() et. al. provide buffering, and we have
>> AllocateFile() to provide a FILE *; it's just like
>> OpenTransientFile(), which you are using, but you'll get the buffering
>> stuff for free.  Maybe there's some reason why this won't work out
>> nicely, but off-hand it seems like it might.  It looks like you are
>> already using AllocateFile() to read the dump, so using it to write
>> the dump as well seems like it would be logical.
>>
>
> One thing that is worth considering is AllocateFile is recommended to
> be used for short operations.  Refer text atop AllocateFile().  If the
> number of blocks to be dumped is large, then the file can remain open
> for the significant period of time.

-- Agreed. I think I need a suggestion on this: we will hold on to one
fd, but I am not sure how long a file has to remain open to qualify as
a case against using AllocateFile().



-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-07-02 Thread Mithun Cy
On Tue, Jun 27, 2017 at 11:41 AM, Mithun Cy <mithun...@enterprisedb.com> wrote:
> On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown <t...@linux.com> wrote:
>>
>> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
>> detect an autoprewarm process already running.  I'd want this to
>> return NULL or an error if called for a 2nd time.
>
> We log instead of raising an error because we can only check after
> launching the worker, and inside the worker. One solution could be
> that, similar to autoprewarm_dump_now(), autoprewarm_start_worker()
> could init shared memory and check in the backend itself whether we
> can launch the worker. I will try to fix it that way.

I have fixed it now as follows

+Datum
+autoprewarm_start_worker(PG_FUNCTION_ARGS)
+{
+   pid_t   pid;
+
+   init_apw_shmem();
+   pid = apw_state->bgworker_pid;
+   if (pid != InvalidPid)
+   ereport(ERROR,
+   (errmsg("autoprewarm worker is already running under PID %d",
+   pid)));
+
+   autoprewarm_dump_launcher();
+   PG_RETURN_VOID();
+}

In the backend itself, we check whether an autoprewarm worker is
already running, and start one only if not. There is a possibility
that, if this function is executed concurrently when no worker is
running yet (which I think is not normal usage), both calls will claim
to have launched the worker successfully even though only one of them
could have actually done so (the other will log and silently die). I
think that is okay, as the objective was to get one worker up and running.

I have changed the return value to void. The worker could be restarted
when there is an error, so the returned PID would not be the same as
the worker's PID in such cases; also, I do not see any use for the PID.
I have made the corresponding documentation changes.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


autoprewarm_17.patch
Description: Binary data



Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-06-27 Thread Mithun Cy
On Fri, Jun 23, 2017 at 5:45 AM, Thom Brown  wrote:

>
> I also think pg_prewarm.dump_interval should be renamed to
> pg_prewarm.autoprewarm_interval.

Thanks, I have changed it to pg_prewarm.autoprewarm_interval.

>>
>> * In the documentation, don't say "This is a SQL callable function
>> to".  This is a list of SQL-callable functions, so each thing in
>> the list is one.  Just delete this from the beginning of each
>> sentence.


> One thing I couldn't quite make sense of is:
>
> "The autoprewarm process will start loading blocks recorded in
> $PGDATA/autoprewarm.blocks until there is a free buffer left in the
> buffer pool."
>
> Is this saying "until there is a single free buffer remaining in
> shared buffers"?  I haven't corrected or clarified this as I don't
> understand it.

Sorry, that was a typo; I wanted to say "until there is no free buffer
left". Fixed in autoprewarm_16.patch.

>
> Also, I find it a bit messy that launch_autoprewarm_dump() doesn't
> detect an autoprewarm process already running.  I'd want this to
> return NULL or an error if called for a 2nd time.

We log instead of raising an error because we can only check after
launching the worker, and inside the worker. One solution could be
that, similar to autoprewarm_dump_now(), autoprewarm_start_worker()
could init shared memory and check in the backend itself whether we can
launch the worker. I will try to fix it that way.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-06-27 Thread Mithun Cy
  I
> think that after calling dump_now(true) it should just "continue",
> which will automatically retest got_sigterm.  You could rightly object
> to that plan on the grounds that we then wouldn't recheck got_sighup
> promptly, but you can fix that by moving the got_sighup test to the
> top of the loop, which is a good idea anyway for at least two other
> reasons.  First, you probably want to check for a pending SIGHUP on
> initially entering this function, because something might have changed
> during the prewarm phase, and second, see the previous comment about
> using the "another valid coding pattern" from latch.h, which puts the
> ResetLatch() at the bottom of the loop.

-- Agreed. My idea was that if we receive SIGTERM while we are dumping,
or just immediately after dumping, we need not dump again for shutdown.
I think I was wrong, so I have fixed it as you suggested.

>
> * I think that launch_autoprewarm_dump() should ereport(ERROR, ...)
> rather than just return NULL if the feature is disabled.  Maybe
> something like ... ERROR: pg_prewarm.dump_interval must be
> non-negative in order to launch worker

-- I have removed the pg_prewarm.dump_interval = -1 case as you
suggested below, so the error is no longer needed.

> * Not sure about this one, but maybe we should consider just getting
> rid of pg_prewarm.dump_interval = -1 altogether and make the minimum
> value 0. If pg_prewarm.autoprewarm = on, then we start the worker and
> dump according to the dump interval; if pg_prewarm.autoprewarm = off
> then we don't start the worker automatically, but we still let you
> start it manually.  If you do, it respects the configured
> dump_interval.  With this design, we don't need the error suggested in
> the previous item at all, and the code can be simplified in various
> places --- all the checks for AT_PWARM_OFF go away.  And I don't see
> that we're really losing anything.  There's not much sense in dumping
> but not prewarming or prewarming but not dumping, so having
> pg_prewarm.autoprewarm configure whether the worker is started
> automatically rather than whether it prewarms (with a separate control
> for whether it dumps) seems to make sense.  The one time when you want
> to do one without the other is when you first install the extension --
> during the first server lifetime, you'll want to dump, so that after
> the next restart you have something to preload.  But this design would
> allow that.

-- Agreed. I have removed the pg_prewarm.dump_interval = -1 case. I had
a similar doubt, which I tried to raise previously:

On Tue, May 30, 2017 at 10:16 AM, Mithun Cy <mithun...@enterprisedb.com> wrote:
>> There is another GUC setting, pg_prewarm.dump_interval; if it is -1,
>> we stop the running autoprewarm worker. I have a doubt: should we
>> combine these two entities into one, such that it controls the state
>> of the autoprewarm worker?

Now I have one doubt: do we need a mechanism to stop a running
autoprewarm worker while keeping the server alive? Can I use
pg_prewarm.autoprewarm for that purpose?


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


autoprewarm_16.patch
Description: Binary data



Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-06-14 Thread Mithun Cy
On Mon, Jun 12, 2017 at 7:34 PM, Amit Kapila <amit.kapil...@gmail.com>
wrote:

> On Mon, Jun 12, 2017 at 6:31 AM, Mithun Cy <mithun...@enterprisedb.com>
> wrote:
> > Thanks, Amit,
> >
>
> + /* Perform autoprewarm's task. */
> + if (todo_task == TASK_PREWARM_BUFFERPOOL &&
> + !state->skip_prewarm_on_restart)
>
> Why have you removed above comment in the new patch?  I am not
> pointing this because above comment is meaningful, rather changing
> things in different versions of the patch without informing reviewer
> can increase the time to review.  I feel you can write some better
> comment here.
>

That happened during a previous comment fix; I think I removed it in
patch_12 itself, and I stated so in that mail. I felt this code was
simple, so there was no need to add new comments. I have tried to add a
few now, as suggested.


>
> 1.
> new file mode 100644
> index 000..6c35fb7
> --- /dev/null
> +++ b/contrib/pg_prewarm/pg_prewarm--1.1--1.2.sql
> @@ -0,0 +1,14 @@
> +/* contrib/pg_prewarm/pg_prewarm--1.0--1.1.sql */
>
> In comments, the SQL file name is wrong.
>

-- Sorry, Fixed.


> 2.
> + /* Define custom GUC variables. */
> + if (process_shared_preload_libraries_in_progress)
> + DefineCustomBoolVariable("pg_prewarm.autoprewarm",
> + "Enable/Disable auto-prewarm feature.",
> + NULL,
> + ,
> + true,
> + PGC_POSTMASTER,
> + 0,
> + NULL,
> + NULL,
> + NULL);
> +
> + DefineCustomIntVariable("pg_prewarm.dump_interval",
> +   "Sets the maximum time between two buffer pool dumps",
> + "If set to zero, timer based dumping is disabled."
> + " If set to -1, stops autoprewarm.",
> + &dump_interval,
> + AT_PWARM_DEFAULT_DUMP_INTERVAL,
> + AT_PWARM_OFF, INT_MAX / 1000,
> + PGC_SIGHUP,
> + GUC_UNIT_S,
> + NULL,
> + NULL,
> + NULL);
> +
> + EmitWarningsOnPlaceholders("pg_prewarm");
> +
> + /* If not run as a preloaded library, nothing more to do. */
> + if (!process_shared_preload_libraries_in_progress)
> + return;
>
> a. You can easily write this code such that
> process_shared_preload_libraries_in_progress needs to be checked just
> once.  Move the define of pg_prewarm.dump_interval at first place and
> then check if (!process_shared_preload_libraries_in_progress ) return.
>

-- Thanks, I have fixed it as suggested. Previously I did it that way
to avoid calling EmitWarningsOnPlaceholders() in two different places.


>
> b. Variable name autoprewarm isn't self-explanatory, also if you have
> to search the use of this variable in the code, it is difficult
> because a lot of unrelated usages can pop-up.  How about naming it as
> start_prewarm_worker or enable_autoprewarm or something like that?
>

-- The name was chosen as part of previous comments; I think
enable_autoprewarm looks good, so I am renaming it to that. Please let
me know if I need to reconsider.


>
> 3.
> +static AutoPrewarmSharedState *state = NULL;
>
> Again, the naming of this variable (state) is not meaningful.  How
> about SharedPrewarmState or something like that?
>

-- The state is for both the prewarm and dump workers, and I would like
to keep the name simple and short. Elsewhere I have used
"apw_sigterm_handler", so I think "apw_state" could be a good
compromise. I have renamed the functions to reset_apw_state and
init_apw_state along similar lines. Please let me know if I need to
reconsider.


>
> 4.
> + ereport(LOG,
> + (errmsg("autoprewarm has started")));
> ..
> + ereport(LOG,
> + (errmsg("autoprewarm shutting down")));
>
> How about changing messages as "autoprewarm worker started" and
> "autoprewarm worker stopped" respectively?
>

-- Thanks fixed as suggested.


>
> 5.
> +void
> +dump_block_info_periodically(void)
> +{
> + TimestampTz last_dump_time = GetCurrentTimestamp();
> ..
> + if (TimestampDifferenceExceeds(last_dump_time,
> +   current_time,
> +   (dump_interval * 1000)))
> + {
> + dump_now(true);
> ..
>
> With above coding, it will not dump the very first time it tries to
> dump the blocks information.  I think it is better if it dumps the
> first time and then dumps after dump_interval.  I think it is not
> difficult to arrange above code to do so if you also think that is
> better behavior?
>

-- Thanks, agreed; fixed as suggested (a condensed sketch of the
reworked loop is below).
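
A condensed, hedged sketch of the reworked loop (latch handling and
SIGHUP processing omitted; names follow the excerpts above):

	TimestampTz last_dump_time = 0;		/* forces an immediate first dump */

	while (!got_sigterm)
	{
		TimestampTz current_time = GetCurrentTimestamp();

		if (last_dump_time == 0 ||
			TimestampDifferenceExceeds(last_dump_time, current_time,
									   dump_interval * 1000))
		{
			dump_now(true);
			last_dump_time = GetCurrentTimestamp();
		}

		/* ... WaitLatch() until the next deadline, then ResetLatch() ... */
	}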


>
> 6.
> +dump_now(bool is_bgworker)
> {
> ..
> + if (write(fd, buf, buflen) < buflen)
> + ereport(ERROR,
> + (errcode_for_file_access(),
> + errmsg("could not write to file \"%s\" : %m",
> + transient_dump_file_path)));
> ..
> }
>
> You seem to forget to close and unlink the file in above code path

Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-06-11 Thread Mithun Cy
Thanks, Amit,

On Fri, Jun 9, 2017 at 8:07 PM, Amit Kapila  wrote:

>
> Few comments on the latest patch:
> +
> + /* Prewarm buffer. */
> + buf = ReadBufferExtended(rel, blk->forknum, blk->blocknum, RBM_NORMAL,
> + NULL);
> + if (BufferIsValid(buf))
> + ReleaseBuffer(buf);
> +
> + old_blk = blk;
> + ++pos;
>
> You are incrementing position at different places in this loop. I
> think you can do it once at the start of the loop.


Fixed; all of the ++pos increments are now moved to one place.


> 2.
> +dump_now(bool is_bgworker)
> {
> ..
> + fd = OpenTransientFile(transient_dump_file_path,
> +   O_CREAT | O_WRONLY | O_TRUNC, 0666);
>
> +prewarm_buffer_pool(void)
> {
> ..
> + file = AllocateFile(AUTOPREWARM_FILE, PG_BINARY_R);
>
> During prewarm, you seem to be using binary mode to open a file
> whereas during dump binary flag is not passed.  Is there a reason
> for such a difference?
>

-- Sorry, fixed now; both use binary mode.


>
> 3.
> + ereport(LOG,
> + (errmsg("saved metadata info of %d blocks", num_blocks)));
>
> It doesn't seem like a good idea to log this info at each dump
> interval.  How about making this as a DEBUG1 message?


-- Fixed, made it DEBUG1, along with the "autoprewarm load task ended" message.


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


autoprewarm_13.patch
Description: Binary data



Re: [HACKERS] List of hostaddrs not supported

2017-06-09 Thread Mithun Cy
On Fri, Jun 9, 2017 at 3:22 PM, Heikki Linnakangas  wrote:
> Hmm, there is one problem with our current use of comma as a separator: you
> cannot use a Unix-domain socket directory that has a comma in the name,
> because it's interpreted as multiple hostnames. E.g. this doesn't work:
>
> psql "host=/tmp/dir,with,commas"
>
> For hostnames, ports, and network addresses (hostaddr), a comma is not a
> problem, as it's not a valid character in any of those.
>
> I don't know if that was considered when this patch was developed. I
> couldn't find a mention of this in the archives. But in any case, that's
> quite orthogonal to the rest of this thread.

I think this was found earlier [1], but it appeared difficult to fix
without breaking other APIs' behavior [2].

[1] UDS with comma in name

[2] Fix for comma in UDS path



Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-06-08 Thread Mithun Cy
I have merged Rafia's patch for the cosmetic changes. I have also fixed
some of the issues you recommended on top of that, but I kept a few as
they are since Rafia's opinion is needed on those.

On Wed, Jun 7, 2017 at 5:57 PM, Robert Haas  wrote:
> My experience is the opposite.  If I Google "from another database",
> including the quotes, I get 516,000 hits; if I do the same using "of
> another database", I get 110,000 hits.  I think that means the former
> usage is more popular, although obviously to some degree it's a matter
> of taste.

-- Open.

>
> + *database, tablespace, filenode, forknum, blocknum in
>
> The file doesn't contain the spaces you added here, which is probably
> why Mithun did it as he did, but actually we don't really need this
> level of detail in the file header comment anyway.  I think you could
> drop the specific mention of how blocks are identified.

-- Fixed. It has been removed now.

> + * GUC variable to decide if autoprewarm worker has to be started when
>
> if->whether? has to be->should be?

> Actually this patch uses "if" in various places where I would use
> "whether", but that's probably a matter of taste to some extent.

-- Fixed. I have changed from if to whether.

>
> - * Start of prewarm per-database worker. This will try to load blocks of one
> - * database starting from block info position state->prewarm_start_idx to
> - * state->prewarm_stop_idx.
> + * Connect to the database and load the blocks of that database starting from
> + * the position state->prewarm_start_idx to state->prewarm_stop_idx.
>
> That's a really good rephrasing.  It's more compact and more
> imperative.  The only thing that seems a little off is that you say
> "starting from" and then mention both the start and stop indexes.
> Maybe say "between" instead.

-- It is a half-open interval, so I rewrote it using the [start, stop) notation.

>
> - * Quit if we've reached records for another database. Unless the
> - * previous blocks were of global objects which were combined with
> - * next database's block infos.
> + * Quit if we've reached records for another database. If previous
> + * blocks are of some global objects, then continue pre-warming.
>
> Good.
>
> - * When we reach a new relation, close the old one.  Note, however,
> - * that the previous try_relation_open may have failed, in which case
> - * rel will be NULL.
> + * As soon as we encounter a block of a new relation, close the old
> + * relation. Note, that rel will be NULL if try_relation_open failed
> + * previously, in that case there is nothing to close.
>
> I wrote the original comment here, so you may not be too surprised to
> here that I liked it as it was.  Actually, your rewrite of the first
> sentence seems like it might be better, but I think the second one is
> not.  By deleting "however", you've made the comma after "Note"
> redundant, and by changing "in which case" to "in that case", you've
> made a dependent clause into a comma splice.  I hate comma splices.
>
> https://en.wikipedia.org/wiki/Comma_splice

-- Open

> + * $PGDATA/AUTOPREWARM_FILE to a dsm. Further, these BlockInfoRecords are
> I would probably capitalize DSM here.  There's no data structure
> called dsm (lower-case) and we have other examples where it is
> capitalized in documentation and comment text (see, e.g.
> custom-scan.sgml).

-- Fixed.

>
> + * Since there could be at most one worker for prewarm, locking is not
>
> could -> can?

-- Fixed.

>
> - * For block info of a global object whose database will be 0
> - * try to combine them with next non-zero database's block
> - * infos to load.
> + * For the BlockRecordInfos of a global object, combine them
> + * with BlockRecordInfos of the next non-global object.
>
> Good.  Or even "Combine BlockRecordInfos for a global object with the
> next non-global object", which I think is even more compact.

-- Fixed.

>
> + * If we are here with database having InvalidOid, then only
> + * BlockRecordInfos belonging to global objects exist. Since, we can
> + * not connect with InvalidOid skip prewarming for these objects.
>
> If we reach this point with current_db == InvalidOid, ...

--Fixed.

>
> + *Sub-routine to periodically call dump_now().
>
> Every existing use of the word subroutine in our code based spells it
> with no hyphen.
>
> [rhaas pgsql]$ git grep -- Subroutine | wc -l
>   47
> [rhaas pgsql]$ git grep -- Sub-routine | wc -l
>0
>
> Personally, I find this change worse, because calling something a
> subroutine is totally generic, like saying that you met a "person" or
> that something was "nice".  Calling it a loop gives at least a little
> bit of specific information.

-- Fixed. We call it a loop now.

>
> + * Call dum_now() at regular intervals defined by GUC variable 

Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-06-08 Thread Mithun Cy
On Tue, Jun 6, 2017 at 3:48 PM, Neha Sharma wrote:
> Hi,
>
> I have been testing this feature for a while and below are the observations
> for few scenarios.
>
> Observation:
> scenario 1: If we set pg_prewarm.dump_interval = -1.0,we get an additional
> warning message in logfile and instead of ending the task of auto-dump it
> executes successfully.
> [centos@test-machine bin]$ more logfile
> 2017-06-06 08:39:53.127 GMT [21905] WARNING:  invalid value for parameter
> "pg_prewarm.dump_interval": "-1.0"
> 2017-06-06 08:39:53.127 GMT [21905] HINT:  Valid units for this parameter
> are "ms", "s", "min", "h", and "d".
>
> postgres=# show pg_prewarm.dump_interval;
>  pg_prewarm.dump_interval
> --
>  5min
> (1 row)
>
> scenario 2: If we set pg_prewarm.dump_interval = 0.0,we get an additional
> warning message in logfile and the message states that the task was started
> and the worker thread it is also active,but the dump_interval duration is
> set to default 5 min (300 sec) instead of 0.
>
> [centos@test-machine bin]$ ps -ef | grep prewarm
> centos   21980 21973  0 08:54 ?00:00:00 postgres: bgworker:
> autoprewarm
>
> [centos@test-machine bin]$ more logfile
> 2017-06-06 09:20:52.436 GMT [3] WARNING:  invalid value for parameter
> "pg_prewarm.dump_interval": "0.0"
> 2017-06-06 09:20:52.436 GMT [3] HINT:  Valid units for this parameter
> are "ms", "s", "min", "h", and "d".
> postgres=# show pg_prewarm.dump_interval;
>  pg_prewarm.dump_interval
> --
>  5min
> (1 row)

dump_interval is an int-type custom GUC variable, so passing a floating-point
value is illegal here. The reason we only throw a WARNING and fall back to the
default of 5 minutes is just the existing behavior:
@define_custom_variable

/*
 * Assign the string value(s) stored in the placeholder to the real
 * variable.  Essentially, we need to duplicate all the active and stacked
 * values, but with appropriate validation and datatype adjustment.
 *
 * If an assignment fails, we report a WARNING and keep going.  We don't
 * want to throw ERROR for bad values, because it'd bollix the add-on
 * module that's presumably halfway through getting loaded.  In such cases
 * the default or previous state will become active instead.
 */

/* First, apply the reset value if any */
if (pHolder->reset_val)
(void) set_config_option(name, pHolder->reset_val,
 pHolder->gen.reset_scontext,
 pHolder->gen.reset_source,
 GUC_ACTION_SET, true, WARNING, false);

I think this should be fine as it is defined behavior for all of the
add-on modules. Please let me know if you have any suggestions.




Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-06-04 Thread Mithun Cy
On Wed, May 31, 2017 at 10:18 PM, Robert Haas  wrote:
> + *contrib/autoprewarm.c
>
> Wrong.

-- Oops Sorry fixed.

> +Oiddatabase;/* database */
> +Oidspcnode;/* tablespace */
> +Oidfilenode;/* relation's filenode. */
> +ForkNumberforknum;/* fork number */
> +BlockNumber blocknum;/* block number */
>
> Why spcnode rather than tablespace?  Also, the third line has a
> period, but not the others.  I think you could just drop these
> comments; they basically just duplicate the structure names, except
> for spcnode which doesn't but probably should.

-- Dropped comments and changed spcnode to tablespace.

> +boolcan_do_prewarm; /* if set can't do prewarm task. */
>
> The comment and the name of the variable imply opposite meanings.

-- Sorry a typo. Now this variable has been removed as you have
suggested with new variables in AutoPrewarmSharedState.

> +/*
> + * detach_blkinfos - on_shm_exit detach the dsm allocated for blockinfos.
> + */
> I assume you don't really need this.  Presumably process exit is going
> to detach the DSM anyway.

-- Yes considering process exit will detach the dsm, I have removed
that function.

> +if (seg == NULL)
> +ereport(ERROR,
> +(errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
> + errmsg("unable to map dynamic shared memory segment")));
>
> Let's use the wording from the message in parallel.c rather than this
> wording.  Actually, we should probably (as a separate patch) change
> test_shm_mq's worker.c to use the parallel.c wording also.

-- I have corrected the message with "could not map dynamic shared
memory segment" as in parallel.c

> +SetCurrentStatementStartTimestamp();
>
> Why do we need this?

-- Removed. Sorry, I forgot to remove it when I removed the SPI connection code.

> +StartTransactionCommand();
>
> Do we need a transaction?  If so, how about starting a new transaction
> for each relation instead of using a single one for the entire
> prewarm?

-- We call relation_open, hence we need a transaction. As suggested, we now
start a new transaction for every new relation.
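A rough sketch of the resulting shape (simplified: error handling, the
per-database filter, and the check that a block still falls inside the
relation are omitted; block_info, start_idx and stop_idx are assumed names
based on this discussion, not the exact patch code):

	Oid			prev_filenode = InvalidOid;
	Relation	rel = NULL;
	bool		in_xact = false;
	uint32		i;

	for (i = start_idx; i < stop_idx; i++)
	{
		BlockInfoRecord *blk = &block_info[i];
		Buffer		buf;

		if (blk->filenode != prev_filenode)
		{
			Oid			reloid;

			/* Close out the previous relation and its transaction. */
			if (rel)
				relation_close(rel, AccessShareLock);
			if (in_xact)
				CommitTransactionCommand();

			StartTransactionCommand();
			in_xact = true;

			reloid = RelidByRelfilenode(blk->tablespace, blk->filenode);
			rel = OidIsValid(reloid) ?
				try_relation_open(reloid, AccessShareLock) : NULL;
			prev_filenode = blk->filenode;
		}

		if (!rel)
			continue;			/* relation is gone; skip its blocks */

		buf = ReadBufferExtended(rel, blk->forknum, blk->blocknum,
								 RBM_NORMAL, NULL);
		if (BufferIsValid(buf))
			ReleaseBuffer(buf);
	}

	if (rel)
		relation_close(rel, AccessShareLock);
	if (in_xact)
		CommitTransactionCommand();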

> +if (status == BGWH_STOPPED)
> +return;
> +
> +if (status == BGWH_POSTMASTER_DIED)
> +{
> +ereport(ERROR,
> +(errcode(ERRCODE_INSUFFICIENT_RESOURCES),
> +  errmsg("cannot start bgworker autoprewarm without postmaster"),
> + errhint("Kill all remaining database processes and restart"
> + " the database.")));
> +}
> +
> +Assert(0);
>
> Instead of writing it this way, remove the first "if" block and change
> the second one to Assert(status == BGWH_STOPPED).  Also, I'd ditch the
> curly braces in this case.

-- Fixed as suggested.

>
> +file = fopen(AUTOPREWARM_FILE, PG_BINARY_R);
>
> Use AllocateFile() instead.  See the comments for that function and at
> the top of fd.c.  Then you don't need to worry about leaking the file
> handle on an error, so you can remove the fclose() before ereport()
> stuff -- which is incomplete anyway, because you could fail e.g.
> inside dsm_create().

-- Using AllocateFile now.

>
> + errmsg("Error reading num of elements in \"%s\" for"
> +" autoprewarm : %m", AUTOPREWARM_FILE)));
>
> As I said in a previous review, please do NOT split error messages
> across lines like this.  Also, this error message is nothing close to
> PostgreSQL style. Please read
> https://www.postgresql.org/docs/devel/static/error-style-guide.html
> and learn to follow all those guidelines written therein.  I see at
> least 3 separate problems with this message.

-- Thanks, I have tried to fix it now.

> +num_elements = i;
>
> I'd do something like if (i != num_elements) elog(ERROR, "autoprewarm
> block dump has %d entries but expected %d", i, num_elements);  It
> seems OK for this to be elog() rather than ereport() because the file
> should never be corrupt unless the user has cheated by hand-editing
> it.

-- Fixed as suggested. Now eloged as an ERROR.

> I think you can get rid of next_db_pos altogether, and this
> prewarm_elem thing too.  Design sketch:
>
> 1. Move all the stuff that's in prewarm_elem into
> AutoPrewarmSharedState.  Rename start_pos to prewarm_start_idx and
> end_of_blockinfos to prewarm_stop_idx.
>
> 2. Instead of computing all of the database ranges first and then
> launching workers, do it all in one loop.  So figure out where the
> records for the current database end and set prewarm_start_idx and
> prewarm_end_idx appropriately.  Launch a worker.  When the worker
> terminates, set prewarm_start_idx = prewarm_end_idx and advance
> prewarm_end_idx to the end of the next database's records.
>
> This saves having to allocate memory for the next_db_pos array, and it
> also avoids this crock:
>
> +memcpy(, 

Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-05-30 Thread Mithun Cy
On Tue, May 30, 2017 at 12:36 PM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:
> On 27.10.2016 14:39, Mithun Cy wrote:
> And as far as I understand pg_autoprewarm has all necessary infrastructure
> to do parallel load. We just need to spawn more than one background worker
> and specify
> separate block range for each worker.
>
> Do you think that such functionality (parallel autoprewarm) can be useful
> and be easily added?

I have not put much attention into making autoprewarm parallel. But given
the current state of the code, making the sub-workers load the blocks in
parallel should be possible and, I think, could be easily added. We would
probably need a configurable parameter to restrict the maximum number of
parallel prewarm sub-workers. Since I have not thought much about this, I
might not be aware of all of the difficulties around it.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-05-30 Thread Mithun Cy
On Tue, May 30, 2017 at 10:16 AM, Mithun Cy <mithun...@enterprisedb.com> wrote:
> Thanks Robert,

Sorry, there was one more mistake ( a typo) in dump_now() instead of
using pfree I used free corrected same in the new patch v10.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


autoprewarm_10.patch
Description: Binary data



Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-05-29 Thread Mithun Cy
Thanks Robert,

On Wed, May 24, 2017 at 5:41 PM, Robert Haas  wrote:
> + *
> + * Once the "autoprewarm" bgworker has completed its prewarm task, it will
> + * start a new task to periodically dump the BlockInfoRecords related to blocks
> + * which are currently in shared buffer pool. Upon next server restart, the
> + * bgworker will prewarm the buffer pool by loading those blocks. The GUC
> + * pg_prewarm.dump_interval will control the dumping activity of the bgworker.
> + */
>
> Make this part of the file header comment.

-- Thanks, I have moved same as part of header description.

> Also, add an enabling GUC.
> The default can be true, but it should be possible to preload the
> library so that the SQL functions are available without a dynamic
> library load without requiring you to get the auto-prewarm behavior.
> I suggest pg_prewarm.autoprewarm = true / false.

-- Thanks, I have added a boolean GUC pg_prewarm.autoprewarm with default
true. I have made it a PGC_POSTMASTER-level variable, considering that the
intention here is to avoid starting the autoprewarm worker at all. Need help:
have I missed anything? Currently the SQL-callable functions
autoprewarm_dump_now() and launch_autoprewarm_dump() are only available after
a CREATE EXTENSION pg_prewarm command; will this change now?
There is another GUC setting, pg_prewarm.dump_interval: if it is set to -1 we
stop the running autoprewarm worker. I have a doubt: should we combine these
2 entities into one, such that it controls the state of the autoprewarm
worker?

> Your SigtermHandler and SighupHandler routines are still capitalized
> in a way that's not very consistent with what we do elsewhere.  I
> think all of our other signal handlers have names_like_this() not
> NamesLikeThis().
>

-- handler functions are renamed for example apw_sigterm_handler, as
similar in autovacuum.c

> + * ==types and variables used by autoprewam  
> =
>
> Spelling.

-- Fixed, Sorry.


> + * Meta-data of each persistent block which is dumped and used to load.
>
> Metadata

-- Fixed.


> +typedef struct BlockInfoRecord
> +{
> +Oiddatabase;/* database */
> +OidspcNode;/* tablespace */
> +Oidfilenode;/* relation's filenode. */
> +ForkNumberforknum;/* fork number */
> +BlockNumber blocknum;/* block number */
> +} BlockInfoRecord;
>
> spcNode is capitalized differently from all of the other members.

-- renamed from spcNode to spcnode.
>
> +LWLock   *lock;/* protects SharedState */
>
> Just declare this as LWLock lock, and initialize it using
> LWLockInitialize.  The way you're doing it is more complicated.

-- Fixed as suggested
LWLockInitialize(&state->lock, LWLockNewTrancheId());

> +static dsa_area *AutoPrewarmDSA = NULL;
>
> DSA seems like more than you need here.  There's only one allocation
> being performed.  I think it would be simpler and less error-prone to
> use DSM directly.  I don't even think you need a shm_toc.  You could
> just do:
>
> seg = dsm_create(segsize, 0);
> blkinfo = dsm_segment_address(seg);
> Then pass dsm_segment_handle(seg) to the worker using bgw_main_arg or
> bgw_extra, and have it call dsm_attach.  An advantage of this approach
> is that you only allocate the memory you actually need, whereas DSA
> will allocate more, expecting that you might do further allocations.

- Thanks, fixed. We now pass the following arguments to the sub-workers through bgw_extra:
/*
 * The block_infos allocated to each sub-worker to do prewarming.
 */
typedef struct prewarm_elem
{
dsm_handle  block_info_handle;  /* handle to dsm seg of block_infos */
Oid database;   /* database to connect and load */
uint32  start_pos;  /* start position within block_infos from
 * which sub-worker start prewaring blocks. */
uint32  end_of_blockinfos;  /* End of block_infos in dsm */
} prewarm_elem;

to distribute the prewarm work among subworkers.
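To make the bgw_extra hand-off concrete, a rough sketch of the parent side as
I understand it (block_info_array, num_blocks, current_db and the index range
are illustrative names, not the exact patch code):

	dsm_segment *seg;
	prewarm_elem elem;
	BackgroundWorker worker;
	BackgroundWorkerHandle *handle;

	/* Copy the BlockInfoRecords into a DSM segment. */
	seg = dsm_create(sizeof(BlockInfoRecord) * num_blocks, 0);
	memcpy(dsm_segment_address(seg), block_info_array,
		   sizeof(BlockInfoRecord) * num_blocks);

	/* Hand the handle plus a work range to the per-database sub-worker. */
	elem.block_info_handle = dsm_segment_handle(seg);
	elem.database = current_db;
	elem.start_pos = start_idx;
	elem.end_of_blockinfos = stop_idx;

	memset(&worker, 0, sizeof(worker));
	worker.bgw_flags = BGWORKER_SHMEM_ACCESS | BGWORKER_BACKEND_DATABASE_CONNECTION;
	worker.bgw_start_time = BgWorkerStart_ConsistentState;
	worker.bgw_restart_time = BGW_NEVER_RESTART;
	worker.bgw_notify_pid = MyProcPid;
	snprintf(worker.bgw_library_name, BGW_MAXLEN, "pg_prewarm");
	snprintf(worker.bgw_function_name, BGW_MAXLEN, "load_one_database");
	snprintf(worker.bgw_name, BGW_MAXLEN, "autoprewarm sub-worker");
	StaticAssertStmt(sizeof(prewarm_elem) <= BGW_EXTRALEN,
					 "prewarm_elem must fit into bgw_extra");
	memcpy(worker.bgw_extra, &elem, sizeof(elem));

	RegisterDynamicBackgroundWorker(&worker, &handle);
	/* The sub-worker copies prewarm_elem back out of MyBgworkerEntry->bgw_extra
	 * and calls dsm_attach(elem.block_info_handle) to reach the block list. */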

>
> +pg_qsort(block_info_array, num_blocks, sizeof(BlockInfoRecord),
> + blockinfo_cmp);
>
> I think it would make more sense to sort the array on the load side,
> in the autoprewarm worker, rather than when dumping.

Fixed. Now the block_infos are sorted just before loading the blocks.

> + errmsg("autoprewarm: could not open \"%s\": %m",
> +dump_file_path)));
>
> Use standard error message wordings!  Don't create new translatable
> messages by adding "autoprewarm" to error messages.  There are
> multiple instances of this throughout the patch.

-- Removed "autoprewarm" from the error messages in all of the places.


> +snprintf(dump_file_path, sizeof(dump_file_path),
> + "%s", AUTOPREWARM_FILE);
>
> This is completely pointless.  You can get rid of the dump_file_path
-- dump_file_path is removed.


>
> +SPI_connect();
> +

Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-05-24 Thread Mithun Cy
On Tue, May 23, 2017 at 7:06 PM, Mithun Cy <mithun...@enterprisedb.com> wrote:
> Thanks, Andres,
>
> I have tried to fix all of your comments.

There was a typo in the previous patch 07: I had used != instead of ==.
There was also a mistake in the comments. I have corrected both now.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


autoprewarm_08.patch
Description: Binary data



Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-05-23 Thread Mithun Cy
Thanks, Andres,

I have tried to fix all of your comments. One important change in this
patch: previously we used to read one block info structure at a time and
load it, but now we read all of them together, load them into a DSA, and
then distribute the block infos to the sub-workers to load the
corresponding blocks. With this, the portability issue I mentioned above
no longer exists, as we do not store any map tables within the dump file.

On Thu, Apr 6, 2017 at 4:12 AM, Andres Freund <and...@anarazel.de> wrote:
> On 2017-03-13 18:45:00 +0530, Mithun Cy wrote:
>> launched only after previous one is finished. All of this will only
>> start if the database has reached a consistent state.
>
> Hm. For replay performance it'd possibly be good to start earlier,
> before reaching consistency.  Is there an issue starting earlier?

Earlier patches used to start loading blocks before reaching a consistent
state. Then Robert, while reviewing, found a basic flaw in my approach:
the function DropRelFileNodesAllBuffers does not expect others to load
blocks concurrently while it is getting rid of buffered blocks. So we
have to delay loading until the database reaches a consistent state, so
that we can connect to each database and take a relation lock before
loading any of its blocks.

>> + * -- Automatically prewarm the shared buffer pool when server restarts.
>
> Don't think we ususally use -- here.

-- Fixed.


>> + *   Copyright (c) 2013-2017, PostgreSQL Global Development Group
>
> Hm, that's a bit of a weird date range.

-- Changed it to 2016-2017; is that right?


> The pg_prewarm.c in there looks like some search & replace gone awry.

-- sorry Fixed.


>> +#include "utils/rel.h"
>> +#include "port/atomics.h"
>
> I'd rather just sort these alphabetically.
> I think this should rather be in the initial header.

-- Fixed as suggested and have moved everything into a header file pg_prewarm.h.


> s/until there is a/until there is no/?

-- Fixed.

>
>> +/*
>> + * ==================================================================
>> + * ======================== SIGNAL HANDLERS ========================
>> + * ==================================================================
>> + */
>
> Hm...

-- I have reverted those cosmetic changes now.

>
>> +static void sigtermHandler(SIGNAL_ARGS);
>> +static void sighupHandler(SIGNAL_ARGS);
>
> I don't think that's a casing we commonly use.  We mostly use CamelCase
> or underscore_case.
>

-- Fixed with CamelCase.


>> + *   Signal handler for SIGUSR1.
>> + */
>> +static void
>> +sigusr1Handler(SIGNAL_ARGS)

> Hm, what's this one for?

-- The prewarm sub-workers will notify with SIGUSR1 on their
startup/shutdown. Updated the comments.

>> +/*
>> + * Shared state information about the running autoprewarm bgworker.
>> + */
>> +typedef struct AutoPrewarmSharedState
>> +{
>> + pg_atomic_uint32 current_task;  /* current tasks performed by
>> +
>>   * autoprewarm workers. */
>> +} AutoPrewarmSharedState;
>
> Hm.  Why do we need atomics here?  I thought there's no concurrency?

There are 3 methods in autoprewarm:
A: a preloaded prewarm bgworker, which prewarms and then does the periodic dumping;
B: a bgworker launched by launch_autoprewarm_dump(), which does the periodic dumping;
C: an immediate dump by backends.

We do not want 2 bgworkers started and working concurrently, and we do not
want to dump while the prewarm task is running. As you suggested, I rewrote
it as simple logic using a boolean variable.
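A minimal sketch of the kind of guard this boils down to (field names are
illustrative, and it assumes the LWLock-in-struct form suggested later in
this thread, not the exact patch code):

	bool		got_it = false;

	LWLockAcquire(&state->lock, LW_EXCLUSIVE);
	if (!state->task_running)
	{
		state->task_running = true;
		got_it = true;
	}
	LWLockRelease(&state->lock);

	if (!got_it)
	{
		ereport(LOG,
				(errmsg("autoprewarm task is already running, skipping")));
		return;
	}

	/* ... perform the prewarm or dump work ... */

	LWLockAcquire(&state->lock, LW_EXCLUSIVE);
	state->task_running = false;
	LWLockRelease(&state->lock);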


>> +sort_cmp_func(const void *p, const void *q)
>> +{
>
> rename to blockinfo_cmp?

-- Fixed.

>
> Superflous parens (repeated a lot).

-- Fixed in all places.


> Hm.  Is it actually helpful to store the file as text?  That's commonly
> going to increase the size of the file quite considerably, no?

-- Having it in text could help readability, or if we want to modify/adjust
it as needed. I shall do a small experiment to check how much we could save
and produce a patch on top of this for the same.

>
> But what is this actually used for?  I thought Robert, in
> http://archives.postgresql.org/message-id/CA%2BTgmoa%3DUqCL2mR%2B9WTq05tB3Up-z4Sv2wkzkDxDwBP7Mj_2_w%40mail.gmail.com
> suggested storing the filenode in the dump, and then to use
> RelidByRelfilenode to get the corresponding relation?

-- Fixed as suggested directly calling RelidByRelfilenode now.


>> +load_one_database(Datum main_arg)
>> +{
>
>> + /* check if file exists and open file in read mode. */
>> + snprintf(dump_file_pat

Re: [HACKERS] TAP tests take a long time

2017-04-12 Thread Mithun Cy
On Wed, Apr 12, 2017 at 6:24 PM, Amit Kapila wrote:
>
> I have looked into the tests and I think we can do some optimization
> without losing much on code coverage.  First is we are doing both
> Vacuum Full and Vacuum on hash_split_heap in the same test after
> executing few statements, it seems to me that we can avoid doing
> Vacuum Full.  Also, I think we can try by reducing the number of
> inserts in hash_split_heap to see if we can achieve the code coverage
> of split bucket code path.  Finally, I am not sure if Reindex Index
> hash_split_index is required for code coverage, but we might need it
> for the sake of testing that operation.
>
> Mithun, as you are the original author of these tests, can you please
> try some of the above optimizations and any others you can think of
> and see if we can reduce the time for hash index tests?

I have tried to optimize the tests maintaining the same coverage we were
able to get with it.

Original coverage

  Lines:     1466 of 2341   (62.6 %)
  Functions:   79 of  101   (78.2 %)

After test changes

  Lines:     1534 of 2341   (65.5 %)
  Functions:   79 of  101   (78.2 %)

Changes done.

Earlier:

CREATE INDEX hash_split_index on hash_split_heap USING HASH (keycol);

+ Time: 8.950 ms

INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 7) a;
+ Time: 3135.098 ms (00:03.135)

VACUUM FULL hash_split_heap;

+ Time: 2748.146 ms (00:02.748)
REINDEX INDEX hash_split_index;

+ Time: 114.290 ms

Insertion and VACUUM FULL were taking most of the time in the tests. I have
removed the VACUUM FULL as you suggested, which is taken care of by the
VACUUM statement at the end, and reduced the number of rows needed to trigger
a split by inserting some records before creating the hash index. This way we
start with fewer buckets, so splits happen much earlier.

Now :

+ INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 500) a;

+ Time: 2.946 ms

CREATE INDEX hash_split_index on hash_split_heap USING HASH (keycol);

! Time: 8.933 ms

! INSERT INTO hash_split_heap SELECT 1 FROM generate_series(1, 5000) a;

! Time: 81.674 ms

Also removed REINDEX INDEX as suggested, since it was not adding to coverage.
With this, the tests should run significantly faster now.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


hash_index_optimize_testruns.patch
Description: Binary data



Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-04-07 Thread Mithun Cy
On Thu, Apr 6, 2017 at 4:12 AM, Andres Freund <and...@anarazel.de> wrote:
> On 2017-03-13 18:45:00 +0530, Mithun Cy wrote:
>> I have implemented a similar logic now. The prewarm bgworker will
>> launch a sub-worker per database in the dump file. And, each
>> sub-worker will load its database block info. The sub-workers will be
>> launched only after previous one is finished. All of this will only
>> start if the database has reached a consistent state.
>
> Hm. For replay performance it'd possibly be good to start earlier,
> before reaching consistency.  Is there an issue starting earlier?

Thanks, Andres, for the detailed review. I will try to address the comments in
my next post, but I thought it important to reply to the above comment before
that. Earlier patches used to start loading blocks before reaching a consistent
state. Then Robert, while reviewing, found a basic flaw in my approach [1]:
the function DropRelFileNodesAllBuffers does not expect others to load blocks
concurrently while it is getting rid of buffered blocks. So we have to delay
loading until the database reaches a consistent state, so that we can connect
to each database and take a relation lock before loading any of its blocks.


[1] cannot load blocks without holding relation lock
<https://www.postgresql.org/message-id/CA%2BTgmoYNF_wfdwQ3z3713zKy2j0Z9C32WJdtKjvRWzeY7JOL4g%40mail.gmail.com>


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-04-04 Thread Mithun Cy
On Tue, Apr 4, 2017 at 9:18 AM, Robert Haas  wrote:
> Committed.

Thanks Robert,

Also, sorry -- one unfortunate thing happened in the last patch: while fixing
one of the review comments, a variable disappeared from the equation in
_hash_spareindex.

splitpoint_phases +=
-   (((num_bucket - 1) >> (HASH_SPLITPOINT_PHASE_BITS + 1)) &
+   (((num_bucket - 1) >>
+ (splitpoint_group - (HASH_SPLITPOINT_PHASE_BITS + 1))) &
 HASH_SPLITPOINT_PHASE_MASK);   /* to 0-based value. */

I wanted the most significant 3 bits, and while fixing comments in patch 11
I unknowingly removed splitpoint_group from the equation. Extremely sorry
for the mistake, and thanks to Ashutosh Sharma for pointing it out.


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


_hash_spareindex_defect.patch
Description: Binary data



Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-04-01 Thread Mithun Cy
On Sat, Apr 1, 2017 at 12:31 PM, Mithun Cy <mithun...@enterprisedb.com> wrote:

> Also adding a patch which implements the 2nd way.
Sorry, I forgot to add sortbuild_hash patch, which also needs similar
changes for the hash_mask.


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


sortbuild_hash_A_2.patch
Description: Binary data


yet_another_expand_hashbucket_efficiently_15.patch
Description: Binary data



Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-04-01 Thread Mithun Cy
On Sat, Apr 1, 2017 at 7:05 AM, Robert Haas  wrote:
> Hmm, don't the changes to contrib/pageinspect/expected/hash.out
> indicate that you've broken something?  The hash index has only 4
> buckets, so the new code shouldn't be doing anything differently, but
> you've got highmask and lowmask changing for some reason.

The highmask calculation has changed a bit to accommodate num_buckets values
that are not powers of two. If num_bucket is not a power of two, highmask
should be the next ((2^x) - 1) above it, and lowmask should be (highmask >> 1)
to cover the first half of the buckets. Generalizing this changed the masks
for power-of-two num_buckets compared with the older implementation.

+ /* set highmask, which should be sufficient to cover num_buckets. */
+ metap->hashm_highmask = (1 << (_hash_log2(num_buckets))) - 1;

But this does not cause any adverse effect; the high and low masks are still
sufficient to compute the same bucket. And if we add one more bucket, then in
_hash_expandtable we immediately make the masks bigger:
if (new_bucket > metap->hashm_highmask)
{
/* Starting a new doubling */
metap->hashm_lowmask = metap->hashm_highmask;
metap->hashm_highmask = new_bucket | metap->hashm_lowmask;

The state (metap->hashm_highmask == metap->hashm_maxbucket) is a natural state
to occur while the hash index is growing, just before doubling.
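As a worked example of the new calculation (my own numbers, not from the
patch): with num_buckets = 1000, _hash_log2(1000) = 10, so highmask =
(1 << 10) - 1 = 1023 and lowmask = highmask >> 1 = 511. A hash value is first
masked with 1023; if the result exceeds maxbucket (999), it is masked again
with 511 and lands in the first half of the table -- the same bucket
_hash_doinsert would compute.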

Another choice I could have made is to bump the number by one, so that
power-of-two num_buckets get the same highmask as the old code, and
non-power-of-two num_buckets get the immediately next ((2^x) - 1) as highmask:
+ /* set highmask, which should be sufficient to cover num_buckets. */
+ metap->hashm_highmask = (1 << (_hash_log2(num_buckets + 1))) - 1;

It was just a personal preference that I chose the first option, as it
appeared consistent with the running state of hash index expansion.

Also adding a patch which implements the 2nd way.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


yet_another_expand_hashbucket_efficiently_15.patch
Description: Binary data



Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-30 Thread Mithun Cy
Thanks, I have tried to fix all of the comments.

On Fri, Mar 31, 2017 at 8:10 AM, Robert Haas <robertmh...@gmail.com> wrote:
> On Thu, Mar 30, 2017 at 2:36 PM, Mithun Cy <mithun...@enterprisedb.com> wrote:
>> Thanks Robert, I have tried to fix the comments given as below.
>
> Thanks.
>
> Since this changes the on-disk format of hash indexes in an
> incompatible way, I think it should bump HASH_VERSION.  (Hopefully
> that doesn't preclude REINDEX?)  pg_upgrade should probably also be
> patched to issue a warning if upgrading from a version < 10 to a
> version >= 10 whenever hash indexes are present; I thought we had
> similar cases already, but I don't see them at the moment.  Maybe we
> can get Bruce or someone to give us some advice on exactly what should
> be done here.

As of now, increasing the version asks us to REINDEX (metapage access verifies
whether we are on the right version):
postgres=# set enable_seqscan= off;
SET
postgres=# select * from t1 where i = 10;
ERROR:  index "hash2" has wrong hash version
HINT:  Please REINDEX it.
postgres=# insert into t1 values(10);
ERROR:  index "hash2" has wrong hash version
HINT:  Please REINDEX it.

postgres=# REINDEX INDEX hash2;
REINDEX
postgres=# select * from t1 where i = 10;
 i

 10
(1 row)

Last time we changed this version from 1 to 2
(4adc2f72a4ccd6e55e594aca837f09130a6af62b); from the commit logs I see no
changes made specifically for upgrade.

Hi Bruce, can you please advise us what should be done here.

> In a couple of places, you say that a splitpoint group has 4 slots
> rather than 4 phases.

--Fixed

> I think that in _hash_get_totalbuckets(), you should use blah &
> HASH_SPLITPOINT_PHASE_MASK rather than blah %
> HASH_SPLITPOINT_PHASES_PER_GRP for consistency with _hash_spareindex
> and, perhaps, speed.  Similarly, instead of blah /
> HASH_SPLITPOINT_PHASES_PER_GRP, use blah >>
> HASH_SPLITPOINT_PHASE_BITS.

--Fixed

>
> buckets_toadd is punctuated oddly.  buckets_to_add?  Instead of
> hand-calculating this, how about calculating it as
> _hash_get_totalbuckets(spare_ndx) - _hash_get_totalbuckets(spare_ndx -
> 1)?

I think this should do that, considering new_bucket is nothing but the
1-based max_buckets:
buckets_to_add = _hash_get_totalbuckets(spare_ndx) - new_bucket;

That makes me do away with

+#define HASH_SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(sp_phase) \
+ (((sp_phase) < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) ? \
+ (sp_phase) : \
+ ((((sp_phase) - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) >> \
+   HASH_SPLITPOINT_PHASE_BITS) + \
+  HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE))

as it is now used in only one place, _hash_get_totalbuckets. I also think the
comments above can be removed now, as we have removed the code related to
multi-phased allocation there.

+ * The number of buckets in the new splitpoint group is equal to the
+ * total number already in existence. But we do not allocate them at
+ * once. Each splitpoint group will have 4 phases, we distribute the
+ * buckets equally among them. So we allocate only one fourth of total
+ * buckets in new splitpoint group at a time to consume one phase after
+ * another.

> +uint32  splitpoint_group = 0;
>
> Don't need the  = 0 here; the next reference to this variable is an
> unconditional initialization.

--Fixed, with new code splitpoint_group is not needed.

>
> + */
> +
> +splitpoint_group = 
> HASH_SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(spare_ndx);
>
> I would delete the blank line.

--Fixed.

>
> - * should start from ((2 ^ i) + metap->hashm_spares[i - 1] + 1).
> + * should start from
> + * (_hash_get_totalbuckets(i) + metap->hashm_spares[i - 1] + 1).
>
> Won't survive pgindent.

--Fixed as pgindent has suggested.

>
> - * The number of buckets in the new splitpoint is equal to the total
> - * number already in existence, i.e. new_bucket.  Currently this maps
> - * one-to-one to blocks required, but someday we may need a more
> - * complicated calculation here.  We treat allocation of buckets as a
> - * separate WAL-logged action.  Even if we fail after this operation,
> - * won't leak bucket pages; rather, the next split will consume this
> - * space. In any case, even without failure we don't use all the 
> space
> - * in one split operation.
> + * The number of buckets in the new splitpoint group is equal to the
> + * total number already in existence. But we do not allocate them at
> + * once. Each splitpoint group will have 4 slots, we distribute the
> + * buckets equally among them. So we allocate only one fourth of 
> total
> + * buckets in new splitpoint group at a time 

Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-30 Thread Mithun Cy
Thanks Robert, I have tried to fix the comments given as below.

On Thu, Mar 30, 2017 at 9:19 PM, Robert Haas  wrote:
> I think that the macros in hash.h need some more work:
>
> - Pretty much any time you use the argument of a macro, you need to
> parenthesize it in the macro definition to avoid surprises if the
> macros is called using an expression.  That isn't done consistently
> here.

--I have tried to fix same in the latest patch.

> - The macros make extensive use of magic numbers like 1, 2, and 3.  I
> suggest something like:
>
> #define SPLITPOINT_PHASE_BITS 2
> #define SPLITPOINT_PHASES_PER_GROUP (1 << SPLITPOINT_PHASE_BITS)
>
> And then use SPLITPOINT_PHASE_BITS any place where you're currently
> saying 2.  The reference to 3 is really SPLITPOINT_PHASE_BITS + 1.

-- Taken modified same in the latest patch.

> - Many of these macros are only used in one place.  Maybe just move
> the computation to that place and get rid of the macro.  For example,
> _hash_spareindex() could be written like this:
>
> if (splitpoint_group < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)
> return splitpoint_group;
>
> /* account for single-phase groups */
> splitpoint = SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE;
>
> /* account for completed groups */
> splitpoint += (splitpoint_group -
> SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) << SPLITPOINT_PHASE_BITS;
>
> /* account for phases within current group */
> splitpoint += (bucket_num >> (SPLITPOINT_PHASE_BITS + 1)) &
> SPLITPOINT_PHASE_MASK;
>
> return splitpoint;
>
> That eliminates the only use of two complicated macros and is in my
> opinion more clear than what you've currently got.

-- Taken, also rewrote _hash_get_totalbuckets in similar lines.

With that, we will end up with only 2 macros which have some computing code
+/* defines max number of splitpoit phases a hash index can have */
+#define HASH_MAX_SPLITPOINT_GROUP 32
+#define HASH_MAX_SPLITPOINTS \
+ (((HASH_MAX_SPLITPOINT_GROUP - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) * \
+  HASH_SPLITPOINT_PHASES_PER_GRP) + \
+ HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE)
+
+/* given a splitpoint phase get its group */
+#define HASH_SPLITPOINT_PHASE_TO_SPLITPOINT_GRP(sp_phase) \
+ (((sp_phase) < HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) ? \
+ (sp_phase) : \
+ ((((sp_phase) - HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE) >> \
+   HASH_SPLITPOINT_PHASE_BITS) + \
+  HASH_SPLITPOINT_GROUPS_WITH_ONE_PHASE))

> - Some of these macros lack clear comments explaining their purpose.

-- I have written some comments to explain the use of the macros.

> - Some of them don't include HASH anywhere in the name, which is
> essential for a header that may easily be included by non-hash index
> code.

-- Fixed, all MACROS are prefixed with HASH

> - The names don't all follow a consistent format.  Maybe that's too
> much to hope for at some level, but I think they could be more
> consistent than they are.

-- Fixed, apart from old HASH_MAX_SPLITPOINTS rest all have a prefix
HASH_SPLITPOINT.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


yet_another_expand_hashbucket_efficiently_12.patch
Description: Binary data



Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-30 Thread Mithun Cy
On Thu, Mar 30, 2017 at 7:23 PM, Robert Haas  wrote:
> I think approach B is incorrect.  Suppose we have 1536 buckets and
> hash values 2048, 2049, 4096, 4097, 6144, 6145, 8192, and 8193.  If I
> understand correctly, each of these values should be mapped either to
> bucket 0 or to bucket 1, and the goal of the sort is to put all of the
> bucket 0 tuples before all of the bucket 1 tuples, so that we get
> physical locality when inserting.  With approach A, the sort keys will
> match the bucket numbers -- we'll be sorting the list 0, 1, 0, 1, 0,
> 1, 0, 1 -- and we will end up doing all of the inserts to bucket 0
> before any of the inserts to bucket 1.  With approach B, we'll be
> sorting 512, 513, 1024, 1025, 0, 1, 512, 513 and will end up
> alternating inserts to bucket 0 with inserts to bucket 1.

Oops, sorry -- yes, the 2 denominators are different (one used in the insert
and another used for the sorting keys), so we will end up with different
bucket numbers. In patch B I should have taken the next power of 2 above 1536
as the denominator and computed the mod value; if the mod value falls outside
the 1536 buckets, halve the denominator and take the mod again to get a bucket
within 1536 -- which is effectively what patch A is doing. Approach B is a
blunder, and I apologize for that mistake. I think patch A should be
considered. If adding the members to struct Tuplesortstate is a concern, I
will rewrite patch B as described above.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-29 Thread Mithun Cy
Thanks, Amit for a detailed review.

On Wed, Mar 29, 2017 at 4:09 PM, Amit Kapila  wrote:
> Okay, your current patch looks good to me apart from minor comments,
> so marked as Read For Committer.  Please either merge the
> sort_hash_b_2.patch with main patch or submit it along with next
> revision for easier reference.

I will keep it separate just in case the committer prefers
sortbuild_hash_A.patch. We can use either of the sortbuild_hash_*.patch files
on top of the main patch.

>
> Few minor comments:
> 1.
> +splitpoint phase S.  The hashm_spares[0] is always 0, so that buckets 0 and 1
> +(which belong to splitpoint group 0's phase 1 and phase 2 respectively) 
> always
> +appear at block numbers 1 and 2, just after the meta page.
>
> This explanation doesn't seem to be right as with current patch we
> start phased allocation only after splitpoint group 9.

Again a mistake, removed the sentence in parentheses.

> 2.
> -#define HASH_MAX_SPLITPOINTS 32
>  #define HASH_MAX_BITMAPS 128
>
> +#define SPLITPOINT_PHASES_PER_GRP 4
> +#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
> +#define SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE 10
> +#define HASH_MAX_SPLITPOINTS \
> + ((32 - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) * \
> + SPLITPOINT_PHASES_PER_GRP) + \
> + SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE
>
> You have changed the value of HASH_MAX_SPLITPOINTS, but the comments
> explaining that value are still unchanged.  Refer below text.
> "The limitation on the size of spares[] comes from the fact that there's
>  * no point in having more than 2^32 buckets with only uint32 hashcodes."

The limitation is still indirectly imposed by the fact that we can have only
2^32 buckets. But I also added a note that HASH_MAX_SPLITPOINTS takes into
account that, after SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE, bucket allocation
is done in multiple (exactly 4) phases.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


yet_another_expand_hashbucket_efficiently_11.patch
Description: Binary data


sortbuild_hash_B_2.patch
Description: Binary data


sortbuild_hash_A.patch
Description: Binary data



Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-29 Thread Mithun Cy

On Mon, Mar 27, 2017 at 11:56 PM, Jesper Pedersen wrote:

> I ran some performance scenarios on the patch to see if the increased
> 'spares' allocation had an impact. I havn't found any regressions in that
> regard.

Thanks, Jesper, for testing the patch.

> Attached patch contains some small fixes, mainly to the documentation - on
> top of v7.

I have taken some of the grammatical and spell-check fixes you mentioned. One
major thing I left as it is: the term "splitpoint", which you tried to change
to "split point" in many places. The term splitpoint was not introduced by me
-- it was already used in many places -- so I think it is acceptable to keep
it. I would rather not include changes that are not part of the core issue;
another patch on top of this should be okay for those.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com




Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-29 Thread Mithun Cy
On Wed, Mar 29, 2017 at 10:12 AM, Amit Kapila  wrote:

>> I wonder if we should consider increasing
>> SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE somewhat.  For example, split
>> point 4 is responsible for allocating only 16 new buckets = 128kB;
>> doing those in four groups of two (16kB) seems fairly pointless.
>> Suppose we start applying this technique beginning around splitpoint 9
>> or 10.  Breaking 1024 new buckets * 8kB = 8MB of index growth into 4
>> phases might save enough to be worthwhile.
>>
>
> 10 sounds better point to start allocating in phases.

+1. At splitpoint group 10 we will allocate (2 ^ 9) buckets = 4MB in
total and each phase will allocate 2 ^ 7 buckets = 128 * 8kB = 1MB.

> Few other comments:
> +/*
> + * This is just a trick to save a division operation. If you look into the
> + * bitmap of 0-based bucket_num 2nd and 3rd most significant bit will 
> indicate
> + * which phase of allocation the bucket_num belongs to with in the group. 
> This
> + * is because at every splitpoint group we allocate (2 ^ x) buckets and we 
> have
> + * divided the allocation process into 4 equal phases. This macro returns 
> value
> + * from 0 to 3.
> + */
>
> +#define SPLITPOINT_PHASES_WITHIN_GROUP(sp_g, bucket_num) \
> + (((bucket_num) >> (sp_g - SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE)) & \
> + SPLITPOINT_PHASE_MASK)
> This won't work if we change SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE to
> number other than 3.  I think you should change it so that it can work
> with any value of  SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE.

Fixed; using SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE there was accidental. All I
need is the most significant 3 bits, so it should always be subtracted by 3.

> I think you should name this define as SPLITPOINT_PHASE_WITHIN_GROUP
> as this refers to only one particular phase within group.

Fixed.

Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


yet_another_expand_hashbucket_efficiently_10.patch
Description: Binary data



Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-28 Thread Mithun Cy
On Tue, Mar 28, 2017 at 12:21 PM, Amit Kapila  wrote:
> I think both way it can work.  I feel there is no hard pressed need
> that we should make the computation in sorting same as what you do
> _hash_doinsert. In patch_B,  I don't think new naming for variables is
> good.
>
>   Assert(!a->isnull1);
> - hash1 = DatumGetUInt32(a->datum1) & state->hash_mask;
> + bucket1 = DatumGetUInt32(a->datum1) % state->num_buckets;
>
> Can we use hash_mod instead of num_buckets and retain hash1 in above
> code and similar other places?

Yes, done; renamed it to hash_mod.

>
> Few comments:
> 1.
> +#define BUCKETS_WITHIN_SP_GRP(sp_g, nphase) \
> + ((sp_g < SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE) ? \
> + (1 << (sp_g - 1)) : \
> + ((nphase) * ((1 << (sp_g - 1)) >> 2)))
>
> This will go wrong for split point group zero.  In general, I feel if
> you handle computation for split groups lesser than
> SPLITPOINT_GROUPS_WITH_ONLY_ONE_PHASE in the caller, then all your
> macros will become much simpler and less error prone.

Fixed; apart from SPLITPOINT_PHASE_TO_SPLITPOINT_GRP, all the remaining macros
only handle multi-phase groups. SPLITPOINT_PHASE_TO_SPLITPOINT_GRP is used in
one more place, in the index-expansion code, so I thought keeping it as it is
was better.
.
> 2.
> +#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) (_hash_log2(num_bucket))
>
> What is the use of such a define, can't we directly use _hash_log2 in
> the caller?

No real reason, just that a named macro appeared more readable than a bare
_hash_log2. I have now removed it.

> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


yet_another_expand_hashbucket_efficiently_09.patch
Description: Binary data


sortbuild_hash_B_2.patch
Description: Binary data



Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-27 Thread Mithun Cy
On Mon, Mar 27, 2017 at 11:21 AM, Amit Kapila  wrote:

> I think we can't change the number of buckets to be created or lowmask
> and highmask calculation here without modifying _h_spoolinit() because
> it sorts the data to be inserted based on hashkey which in turn
> depends on the number of buckets that we are going to create during
> create index operation.  We either need to allow create index
> operation to still always create buckets in power-of-two fashion or we
> need to update _h_spoolinit according to new computation.  One minor
> drawback of using power-of-two scheme for creation of buckets during
> create index is that it can lead to wastage of space and will be
> inconsistent with what the patch does during split operation.

Yes, this was a miss. Now the number of buckets allocated during metapage init
is not always a power of two, so hashbuild using just the hash_mask to decide
which bucket a hashkey belongs to is not sufficient: it can give buckets
beyond max_buckets, and the sorting of index values based on their buckets
will then be out of order. When we actually insert them into the hash index
we lose the advantage of the spatial locality which existed before, and hence
index-build performance can drop.

As you said, we can solve it if we always allocate buckets in powers of two
when we do the hash index metapage init, and only on later occasions, when we
try to double the existing buckets, do the allocation in 4 equal phases.

But I think there are 2 more ways to solve the same problem:

A. Why not pass all 3 parameters (high_mask, low_mask, max_buckets) to
tuplesort and let it use _hash_hashkey2bucket to figure out which key belongs
to which bucket, and then sort on that. I think this way we make sorting and
insertion into the hash index consistent with each other (see the sketch of
that function after option B below).

B. In tuplesort we can use bucket = hash_key % num_buckets instead of the
existing computation, which does a bitwise "and" to determine the bucket of a
hash key. This way we will not wrongly assign buckets beyond max_buckets, and
the sorted hash keys will be in sync with the actual insertion order of
_hash_doinsert.
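For reference, the existing mapping that option A would hand over to tuplesort
is roughly this (from hashutil.c):

Bucket
_hash_hashkey2bucket(uint32 hashkey, uint32 maxbucket,
					 uint32 highmask, uint32 lowmask)
{
	Bucket		bucket;

	bucket = hashkey & highmask;
	if (bucket > maxbucket)
		bucket = bucket & lowmask;

	return bucket;
}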


I am adding both the patches Patch_A and Patch_B. My preference is
Patch_A and I am open for suggestion.

>+#define SPLITPOINT_PHASES_PER_GRP 4
>+#define SPLITPOINT_PHASE_MASK (SPLITPOINT_PHASES_PER_GRP - 1)
>+#define Buckets_First_Split_Group 4
Fixed.

>In the above computation +2 and -2 still bothers me.  I think you need
>to do this because you have defined split group zero to have four
>buckets, how about if you don't force that and rather define to have
>split point phases only from split point which has four or more
>buckets?

Okay, as suggested, instead of group zero having 4 phases of 1 bucket each, I
have recalculated the spare mapping as below.

Allocating huge chunks of bucket pages all at once isn't optimal, and we would
take ages to consume them. To avoid this exponential growth of index size, we
use a trick to break up the allocation of buckets at a splitpoint into 4 equal
phases. If (2 ^ x) is the total number of buckets to be allocated at a
splitpoint (from now on we shall call this a splitpoint group), then we
allocate 1/4th (2 ^ (x - 2)) of the total buckets at each phase of the
splitpoint group. The next quarter of the allocation happens only once the
buckets of the previous phase have been consumed. Since for bucket counts < 4
we cannot divide the allocation further into multiple phases, the first 3
groups have only one phase of allocation: groups 0, 1, 2 allocate 1, 1, 2
buckets respectively, at once, in one phase. For groups > 2, where we allocate
4 or more buckets, the allocation process is distributed among four equal
phases. At group 3 we allocate 4 buckets in 4 phases {1, 1, 1, 1}, where the
numbers in curly braces indicate the number of buckets allocated within each
phase of splitpoint group 3. For splitpoint groups 4 and 5 the allocation
phases are {2, 2, 2, 2} (16 buckets in total after group 4) and {4, 4, 4, 4}
(32 buckets in total after group 5). We can see that at each splitpoint group
we double the total number of buckets from the previous group, but
incrementally, phase by phase. The bucket pages allocated within one phase of
a splitpoint group appear consecutively in the index.
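As a sanity check on those numbers, a tiny helper of my own (illustrative
only, not part of the patch) that reproduces the per-phase allocation sizes
described above:

static uint32
buckets_added_per_phase(uint32 splitpoint_group)
{
	if (splitpoint_group == 0)
		return 1;							/* group 0: {1} */
	if (splitpoint_group <= 2)
		return 1 << (splitpoint_group - 1);	/* groups 1, 2: {1}, {2} */

	/* groups >= 3: a quarter of the group's doubling per phase */
	return (1 << (splitpoint_group - 1)) >> 2;	/* {1,1,1,1}, {2,2,2,2}, ... */
}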


The sortbuild_hash_*.patch can be applied independently on any of
expand_hashbucket_efficiently_08.patch
-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


sortbuild_hash_B.patch
Description: Binary data


sortbuild_hash_A.patch
Description: Binary data


yet_another_expand_hashbucket_efficiently_08.patch
Description: Binary data


expand_hashbucket_efficiently_08.patch
Description: Binary data



Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-25 Thread Mithun Cy
Thanks, Amit for the review.
On Sat, Mar 25, 2017 at 7:03 PM, Amit Kapila  wrote:
>
> I think one-dimensional patch has fewer places to touch, so that looks
> better to me.  However, I think there is still hard coding and
> assumptions in code which we should try to improve.

Great! I will continue with the 1-dimensional spares improvement.

>
> 1.
> + /*
> + * The first 4 bucket belongs to first splitpoint group 0. And since group
> + * 0 have 4 = 2^2 buckets, we double them in group 1. So total buckets
> + * after group 1 is 8 = 2^3. Then again at group 2 we add another 2^3
> + * buckets to double the total number of buckets, which will become 2^4. I
> + * think by this time we can see a pattern which say if num_bucket > 4
> + * splitpoint group = log2(num_bucket) - 2
> + */
>
> + if (num_bucket <= 4)
> + splitpoint_group = 0; /* converted to base 0. */
> + else
> + splitpoint_group = _hash_log2(num_bucket) - 2;
>
> This patch defines split point group zero has four buckets and based
> on that above calculation is done.  I feel you can define it like
> #define Buckets_First_Split_Group 4 and then use it in above code.
> Also, in else part number 2 looks awkward, can we define it as
> log2_buckets_first_group = _hash_log2(Buckets_First_Split_Group) or
> some thing like that.  I think that way code will look neat.  I don't
> like the way above comment is worded even though it is helpful to
> understand the calculation.  If you want, then you can add such a
> comment in file header, here it looks out of place.

I have removed the comments and defined a new macro which maps a
bucket to its splitpoint group:

#define BUCKET_TO_SPLITPOINT_GRP(num_bucket) \
((num_bucket <= Buckets_First_Split_Group) ? 0 : \
(_hash_log2(num_bucket) - 2))

I could not find a better way to explain the minus 2 than: "The
splitpoint group of a given bucket can be taken as (_hash_log2(bucket)
- 2); we subtract 2 because each group has 2 ^ (x + 2) buckets." I have
now added this to the existing comments, which I think should make it a
little better.

Adding comments about the spares array in hashutils.c's file header
did not seem right to me; I think the README already has some details
about this.

> 2.
> +power-of-2 groups, called "split points" in the code.  That means on every 
> new
> +split points we double the existing number of buckets.
>
> split points/split point

Fixed.

> 3.
> + * which phase of allocation the bucket_num belogs to with in the group.
>
> /belogs/belongs

Fixed

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


expand_hashbucket_efficiently_07.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-03-25 Thread Mithun Cy
On Sat, Mar 25, 2017 at 11:46 PM, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:
> On 3/13/17 09:15, Mithun Cy wrote:
>> A. launch_autoprewarm_dump() RETURNS int4
>> This is a SQL callable function to launch the autoprewarm worker to
>> dump the buffer pool information at regular interval. In a server, we
>> can only run one autoprewarm worker so if a worker sees another
>> existing worker it will exit immediately. The return value is pid of
>> the worker which has been launched.
>
> Why do you need that?

To launch an autoprewarm worker we have to preload the library, which
needs a server restart. If we want to start periodic dumping on an
already-running server, so that it can automatically prewarm on its
next restart, this API can be used to launch the autoprewarm worker.


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-24 Thread Mithun Cy
On Tue, Mar 21, 2017 at 8:16 AM, Amit Kapila  wrote:
> Sure, I was telling you based on that.  If you are implicitly treating
> it as 2-dimensional array, it might be easier to compute the array
>offsets.

I think calculating the spares offset will not become much simpler
either way; we still need to calculate the split group and split phase
separately. I have added a patch to show what 2-dimensional spares code
looks like and where we would need changes.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


expand_hashbucket_efficiently_06_spares_2dimesion.patch
Description: Binary data


expand_hashbucket_efficiently_06_spares_1dimension.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-24 Thread Mithun Cy
On Fri, Mar 24, 2017 at 1:22 AM, Mithun Cy <mithun...@enterprisedb.com> wrote:
> Hi Amit please find the new patch

pageinspect.sgml has an example which shows the output of
"hash_metapage_info()". Since we increase the spares array and
eventually ovflpoint, I have updated the example with the corresponding
values.


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


expand_hashbucket_efficiently_05.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-23 Thread Mithun Cy
Hi Amit, please find the new patch.

On Tue, Mar 21, 2017 at 8:16 AM, Amit Kapila  wrote:
> It is not only about above calculation, but also what the patch is
> doing in function _hash_get_tbuckets().  By the way function name also
> seems unclear (mainly *tbuckets* in the name).

Fixed. I have introduced some macros for readability and added more
comments to explain why some calculations are made. Please let me know
if you think more improvements are needed.

>spelling.
>/forth/fourth
>/at time/at a time
Thanks fixed.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


expand_hashbucket_efficiently_04.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-23 Thread Mithun Cy
Hi Pavan,
On Thu, Mar 23, 2017 at 12:19 AM, Pavan Deolasee
 wrote:
> Ok, no problem. I did some tests on AWS i2.xlarge instance (4 vCPU, 30GB
> RAM, attached SSD) and results are shown below. But I think it is important
> to get independent validation from your side too, just to ensure I am not
> making any mistake in measurement. I've attached naively put together
> scripts which I used to run the benchmark. If you find them useful, please
> adjust the paths and run on your machine.

I did a similar test; the results are attached. Your v19 looks fine to
me, it does not cause any regression. On the other hand, I also ran
tests reducing the table fillfactor to 80; there I can see a small
regression of 2-3% on average when updating col2, and on updating col9
again I do not see any regression.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


WARM_test_02.ods
Description: application/vnd.oasis.opendocument.spreadsheet

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-22 Thread Mithun Cy
On Thu, Mar 23, 2017 at 12:19 AM, Pavan Deolasee
<pavan.deola...@gmail.com> wrote:
>
>
> On Wed, Mar 22, 2017 at 4:53 PM, Mithun Cy <mithun...@enterprisedb.com>
> wrote:
> Ok, no problem. I did some tests on AWS i2.xlarge instance (4 vCPU, 30GB
> RAM, attached SSD) and results are shown below. But I think it is important
> to get independent validation from your side too, just to ensure I am not
> making any mistake in measurement. I've attached naively put together
> scripts which I used to run the benchmark. If you find them useful, please
> adjust the paths and run on your machine.

Looking at your postgresql.conf, JFYI, I have synchronous_commit = off
but the same is on in your run (for logged tables); the rest remains
the same. Once I get the machine, probably tomorrow morning, I will run
the same tests on v19.
-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-22 Thread Mithun Cy
On Wed, Mar 22, 2017 at 8:43 AM, Pavan Deolasee
 wrote:
> Sorry, I did not mean to suggest that you set it up wrongly, I was just
> trying to point out that the test case itself may not be very practical.
That is cool, no problem! I was just trying to explain why those tests
were made, in case others wondered about it.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Possible regression with gather merge.

2017-03-22 Thread Mithun Cy
On Wed, Mar 22, 2017 at 1:09 PM, Rushabh Lathia
 wrote:
> In the code, gather merge consider the limit for the sort into
> create_ordered_paths,
> which is wrong. Into create_ordered_paths(), GM should not consider the
> limit
> while doing costing for the sort node.
>
> Attached patch fix the bug.
Thanks, Rushabh. Now I see a non-parallel plan being picked by default.

create table test as (select id, (random()*1)::int as v1, random() as
v2 from generate_series(1,1000) id);

postgres=# explain analyze select * from test order by v1, v2 limit 10;
  QUERY PLAN
---
 Limit  (cost=370146.98..370147.00 rows=10 width=16) (actual
time=1169.685..1169.686 rows=10 loops=1)
   ->  Sort  (cost=370146.98..395146.63 rows=860 width=16) (actual
time=1169.683..1169.684 rows=10 loops=1)
 Sort Key: v1, v2
 Sort Method: top-N heapsort  Memory: 25kB
 ->  Seq Scan on test  (cost=0.00..154053.60 rows=860
width=16) (actual time=0.016..590.176 rows=1000 loops=1)
 Planning time: 0.070 ms
 Execution time: 1169.706 ms
(7 rows)


Another accidental find: setting max_parallel_workers_per_gather to a
value higher than the default results in more parallel workers, but the
query runs slower than the plan with fewer workers.

postgres=# set max_parallel_workers_per_gather = default;
SET
postgres=# explain analyze select * from test order by v1, v2 limit 1000;
 QUERY
PLAN

 Limit  (cost=697263.86..1669540.28 rows=8333216 width=16) (actual
time=2212.437..8207.161 rows=1000 loops=1)
   ->  Gather Merge  (cost=697263.86..1669540.28 rows=8333216
width=16) (actual time=2212.436..7600.478 rows=1000 loops=1)
 Workers Planned: 2
 Workers Launched: 2
 ->  Sort  (cost=696263.84..706680.36 rows=4166608 width=16)
(actual time=2173.756..3105.512 rows=333 loops=3)
   Sort Key: v1, v2
   Sort Method: external merge  Disk: 86648kB
   ->  Parallel Seq Scan on test  (cost=0.00..95721.08
rows=4166608 width=16) (actual time=0.030..240.486 rows=333
loops=3)
 Planning time: 0.096 ms
 Execution time: 8537.214 ms
(10 rows)

postgres=# set max_parallel_workers_per_gather = 10;
SET
postgres=# explain analyze select * from test order by v1, v2 limit 1000;
 QUERY
PLAN

 Limit  (cost=431168.44..1628498.09 rows=860 width=16) (actual
time=1525.120..13273.805 rows=1000 loops=1)
   ->  Gather Merge  (cost=431168.44..1628498.09 rows=860
width=16) (actual time=1525.119..12650.621 rows=1000 loops=1)
 Workers Planned: 4
 Workers Launched: 4
 ->  Sort  (cost=430168.39..436418.30 rows=2499965 width=16)
(actual time=1472.799..2133.571 rows=200 loops=5)
   Sort Key: v1, v2
   Sort Method: external merge  Disk: 50336kB
   ->  Parallel Seq Scan on test  (cost=0.00..79054.65
rows=2499965 width=16) (actual time=0.047..201.405 rows=200
loops=5)
 Planning time: 0.077 ms
 Execution time: 13622.319 ms
(10 rows)



-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-22 Thread Mithun Cy
On Wed, Mar 22, 2017 at 3:44 PM, Pavan Deolasee
 wrote:
>
> This looks quite weird to me. Obviously these numbers are completely
> non-comparable. Even the time for VACUUM FULL goes up with every run.
>
> May be we can blame it on AWS instance completely, but the pattern in your
> tests looks very similar where the number slowly and steadily keeps going
> up. If you do complete retest but run v18/v19 first and then run master, may
> be we'll see a complete opposite picture?
>

For those tests I ran them in the order ---  both times the numbers
were the same. One thing I did differently was that I deleted the data
directory between tests and created the database from scratch.
Unfortunately, the machine I tested this on is not available. I will
test the same with v19 once I get the machine and report back to you.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Possible regression with gather merge.

2017-03-22 Thread Mithun Cy
Adding more rows to the table makes gather merge execution time very
slow when compared to the non-parallel plan we get after disabling
gather merge.

create table test as (select id, (random()*1)::int as v1, random() as
v2 from generate_series(1,1) id);

postgres=# set max_parallel_workers_per_gather = default;
SET
postgres=# explain analyze select * from test order by v1, v2 limit 10;

QUERY PLAN

 Limit  (cost=1858610.53..1858611.70 rows=10 width=16) (actual
time=31103.880..31103.885 rows=10 loops=1)
   ->  Gather Merge  (cost=1858610.53..11581520.05 rows=8406
width=16) (actual time=31103.878..31103.882 rows=10 loops=1)
 Workers Planned: 2
 Workers Launched: 2
 ->  Sort  (cost=1857610.50..1961777.26 rows=41666703
width=16) (actual time=30560.865..30561.046 rows=911 loops=3)
   Sort Key: v1, v2
   Sort Method: external merge  Disk: 841584kB
   ->  Parallel Seq Scan on test  (cost=0.00..957208.03
rows=41666703 width=16) (actual time=0.050..2330.275 rows=
loops=3)
 Planning time: 0.292 ms
 Execution time: 31502.896 ms
(10 rows)

postgres=# set max_parallel_workers_per_gather = 0;
SET
postgres=# explain analyze select * from test order by v1, v2 limit 10;
 QUERY
PLAN

 Limit  (cost=3701507.83..3701507.85 rows=10 width=16) (actual
time=13231.264..13231.266 rows=10 loops=1)
   ->  Sort  (cost=3701507.83..3951508.05 rows=10088 width=16)
(actual time=13231.261..13231.262 rows=10 loops=1)
 Sort Key: v1, v2
 Sort Method: top-N heapsort  Memory: 25kB
 ->  Seq Scan on test  (cost=0.00..1540541.88 rows=10088
width=16) (actual time=0.045..6759.363 rows=1 loops=1)
 Planning time: 0.131 ms
 Execution time: 13231.299 ms
(7 rows)

On Wed, Mar 22, 2017 at 11:07 AM, Mithun Cy <mithun...@enterprisedb.com> wrote:
> I accidently encountered a case where gather merge was picked as
> default but disabling same by setting max_parallel_workers_per_gather
> = 0; produced a non-parallel plan which was faster than gather merge,
> but its cost is marked too high when compared to gather merge.
>
> I guess we need some cost adjustment is planner code.
>
> Test setting
> =
> create table test as (select id, (random()*1)::int as v1, random() as
> v2 from generate_series(1,100) id);
> create index test_v1_idx on test (v1);
>
>
> Server setting is default.
>
>
> postgres=# explain analyze select * from test order by v1, v2 limit 10;
>QUERY
> PLAN
> 
>  Limit  (cost=19576.71..19577.88 rows=10 width=16) (actual
> time=265.989..265.995 rows=10 loops=1)
>->  Gather Merge  (cost=19576.71..116805.80 rows=84 width=16)
> (actual time=265.987..265.992 rows=10 loops=1)
>  Workers Planned: 2
>  Workers Launched: 2
>  ->  Sort  (cost=18576.69..19618.36 rows=416667 width=16)
> (actual time=250.202..250.424 rows=911 loops=3)
>Sort Key: v1, v2
>Sort Method: external merge  Disk: 9272kB
>->  Parallel Seq Scan on test  (cost=0.00..9572.67
> rows=416667 width=16) (actual time=0.053..41.397 rows=33 loops=3)
>  Planning time: 0.193 ms
>  Execution time: 271.222 ms
>
> postgres=# set max_parallel_workers_per_gather = 0;
> SET
> postgres=# explain analyze select * from test order by v1, v2 limit 10;
>  QUERY PLAN
> -
>  Limit  (cost=37015.64..37015.67 rows=10 width=16) (actual
> time=211.582..211.584 rows=10 loops=1)
>->  Sort  (cost=37015.64..39515.64 rows=100 width=16) (actual
> time=211.581..211.582 rows=10 loops=1)
>  Sort Key: v1, v2
>  Sort Method: top-N heapsort  Memory: 25kB
>  ->  Seq Scan on test  (cost=0.00..15406.00 rows=100
> width=16) (actual time=0.085..107.522 rows=100 loops=1)
>  Planning time: 0.093 ms
>  Execution time: 211.608 ms
> (7 rows)
>
>
>
> --
> Thanks and Regards
> Mithun C Y
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Possible regression with gather merge.

2017-03-21 Thread Mithun Cy
I accidentally encountered a case where gather merge was picked by
default, but disabling it by setting max_parallel_workers_per_gather
= 0 produced a non-parallel plan which was faster than gather merge,
even though its cost is marked much higher than gather merge's.

I guess we need some cost adjustment in the planner code.

Test setting
=
create table test as (select id, (random()*1)::int as v1, random() as
v2 from generate_series(1,100) id);
create index test_v1_idx on test (v1);


Server setting is default.


postgres=# explain analyze select * from test order by v1, v2 limit 10;
   QUERY
PLAN

 Limit  (cost=19576.71..19577.88 rows=10 width=16) (actual
time=265.989..265.995 rows=10 loops=1)
   ->  Gather Merge  (cost=19576.71..116805.80 rows=84 width=16)
(actual time=265.987..265.992 rows=10 loops=1)
 Workers Planned: 2
 Workers Launched: 2
 ->  Sort  (cost=18576.69..19618.36 rows=416667 width=16)
(actual time=250.202..250.424 rows=911 loops=3)
   Sort Key: v1, v2
   Sort Method: external merge  Disk: 9272kB
   ->  Parallel Seq Scan on test  (cost=0.00..9572.67
rows=416667 width=16) (actual time=0.053..41.397 rows=33 loops=3)
 Planning time: 0.193 ms
 Execution time: 271.222 ms

postgres=# set max_parallel_workers_per_gather = 0;
SET
postgres=# explain analyze select * from test order by v1, v2 limit 10;
 QUERY PLAN
-
 Limit  (cost=37015.64..37015.67 rows=10 width=16) (actual
time=211.582..211.584 rows=10 loops=1)
   ->  Sort  (cost=37015.64..39515.64 rows=100 width=16) (actual
time=211.581..211.582 rows=10 loops=1)
 Sort Key: v1, v2
 Sort Method: top-N heapsort  Memory: 25kB
 ->  Seq Scan on test  (cost=0.00..15406.00 rows=100
width=16) (actual time=0.085..107.522 rows=100 loops=1)
 Planning time: 0.093 ms
 Execution time: 211.608 ms
(7 rows)



-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Patch: Write Amplification Reduction Method (WARM)

2017-03-21 Thread Mithun Cy
On Tue, Mar 21, 2017 at 8:10 PM, Robert Haas  wrote:
> If the WAL writing hides the loss, then I agree that's not a big
> concern.  But if the loss is still visible even when WAL is written,
> then I'm not so sure.

The test table schema was taken from the earlier tests that Pavan
posted [1]; hence it is UNLOGGED, and all I tried to do was stress the
tests. Instead of updating 1 row at a time through pgbench (for which
neither I nor Pavan saw any regression), I tried to update all the rows
in a single statement. I have changed the settings as recommended and
did a quick test as above on our machine by removing the UNLOGGED
keyword in the create table statement.

Patch Tested : Only 0001_interesting_attrs_v18.patch in [2]

Machine: Scylla [Last time I did the same tests on IBM power2 but it
is not immediately available, so I am trying on another Intel-based
performance machine.]

[mithun.cy@scylla bin]$ lscpu
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):56
On-line CPU(s) list:   0-55
Thread(s) per core:2
Core(s) per socket:14
Socket(s): 2
NUMA node(s):  2
Vendor ID: GenuineIntel
CPU family:6
Model: 63
Model name:Intel(R) Xeon(R) CPU E5-2695 v3 @ 2.30GHz
Stepping:  2
CPU MHz:   1235.800
BogoMIPS:  4594.35
Virtualization:VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache:  256K
L3 cache:  35840K
NUMA node0 CPU(s): 0-13,28-41
NUMA node1 CPU(s): 14-27,42-55

[mithun.cy@scylla bin]$ cat /proc/meminfo
MemTotal:   65687464 kB


Postgresql.conf non default settings
===
shared_buffers  = 24 GB
max_wal_size = 10GB
min_wal_size = 5GB
synchronous_commit=off
autovacuum = off  /* manually doing vacuum full before every update */

This system has 2 storage devices; I have kept the data directory on a
spinning disk and pg_wal on an SSD.

Tests :

DROP TABLE IF EXISTS testtab;

CREATE TABLE testtab (
    col1 integer,
    col2 text,
    col3 float,
    col4 text,
    col5 text,
    col6 char(30),
    col7 text,
    col8 date,
    col9 text,
    col10 text
);

INSERT INTO testtab
SELECT generate_series(1,1000),
       md5(random()::text),
       random(),
       md5(random()::text),
       md5(random()::text),
       md5(random()::text)::char(30),
       md5(random()::text),
       now(),
       md5(random()::text),
       md5(random()::text);

CREATE INDEX testindx ON testtab (col1, col2, col3, col4, col5, col6,
col7, col8, col9);
Performance measurement tests: ran 12 times to eliminate run-to-run
latencies.
==
VACUUM FULL;
BEGIN;
UPDATE testtab SET col2 = md5(random()::text);
ROLLBACK;

The recorded response times show an increase in response time of 10%
to 25% after the patch.


[1] Re: rewrite HeapSatisfiesHOTAndKey

[2] Re: Patch: Write Amplification Reduction Method (WARM)

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


WARM_test.ods
Description: application/vnd.oasis.opendocument.spreadsheet

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-20 Thread Mithun Cy
Hi Amit, Thanks for the review,

On Mon, Mar 20, 2017 at 5:17 PM, Amit Kapila  wrote:
> idea could be to make hashm_spares a two-dimensional array
> hashm_spares[32][4] where the first dimension will indicate the split
> point and second will indicate the sub-split number.  I am not sure
> whether it will be simpler or complex than the method used in the
> proposed patch, but I think we should think a bit more to see if we
> can come up with some simple technique to solve this problem.

I think making it a 2-dimensional array will not be very useful; in
fact, we already treat the given array's elements as 2-dimensional.
Your main concern, I think, is the calculation steps needed to find the
phase of the splitpoint group the bucket belongs to.
+ tbuckets = (1 << (splitpoint_group + 2));
+ phases_beyond_bucket =
+ (tbuckets - num_bucket) / (1 << (splitpoint_group - 1));
+ return (((splitpoint_group + 1) << 2) - phases_beyond_bucket) - 1;

Thinking about it further: we allocate 2^x buckets in each phase of
any splitpoint group, and we have 4 such phases in each group; at each
splitpoint group we again allocate 2^y buckets in total. So bucket
numbers have a pattern in their bitmap representation from which we can
find out which phase of allocation they belong to.


As below
===
Group 0 -- bits 0, 1 define which phase of the group each bucket belongs to.
0 -- 0000
1 -- 0001
2 -- 0010
3 -- 0011
===
Group 1 -- bits 0, 1 define which phase of the group each bucket belongs to.
4 -- 0100
5 -- 0101
6 -- 0110
7 -- 0111
===
Group 2 -- bits 1, 2 define which phase of the group each bucket belongs to.
8 -- 1000
9 -- 1001
10 -- 1010
11 -- 1011
12 -- 1100
13 -- 1101
14 -- 1110
15 -- 1111
===
Group 3 -- bits 2, 3 define which phase of the group each bucket belongs to.
16 -- 00010000
17 -- 00010001
18 -- 00010010
19 -- 00010011
20 -- 00010100
21 -- 00010101
22 -- 00010110
23 -- 00010111
24 -- 00011000
25 -- 00011001
26 -- 00011010
27 -- 00011011
28 -- 00011100
29 -- 00011101
30 -- 00011110
31 -- 00011111


So we can say that, given a bucket x of group n > 0, bits (n-1, n)
define which phase of the group it belongs to. I see an opportunity
here to completely simplify the above calculation into a simple bitwise
AND of the (shifted) bucket number with a 2-bit mask to get the
allocation phase of bucket x.

Formula can be (x >> (splitpoint_group - 1)) & 0x3 = phase of bucket
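A minimal C sketch of that formula, checked against the bit patterns
listed above (my own illustration, not code from the patch):

#include <stdio.h>

/* Phase (0..3) of bucket x within splitpoint group n (n > 0): shift
 * away the low bits that vary inside a phase and keep the next two. */
static unsigned int
bucket_phase(unsigned int x, unsigned int splitpoint_group)
{
    return (x >> (splitpoint_group - 1)) & 0x3;
}

int
main(void)
{
    /* group 2: buckets 8,9 -> 0; 10,11 -> 1; 12,13 -> 2; 14,15 -> 3 */
    printf("%u %u\n", bucket_phase(8, 2), bucket_phase(14, 2));
    /* group 3: buckets 16..19 -> 0; 20..23 -> 1; 28..31 -> 3 */
    printf("%u %u\n", bucket_phase(20, 3), bucket_phase(31, 3));
    return 0;
}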

Does this satisfy your concern?

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [PATCH] Incremental sort (was: PoC: Partial sort)

2017-03-20 Thread Mithun Cy
On Mon, Feb 27, 2017 at 8:29 PM, Alexander Korotkov
 wrote:

This patch needs to be rebased.

1. It fails while applying as below

patching file src/test/regress/expected/sysviews.out
Hunk #1 FAILED at 70.
1 out of 1 hunk FAILED -- saving rejects to file
src/test/regress/expected/sysviews.out.rej
patching file src/test/regress/sql/inherit.sql

2. Also, there are compilation errors due to new commits.

-fwrapv -fexcess-precision=standard -O2 -I../../../../src/include
-D_GNU_SOURCE   -c -o createplan.o createplan.c
createplan.c: In function ‘create_gather_merge_plan’:
createplan.c:1510:11: warning: passing argument 3 of ‘make_sort’ makes
integer from pointer without a cast [enabled by default]
   gm_plan->nullsFirst);
   ^
createplan.c:235:14: note: expected ‘int’ but argument is of type ‘AttrNumber *’
 static Sort *make_sort(Plan *lefttree, int numCols, int skipCols,
  ^
createplan.c:1510:11: warning: passing argument 4 of ‘make_sort’ from
incompatible pointer type [enabled by default]
   gm_plan->nullsFirst);

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-03-18 Thread Mithun Cy
On Thu, Mar 16, 2017 at 10:55 PM, David Steele  wrote:
> It does apply with fuzz on 2b32ac2, so it looks like c11453c and
> subsequent commits are the cause.  They represent a fairly substantial
> change to hash indexes by introducing WAL logging so I think you should
> reevaluate your patches to be sure they still function as expected.

Thanks, David, here is the new, improved patch. I have also corrected
pageinspect's test output, and added notes in the README regarding the
new way of adding bucket pages efficiently in hash indexes. I also did
some more pgbench read-only and read-write tests; there is no
performance impact due to the patch, and the growth of the index size
has become much more efficient, matching the numbers posted in the
initial proposal mail.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


expand_hashbucket_efficiently_03.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-03-13 Thread Mithun Cy
>> - The error handling loop around load_block() suggests that you're
>> expecting some reads to fail, which I guess is because you could be
>> trying to read blocks from a relation that's been rewritten under a
>> different relfilenode, or partially or entirely truncated.  But I
>> don't think it's very reasonable to just let ReadBufferWhatever() spew
>> error messages into the log and hope users don't panic.  People will
>> expect an automatic prewarm solution to handle any such cases quietly,
>> not bleat loudly in the log.  I suspect that this error-trapping code
>> isn't entirely correct, but there's not much point in fixing it; what
>> we really need to do is get rid of it (somehow).
>
> [Need Reelook] -- Debug and check if block load fails what will happen.

Oops, sorry, that was a note to myself. It is fixed now.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-03-13 Thread Mithun Cy
Hi all, thanks for the feedback. Based on your recent comments I have
implemented a new patch, which is attached below.

On Sun, Feb 26, 2017 at 10:16 PM, Robert Haas  wrote:
> This is not easy to fix.  The lock has to be taken based on the
> relation OID, not the relfilenode, but we don't have the relation OID
> in the dump file, and recording it there won't help, because the
> relfilenode can change under us if the relation is rewritten with
> CLUSTER or VACUUM FULL or relevant forms of ALTER TABLE.  I don't see
> a solution other than launching a separate worker for each database,
> which seems like it could be extremely expensive if there are many
> databases.  Also, I am pretty sure it's no good to take locks before
> recovery reaches a consistent state.  I'm not sure off-hand whether
> crash-recovery will notice conflicting locks, but even if it does,
> blocking crash recovery in order to prewarm stuff is bad news.

On Mon, Feb 27, 2017 at 7:18 PM, Peter Eisentraut
 wrote:
> You don't have to start all these workers at once.  Starting one and not
> starting the next one until the first one is finished should be fine.
> It will have the same serial behavior that the patch is proposing anyway.

I have now implemented similar logic. The prewarm bgworker will launch
a sub-worker per database in the dump file, and each sub-worker will
load its database's block info. The sub-workers will be launched only
after the previous one has finished. All of this will only start once
the database has reached a consistent state.

I have also introduced 2 more utility functions which were requested earlier.
A. launch_autoprewarm_dump() RETURNS int4
This is a SQL callable function to launch the autoprewarm worker to
dump the buffer pool information at regular interval. In a server, we
can only run one autoprewarm worker so if a worker sees another
existing worker it will exit immediately. The return value is pid of
the worker which has been launched.

B. autoprewarm_dump_now() RETURNS int8
This is a SQL callable function to dump buffer pool information
immediately once by a backend. This can work in parallel with an
autoprewarm worker while it is dumping. The return value is the number
of blocks info dumped.

I need some feedback on efficiently dividing the blocks among
sub-workers. The existing dump format will not be very useful for this.

I have chosen the following format. Each entry of block info is called
a BlockInfoRecord.

The contents of AUTOPREWARM_FILE have been formatted in such a way that
the BlockInfoRecords of each database can be given to a different
prewarm worker.
Format of AUTOPREWARM_FILE
===
[offset position of database map table]
[sorted BlockInfoRecords..]
[database map table]
===

The [database map table] is a sequence of offsets into the file, each
pointing to the first BlockInfoRecord of one database in the dump. The
prewarm worker will read these offsets one by one, in sequence, and ask
its sub-worker to seek to that position and then start loading the
BlockInfoRecords one by one until it sees a BlockInfoRecord of a
database different from the one it is actually connected to. NOTE: we
store off_t inside the file, so the dump file will not be portable
across systems where sizeof(off_t) differs.
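As a rough illustration of that layout and of the per-database reader
loop, here is a hedged sketch; the struct fields, type names and I/O
style are my assumptions, not the patch's actual definitions:

#include <stdio.h>
#include <stdint.h>

/* Hypothetical shape of one dump entry; the real patch's fields may differ. */
typedef struct BlockInfoRecord
{
    uint32_t    database;       /* records are sorted and grouped by this */
    uint32_t    tablespace;
    uint32_t    filenode;
    int32_t     forknum;
    uint32_t    blocknum;
} BlockInfoRecord;

/* One sub-worker: seek to its database's first record and load records
 * until a record of a different database shows up. */
static void
load_database_blocks(FILE *fp, long start_offset, uint32_t my_database)
{
    BlockInfoRecord rec;

    fseek(fp, start_offset, SEEK_SET);
    while (fread(&rec, sizeof(rec), 1, fp) == 1)
    {
        if (rec.database != my_database)
            break;              /* the next database's records begin here */
        /* ... read this block into a free buffer here ... */
    }
}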

I also thought of having one dump file per database. The problems I see
with that are that there could be too many dump files (including stale
files of databases which are no longer in the buffer pool), and there
is a possibility of a dump file getting lost due to concurrent dumps by
the bgworker and the autoprewarm_dump_now() SQL utility. For example:
while implementing this, dump file names were chosen as the number
sequence 0,1,2,3..number_of_db (this helps to avoid stale files being
chosen before active database dump files). If 2 concurrent processes
dump together there will be a race condition: if one page of a new
database is loaded into the buffer pool at that point, a recent dump of
that database's blocks could be overwritten, and it would go missing
from the dumps even though that database is still active in the buffer
pool. Any comments on how to fix this will be very helpful.

On Sun, Feb 26, 2017 at 10:16 PM, Robert Haas  wrote:
> Here are some other review comments (which may not matter unless we
> can think up a solution to the problems above).

> - I think auto_pg_prewarm.c is an unnecessarily long name.  How about
> autoprewarm.c?

Fixed; I am also trying to replace the term "auto pg_prewarm" with
"autoprewarm" in all relevant places.

> - It looks like you ran pgindent over this without adding the new
> typedefs to your typedefs.list, so things like the last line of each
> typedef is formatted incorrectly.

Fixed.

> - ReadBufferForPrewarm isn't a good name for 

Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-02-22 Thread Mithun Cy
Hi all, thanks. I have tried to fix all of the comments given above,
along with some more code cleanups.

On Wed, Feb 22, 2017 at 6:28 AM, Robert Haas  wrote:
> I think it's OK to treat that as something of a corner case.  There's
> nothing to keep you from doing that today: just use pg_buffercache to
> dump a list of blocks on one server, and then pass those blocks to
> pg_prewarm on another server.  The point of this patch, AIUI, is to
> automate a particularly common case of that, which is to dump before
> shutdown and load upon startup.  It doesn't preclude other things that
> people want to do.
>
> I suppose we could have an SQL-callable function that does an
> immediate dump (without starting a worker).  That wouldn't hurt
> anything, and might be useful in a case like the one you mention.

In the latest patch, I have moved things back to the old way: there
will be one bgworker, "auto pg_prewarm", which automatically records
information about blocks which were present in the buffer pool before
server shutdown and then prewarms the buffer pool with those blocks
upon server restart. I have reverted the code which helped us launch a
stopped "auto pg_prewarm" bgworker. The reason I introduced a launcher
SQL utility function is that the running auto pg_prewarm can be stopped
by the user by setting dump_interval to -1; so if the user wants to
restart the stopped auto pg_prewarm (this time to dump only, in order
to prewarm on the next restart), he can use that utility. The user can
launch the auto pg_prewarm to dump periodically while the server is
still running. If that was not the concern, I think I misunderstood the
comments and overdid the design. So, as a first patch, I will keep
things simple. Also, using a separate process for the prewarm and dump
activities was a bad design, hence I reverted that as well. The auto
pg_prewarm can only be launched by preloading the library. I can add
additional utilities once we can formalize what is really needed out of
it.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


pg_auto_prewarm_05.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] [POC] A better way to expand hash indexes.

2017-02-21 Thread Mithun Cy
Thanks, Amit

On Mon, Feb 20, 2017 at 9:51 PM, Amit Kapila  wrote:
> How will high and lowmask calculations work in this new strategy?
> Till now they always work on doubling strategy and I don't see you
> have changed anything related to that code.  Check below places.

It is important that the mask has to be (2^x) -1, if we have to retain
the same hash map function. So mask variables will take same values as
before. Only place I think we need change is  _hash_metapinit();
unfortunately, I did not test for the case where we build the hash
index on already existing tuples. Now I have fixed in the latest
patch.


> Till now, we have worked hard for not changing the page format in a
> backward incompatible way, so it will be better if we could find some
> way of doing this without changing the meta page format in a backward
> incompatible way.

We are not adding any new variables or deleting any; we increase the
size of hashm_spares, and hence the mapping functions have to be
adjusted. The problem is the block allocation, and its management is
based on the fact that all of the buckets (2^x in number) belonging to
a particular split-point are allocated at once and together.
hashm_spares is used to record those allocations, and that will be used
further by the map functions to reach a particular block in the file.
If we want to change the way we allocate the buckets, then hashm_spares
will change and hence the mapping functions. So I do not think we can
avoid the incompatibility issue.

One thing I can think of is to increase the hashm_version of the hash
index; then, for old indexes, we can continue to use the doubling
method and its mapping, and for new indexes, we can use the new way as
above.

> Have you considered to store some information in
> shared memory based on which we can decide how much percentage of
> buckets are allocated in current table half?  I think we might be able
> to construct this information after crash recovery as well.

I think all of the above data has to be persistent. I am not able to
understand what should or can be stored in shared memory. Can you
please correct me if I am wrong?


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


expand_hashbucket_efficiently_02
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect: Hash index support

2017-02-21 Thread Mithun Cy
On Fri, Feb 10, 2017 at 1:06 AM, Robert Haas  wrote:

> Alright, committed with a little further hacking.

I did pull the latest code, and tried
Test:

create table t1(t int);
create index i1 on t1 using hash(t);
insert into t1 select generate_series(1, 1000);

postgres=# SELECT spares FROM hash_metapage_info(get_raw_page('i1', 0));
   spares

 {0,0,0,1,9,17,33,65,-127,1,1,0,1,-1,-1,-4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0}

The spares are showing negative numbers; I think the wrong type has
been chosen, since the values seem to wrap around at 127. The spares
are defined as follows:
uint32 hashm_spares[HASH_MAX_SPLITPOINTS]; /* spare pages before each
splitpoint */
They should always be positive.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] [POC] A better way to expand hash indexes.

2017-02-17 Thread Mithun Cy
Hi all,

As of now, we expand the hash index by doubling the number of bucket
blocks. But unfortunately, those blocks will not be used immediately.
So I thought that if we can defer bucket block allocation by some
factor, the hash index size can grow much more efficiently. I have
written a POC patch which does the following: say at overflow point 'i'
we need to add x = 2^(i-1) new buckets as per the old code; I think we
can do this addition of buckets in a controlled way. Instead of adding
all 'x' bucket blocks at a time, the patch adds x/4 blocks at a time,
and once those blocks are consumed, it adds the next installment of x/4
blocks. This continues until all 'x' blocks of overflow point 'i' are
allocated. My test results show the index size grows in a more
efficient way with the above patch.

Note: This patch is just a POC. It can have bugs, and I still have to
update the comments in the code and the README related to the new
changes.

Test:
create table t1(t int);
create index i1 on t1 using hash(t);
And then, Incrementally add rows as below.
insert into t1 select generate_series(1, 1000);


records      base index    new patch
in million   size (MB)     index size (MB)

10           384           384
20           768           768
30           1417          1161
40           1531          1531
50           2556          1788
60           2785          2273
70           2963          2709
80           3060          3061
90           5111          3575
100          5111          3575

To implement such an incremental addition of bucket blocks I have to
increase the size of the hashm_spares array in the meta page by four
times. Also, the mapping methods which map a total number of buckets to
a split-point position in the hashm_spares array need to be changed.
These changes create backward compatibility issues.
Implementation Details in brief:
===
Each element of hashm_spares (we call it an overflow point) is expanded
into 4 slots {0, 1, 2, 3}. If 'x' (a power-of-2 number) is the total
number of buckets to be added before the overflow point, we add only a
quarter of it (x/4) to each slot; once we have consumed the previously
added blocks, we add the next quarter of buckets, and so on.
As in the old code, the new hashm_spares[i] stores the total overflow
pages allocated between those bucket allocations.
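A small sketch of how indexing into the enlarged spares array could
look under this scheme (my own illustration with made-up names; the
patch's actual macros may differ):

/* With every overflow point expanded into 4 slots, the enlarged spares
 * array is indexed by (overflow point, slot). */
static int
spares_slot_index(int overflow_point, int slot) /* slot in 0..3 */
{
    return overflow_point * 4 + slot;
}

/* Buckets covered by one slot of overflow point i: a quarter of the
 * x = 2^(i-1) buckets that the old code would have added all at once. */
static unsigned int
buckets_per_slot(unsigned int overflow_point)
{
    unsigned int x = (overflow_point == 0) ? 1 : (1U << (overflow_point - 1));

    return (x >= 4) ? x / 4 : x;    /* small overflow points cannot be quartered */
}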


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


expand_hashbucket_efficiently_01
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache Hash Index meta page.

2017-02-07 Thread Mithun Cy
On Tue, Feb 7, 2017 at 11:21 PM, Erik Rijkers  wrote:
> On 2017-02-07 18:41, Robert Haas wrote:
>>
>> Committed with some changes (which I noted in the commit message).

Thanks, Robert and all who have reviewed the patch and given their
valuable comments.

> This has caused a warning with gcc 6.20:
>
> hashpage.c: In function ‘_hash_getcachedmetap’:
> hashpage.c:1245:20: warning: ‘cache’ may be used uninitialized in this
> function [-Wmaybe-uninitialized]
> rel->rd_amcache = cache;
> ^~~

Yes, I also see the warning. I think the compiler is not able to see
that cache is always initialized and used only under the condition
if (rel->rd_amcache == NULL). I think to make the compiler happy we can
initialize cache to NULL when it is defined.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


cache_metap_compiler_warning01
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-02-07 Thread Mithun Cy
On Tue, Feb 7, 2017 at 6:11 PM, Beena Emerson  wrote:
>  Yes it would be better to have only one pg_prewarm worker as the loader is
> idle for the entire server run time after the initial load activity of few
> secs.
Sorry, that is again a bug in the code. The code to handle SIGUSR1
somehow got deleted before I submitted patch_03 and I failed to notice
it. As the code stands, the loader bgworker is waiting on the latch to
know the status of the dump bgworker; actually, the loader bgworker
should exit right after launching the dump bgworker. I will try to fix
this and the other comments given by you in my next patch.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-02-07 Thread Mithun Cy
Thanks Beena,

On Tue, Feb 7, 2017 at 4:46 PM, Beena Emerson  wrote:
> Few more comments:
>
> = Background worker messages:
>
> - Workers when launched, show messages like: "logical replication launcher
> started”, "autovacuum launcher started”. We should probably have a similar
> message to show that the pg_prewarm load and dump bgworker has started.

-- Thanks, I will add startup and shutdown messages.

> - "auto pg_prewarm load: number of buffers to load x”, other messages show
> space before and after “:”, we should keep it consistent through out.
>

-- I think you are testing patch 03. The latest patch_04 has corrected
this. Can you please re-test it?

>
> = Action of -1.
> I thought we decided that interval value of -1 would mean that the auto
> prewarm worker will not be run at all. With current code, -1 is explained to
> mean it will not dump. I noticed that reloading with new option as -1 stops
> both the workers but restarting loads the data and then quits. Why does it
> allow loading with -1? Please explain this better in the documents.
>

-- '-1' means we do not want to dump at all, so we decide not to
continue with the launched bgworker and it exits. As per your first
comment, if I register the startup and shutdown messages for auto
pg_prewarm, I think it will look better. I will try to explain it in a
better way in the documentation. The "auto pg_prewarm load" worker will
not be affected by the dump_interval value: it will always start, load
(prewarm) and then exit.

>
> = launch_pg_prewarm_dump()

> =# SELECT launch_pg_prewarm_dump();
>  launch_pg_prewarm_dump
> 
>   53552
> (1 row)
>
> $ ps -ef | grep 53552
> b_emers+  53555   4391  0 16:21 pts/100:00:00 grep --color=auto 53552

-- If dump_interval = -1 "auto pg_prewarm dump" will exit immediately.

> = Function names
> - load_now could be better named as load_file, load_dumpfile or similar.
> - dump_now -> dump_buffer or  better?

I chose load_now and dump_now to indicate that we are doing it
immediately, as invoking them was otherwise based on a timer/state.
Probably we can improve on that, but dump_buffer and load_file may not
be the right replacements.

>
> = Corrupt file
> if the dump file is corrupted, the system crashes and the prewarm bgworkers
> are not restarted. This needs to be handled better.
>
> WARNING:  terminating connection because of crash of another server process
> 2017-02-07 16:36:58.680 IST [54252] DETAIL:  The postmaster has commanded
> this server process to roll back the current transaction and exit, because
> another server process exited abnormally and possibly corrupted shared
> memory

--- Can you please paste your autopgprewarm.save file? I edited the
file manually to contain some illegal entries but did not see any
crash; we only failed to load, as the block numbers were invalid.
Please share your tests so that I can reproduce the same.

> = Documentation
>
> I feel the documentation needs to be improved greatly.
>
> - The first para in pg_prewarm should mention the autoload feature too.
>
> - The new section should be named “The pg_prewarm autoload” or something
> better. "auto pg_prewarm bgworker” does not seem appropriate.  The
> configuration parameter should also be listed out clearly like in auth-delay
> page. The new function launch_pg_prewarm_dump() should be listed under
> Functions.

-- Thanks, I will try to improve the documentation and put more details there.


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-02-07 Thread Mithun Cy
On Tue, Feb 7, 2017 at 12:24 PM, Amit Kapila  wrote:
> On Tue, Feb 7, 2017 at 11:53 AM, Beena Emerson  
> wrote:
>> Are 2 workers required?
>>
>
> I think in the new design there is a provision of launching the worker
> dynamically to dump the buffers, so there seems to be a need of
> separate workers for loading and dumping the buffers.  However, there
> is no explanation in the patch or otherwise when and why this needs a
> pair of workers.  Also, if the dump interval is greater than zero,
> then do we really need to separately register a dynamic worker?

We have introduced a new value, -1, for pg_prewarm.dump_interval, which
means we will not dump at all. In that state, I thought the auto
pg_prewarm process need not run at all, so I coded it to exit
immediately. But if the user decides to start the auto pg_prewarm only
to dump, without restarting the server, I have introduced a launcher
function "launch_pg_prewarm_dump" to restart the auto pg_prewarm for
dumping only. Since we can now launch a worker only to dump, I thought
we could redistribute the code between two workers, one which only does
the prewarm (load only) and another which dumps periodically. This
helped me to modularize and reuse the code. So once the load worker has
finished its job, it registers a dump worker and then exits.
But if max_worker_processes is not enough to launch the "auto
pg_prewarm dump" bgworker, we throw an error:
2017-02-07 14:51:59.789 IST [50481] ERROR:  registering dynamic
bgworker "auto pg_prewarm dump" failed
2017-02-07 14:51:59.789 IST [50481] HINT:  Consider increasing
configuration parameter "max_worker_processes".

Now, thinking again: instead of raising such an error and then
correcting it by explicitly launching the auto pg_prewarm dump bgworker
through launch_pg_prewarm_dump(), I can go back to the original design
where there will be one worker which loads and then dumps periodically,
and launch_pg_prewarm_dump() will relaunch the dump-only activity of
that worker. Does this sound good?

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-02-07 Thread Mithun Cy
On Tue, Feb 7, 2017 at 11:53 AM, Beena Emerson  wrote:
> launched by other  applications. Also with max_worker_processes = 2 and
> restart, the system crashes when the 2nd worker is not launched:
> 2017-02-07 11:36:39.132 IST [20573] LOG:  auto pg_prewarm load : number of
> buffers actually tried to load 64
> 2017-02-07 11:36:39.143 IST [18014] LOG:  worker process: auto pg_prewarm
> load (PID 20573) was terminated by signal 11: Segmentation fault

The SEGFAULT was a coding mistake: I had called the C-language function
directly without initializing the FunctionCallInfo. Thanks for raising
this. The patch below fixes it.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


pg_auto_prewarm_04.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-02-06 Thread Mithun Cy
Hi all,
Here is the new patch which fixes all of the above comments. I have
changed the design a bit, as described below.

What is it?
===
A pair of bgworkers: one which automatically dumps the buffer pool's
block info at a given interval, and another which loads those blocks
into the buffer pool when the server restarts.

How does it work?
=
When the shared library pg_prewarm is preloaded during server startup,
a bgworker "auto pg_prewarm load" is launched immediately after the
server is started. The bgworker will start loading blocks obtained from
the block info entries in $PGDATA/AUTO_PG_PREWARM_FILE, for as long as
there are free buffers in the buffer pool. This way we do not replace
any new blocks which were loaded either by the recovery process or by
querying clients.

Once the "auto pg_prewarm load" bgworker has completed its job, it
will register a dynamic bgworker "auto pg_prewarm dump" which has to
be launched
when the server reaches a consistent state. The new bgworker will
periodically scan the buffer pool and then dump the meta info of
blocks
which are currently in the buffer pool. The GUC
pg_prewarm.dump_interval if set > 0 indicates the minimum time
interval between two dumps. If
pg_prewarm.dump_interval is set to AT_PWARM_DUMP_AT_SHUTDOWN_ONLY the
bgworker will only dump at the time of server shutdown. If it is set
to AT_PWARM_LOAD_ONLY we do not want the bgworker to dump anymore, so
it stops there.

To relaunch a stopped "auto pg_prewarm dump" bgworker we can use the
utility function "launch_pg_prewarm_dump".

==
One problem I have left open is that multiple "auto pg_prewarm dump"
workers can be launched through launch_pg_prewarm_dump even if a
dump/load worker is already running. I can avoid this by dropping a
lock-file before starting the bgworkers, but if there is a simpler
method to avoid launching a duplicate bgworker, I can do that instead.
Any suggestion on this will be very helpful.
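For what it's worth, one possible shape of the lock-file idea, sketched
with plain POSIX calls (illustrative only; a flag in the extension's
shared memory would be another option):

#include <fcntl.h>
#include <unistd.h>
#include <stdbool.h>

/* Try to become the single dump/load worker by creating a lock file
 * exclusively; returns false if another worker already created it.
 * The caller must unlink(path) when it exits. */
static bool
acquire_singleton_lock(const char *path)
{
    int         fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);

    if (fd < 0)
        return false;           /* EEXIST: someone else holds the lock */
    close(fd);
    return true;
}

A stale file left behind by a crash would still need separate handling,
which is one reason a shared-memory flag may be the simpler route.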

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


pg_auto_prewarm_03.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-01-31 Thread Mithun Cy
On Tue, Jan 31, 2017 at 9:47 PM, Robert Haas  wrote:
> Now, I assume that this patch sorts the I/O (although I haven't
> checked that) and therefore I expect that the prewarm completes really
> fast.  If that's not the case, then that's bad.  But if it is the
> case, then it's not really hurting you even if the workload changes
> completely.

-- JFYI, yes, in the patch we load the sorted block info.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache Hash Index meta page.

2017-01-29 Thread Mithun Cy
> HashMetaPage _hash_getcachedmetap(Relation rel, Buffer *metabuf, bool
> force_refresh);
>
> If the cache is initialized and force_refresh is not true, then this
> just returns the cached data, and the metabuf argument isn't used.
> Otherwise, if *metabuf == InvalidBuffer, we set *metabuf to point to
> the metabuffer, pin and lock it, use it to set the cache, release the
> lock, and return with the pin still held.  If *metabuf !=
> InvalidBuffer, we assume it's pinned and return with it still pinned.

Thanks, Robert. I have made a new patch which tries to do the same.
Now I think the code looks less complicated.
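For illustration, a hedged sketch of how a caller would use the API as
described in the quoted text above (not code from the patch; it assumes
the usual hash AM headers and helpers such as _hash_dropbuf):

#include "postgres.h"
#include "access/hash.h"

/* Read hash meta data through the cache, pinning the real metapage
 * only if the cached copy has to be (re)built. */
static void
use_cached_metap(Relation rel)
{
    Buffer      metabuf = InvalidBuffer;
    HashMetaPage metap;

    metap = _hash_getcachedmetap(rel, &metabuf, false);

    /* ... consult metap->hashm_maxbucket, highmask, lowmask here ... */
    (void) metap;

    /* If the call had to read the metapage, we are left holding a pin. */
    if (BufferIsValid(metabuf))
        _hash_dropbuf(rel, metabuf);
}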

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


cache_hash_index_meta_page_14.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-01-26 Thread Mithun Cy
On Thu, Jan 26, 2017 at 8:45 PM, Peter Eisentraut
 wrote:
> Just a thought with an additional use case:  If I want to set up a
> standby for offloading queries, could I take the dump file from the
> primary or another existing standby, copy it to the new standby, and
> have it be warmed up to the state of the other instance from that?
>
> In my experience, that kind of use is just as interesting as preserving
> the buffers across a restart.

Initially, I did not think about this; thanks for asking. For now, we
dump the buffer pool info one record per block. If the BlockNums in the
new standby correspond to the same set of rows as on the server where
the dump was produced, I think we can directly use the dump file on the
new standby. All we need to do is drop the ".save" file into the data
directory and preload the library; the buffer pool will then be warmed
with the blocks mentioned in ".save".

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache Hash Index meta page.

2017-01-26 Thread Mithun Cy
On Tue, Jan 24, 2017 at 3:10 PM, Amit Kapila  wrote:
> 1.
> @@ -505,26 +505,22 @@ hashbulkdelete(IndexVacuumInfo *info,

> In the above flow, do we really need an updated metapage, can't we use
> the cached one?  We are already taking care of bucket split down in
> that function.

Yes, we can use the old cached metap entry; the only reason I decided
to use the latest metapage content is that the old code used to do
that. Also, the cached metap is used to avoid ad-hoc local saving of
the same and hence to unify the cached metap API. I did not intend to
save the metapage read here, which I thought would not be very useful:
if new buckets are added, we need to read the metapage at the end
anyway. I have taken your comments; now I only read the metap cache
which is already set.

> 2.
> The above two chunks of code look worse as compare to your previous
> patch.  I think what we can do is keep the patch ready with both the
> versions of _hash_getcachedmetap API (as you have in _v11 and _v12)
> and let committer take the final call.

The _v11 API was a self-contained one, but it does not hold a pin on the
metapage buffer, whereas in _v12 we hold the pin across two consecutive
reads of the metapage. I have taken your advice and am producing 2 different
patches.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


cache_hash_index_meta_page_13_donotholdpin.patch
Description: Binary data


cache_hash_index_meta_page_13_holdpin.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect: Hash index support

2017-01-24 Thread Mithun Cy
On Thu, Jan 19, 2017 at 5:08 PM, Ashutosh Sharma  wrote:

Thanks, Ashutosh and Jesper. I have tested the patch and do not have any
more comments, so I am marking it ready for committer.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-01-24 Thread Mithun Cy
On Fri, Oct 28, 2016 at 6:36 AM, Tsunakawa, Takayuki
 wrote:
> I welcome this feature!  I remember pg_hibernate did this.   I wonder what 
> happened to pg_hibernate.  Did you check it?
Thanks. When I checked pg_hibernate, there were two things people
complained about: one, buffer loading starts only after recovery is
finished and the database has reached a consistent state; two, it can
replace existing buffers which were loaded due to recovery and newly
connected clients. This solution tries to address both.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-01-24 Thread Mithun Cy
On Tue, Jan 24, 2017 at 5:07 AM, Jim Nasby  wrote:
> I took a look at this again, and it doesn't appear to be working for me. The 
> library is being loaded during startup, but I don't see any further activity 
> in the log, and I don't see an autoprewarm file in $PGDATA.

Hi Jim,
Thanks for looking into this patch. I just downloaded the patch and
applied it to the latest code. I can see the file "autoprewarm.save" in
$PGDATA, which is created and dumped at shutdown time, and activity
is logged as below:
2017-01-24 13:22:25.012 IST [91755] LOG:  Buffer Dump: saved metadata
of 59 blocks.

In my code, by default we only dump at shutdown time. If we want to
dump at a regular interval, then we need to set the GUC
pg_autoprewarm.buff_dump_interval to a value > 0. I think I am missing
something while trying to recreate the bug reported above. Can you
please let me know what exactly you mean by the library not
working?

> There needs to be some kind of documentation change as part of this patch.
Thanks, I will add an sgml change for the same.

I will try to address the remaining suggestions in my next patch,
based on their feasibility.


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache Hash Index meta page.

2017-01-17 Thread Mithun Cy
On Tue, Jan 17, 2017 at 10:07 AM, Amit Kapila  wrote:
> 1.
> (a) I think you can retain the previous comment or modify it slightly.
> Just removing the whole comment and replacing it with a single line
> seems like a step backward.

-- Fixed; just modified it to say that.

> (b) Another somewhat bigger problem is that with this new change it
> won't retain the pin on meta page till the end which means we might
> need to perform an I/O again during operation to fetch the meta page.
> AFAICS, you have just changed it so that you can call new API
> _hash_getcachedmetap, if that's true, then I think you have to find
> some other way of doing it.  BTW, why can't you design your new API
> such that it always take pinned metapage?  You can always release the
> pin in the caller if required.  I understand that you don't always
> need a metapage in that API, but I think the current usage of that API
> is also not that good.


-- Yes, what you say is right. I wanted to make the _hash_getcachedmetap
API self-sufficient. But two possible consecutive reads of the metapage
are always connected, as we pin the page in the buffer to avoid I/O. I
have now redesigned the API in such a way that the caller passes a pinned
meta page if we want to set the metapage cache; this gives us the
flexibility to use the cached meta page data in different places.


> 2.
> + if (bucket_opaque->hasho_prevblkno != InvalidBlockNumber ||
> + bucket_opaque->hasho_prevblkno > cachedmetap->hashm_maxbucket)
> + cachedmetap = _hash_getcachedmetap(rel, true);
>
> I don't understand the meaning of above if check.  It seems like you
> will update the metapage when previous block number is not a valid
> block number which will be true at the first split.  How will you
> ensure that there is a re-split and cached metapage is not relevant.
> I think if there is && in the above condition, then we can ensure it.
>

-- Oops, that was a mistake; corrected as you stated.
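
In other words, the corrected check only refreshes the cached metapage when
the bucket page records a maxbucket newer than the cached copy; sketched
here for clarity, the real change being in the attached patch:

	/* Refresh only if the bucket page saw a split newer than our cache. */
	if (bucket_opaque->hasho_prevblkno != InvalidBlockNumber &&
		bucket_opaque->hasho_prevblkno > cachedmetap->hashm_maxbucket)
		cachedmetap = _hash_getcachedmetap(rel, true);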

> 3.
> + Given a hashkey get the target bucket page with read lock, using cached
> + metapage. The getbucketbuf_from_hashkey method below explains the same.
> +
>
> All the sentences in algorithm start with small letters, then why do
> you need an exception for this sentence.  I think you don't need to
> add an empty line. Also, I think the usage of
> getbucketbuf_from_hashkey seems out of place.  How about writing it
> as:
>
> The usage of cached metapage is explained later.
>

-- Fixed as you have asked.

>
> 4.
> + If target bucket is split before metapage data was cached then we are
> + done.
> + Else first release the bucket page and then update the metapage cache
> + with latest metapage data.
>
> I think it is better to retain original text of readme and add about
> meta page update.
>

-- Fixed. Now, wherever it is meaningful, I have kept the original
wording. But the way we get to the right target buffer after the latest
split is slightly different from what the original code did, so there
is a slight modification to show that we use the metapage cache.

> 5.
> + Loop:
> ..
> ..
> + Loop again to reach the new target bucket.
>
> No need to write "Loop again ..", that is implicit.
>

-- Fixed as you have asked.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


cache_hash_index_meta_page_12.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect: Hash index support

2017-01-17 Thread Mithun Cy
On Tue, Jan 17, 2017 at 3:06 PM, Amit Kapila  wrote:
>
> I think your calculation is not right.  66 indicates LH_BUCKET_PAGE |
> LH_BUCKET_NEEDS_SPLIT_CLEANUP which is a valid state after the split.
> This flag will be cleared either during next split or when vacuum
> operates on that index page.

Oops, I made an error in the decimal-to-binary conversion. Sorry, I was wrong;
what you said is correct. There is no need to change the .sgml file.
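
For anyone decoding such flag values by hand, here is a minimal standalone
sketch; the two bit definitions are assumed to mirror
src/include/access/hash.h of that era (LH_BUCKET_PAGE = 1 << 1,
LH_BUCKET_NEEDS_SPLIT_CLEANUP = 1 << 6), so 66 = 2 | 64:

#include <stdio.h>

/* Assumed flag bits, mirroring src/include/access/hash.h at the time. */
#define LH_BUCKET_PAGE					(1 << 1)
#define LH_BUCKET_NEEDS_SPLIT_CLEANUP	(1 << 6)

int
main(void)
{
	unsigned int hasho_flag = 66;	/* the value seen in the test above */

	/* 66 = LH_BUCKET_PAGE | LH_BUCKET_NEEDS_SPLIT_CLEANUP */
	printf("bucket page: %s, needs split cleanup: %s\n",
		   (hasho_flag & LH_BUCKET_PAGE) ? "yes" : "no",
		   (hasho_flag & LH_BUCKET_NEEDS_SPLIT_CLEANUP) ? "yes" : "no");
	return 0;
}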

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect: Hash index support

2017-01-16 Thread Mithun Cy
>
> 7. I think it is not your bug, but probably a bug in Hash index
> itself; page flag is set to 66 (for below test); So the page is both
> LH_BUCKET_PAGE and LH_BITMAP_PAGE. Is not this a bug in hash index?
>
> I have inserted 3000 records. Hash index is on integer column.
> select hasho_flag FROM hash_page_stats(get_raw_page('i1', 1));
>  hasho_flag
> ------------
>  66
>

Here is the test for the same. After insertion of 3000 records, I think at
the first split we can see the bucket page flag is set with LH_BITMAP_PAGE.

create table t1( ti int);
create index i1 on t1 using hash(ti);
postgres=# select hasho_flag from hash_page_stats(get_raw_page('i1', 1));
 hasho_flag
------------
          2
(1 row)

postgres=# insert into t1 select generate_series(1, 1000);
INSERT 0 1000
postgres=# select hasho_flag from hash_page_stats(get_raw_page('i1', 1));
 hasho_flag
------------
          2
(1 row)

postgres=# insert into t1 select generate_series(1, 1000);
INSERT 0 1000
postgres=# select hasho_flag from hash_page_stats(get_raw_page('i1', 1));
 hasho_flag
------------
         66
(1 row)

I think if this is a bug, then the example given in pageinspect.sgml should
be changed in your patch after the bug fix.
+hasho_prevblkno | -1
+hasho_nextblkno | 8474
+hasho_bucket| 0
+hasho_flag  | 66
+hasho_page_id   | 65408



-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Tuple sort is broken. It crashes on simple test.

2017-01-16 Thread Mithun Cy
Test:
--
create table seq_tab(t int);
insert into seq_tab select generate_series(1, 1000);
select count(distinct t) from seq_tab;

#0  0x0094a9ad in pfree (pointer=0x0) at mcxt.c:1007
#1  0x00953be3 in mergeonerun (state=0x1611450) at tuplesort.c:2803
#2  0x00953824 in mergeruns (state=0x1611450) at tuplesort.c:2721
#3  0x009521bc in tuplesort_performsort (state=0x1611450) at
tuplesort.c:1813
#4  0x00662b85 in process_ordered_aggregate_single
(aggstate=0x160b540, pertrans=0x160d440, pergroupstate=0x160d3f0) at
nodeAgg.c:1178
#5  0x006636e0 in finalize_aggregates (aggstate=0x160b540,
peraggs=0x160c5e0, pergroup=0x160d3f0, currentSet=0) at nodeAgg.c:1600
#6  0x00664509 in agg_retrieve_direct (aggstate=0x160b540) at
nodeAgg.c:2266
#7  0x00663f66 in ExecAgg (node=0x160b540) at nodeAgg.c:1946
#8  0x00650ad2 in ExecProcNode (node=0x160b540) at execProcnode.c:503

Git bisect shows that the following commit caused it.

commit Id e94568ecc10f2638e542ae34f2990b821bbf90ac

Author: Heikki Linnakangas   2016-10-03 16:07:49
Committer: Heikki Linnakangas   2016-10-03 16:07:49

Change the way pre-reading in external sort's merge phase works.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] pageinspect: Hash index support

2017-01-14 Thread Mithun Cy
On Fri, Jan 13, 2017 at 12:49 AM, Jesper Pedersen
 wrote:
> Rebased, and removed the compile warn in hashfuncs.c

I did some testing and review of the patch. I did not see any major
issues, but there are a few minor cases for which I have suggestions.

1. Source File :  /doc/src/sgml/pageinspect.sgml
example given.
SELECT * FROM hash_page_type(get_raw_page('con_hash_index', 0));
can be changed to
SELECT hash_page_type(get_raw_page('con_hash_index', 0));

2. @verify_hash_page
I can easily produce this error right after a split, so there will
be pages which are not initialized before they are used. So an error saying
it is unexpected might be slightly misleading; I think the error message
should be improved.
postgres=# SELECT hash_page_type(get_raw_page('i1', 12));
ERROR:  index table contains unexpected zero page

3. @verify_hash_page

(errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("page is not a HASH page"),
+ errdetail("Expected %08x, got %08x.",

In the error message, "HASH" can be downcased to "hash". That makes it
consistent with other error messages like "page is not a hash
page of expected type".


4. The condition is raw_page_size != BLCKSZ, but the error message is
"input page too small"; I think the error message should be changed to
"invalid page size".

if (raw_page_size != BLCKSZ)
+ ereport(ERROR,
+ (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+ errmsg("input page too small"),
+ errdetail("Expected size %d, got %d",
+   BLCKSZ, raw_page_size)));


5. @hash_bitmap_info
The metabuf can be released once bitmapblkno is set; after that it is of no use.
_hash_relbuf(indexRel, metabuf);

Is a write lock required here for the bitmap page?
mapbuf = _hash_getbuf(indexRel, bitmapblkno, HASH_WRITE, LH_BITMAP_PAGE);


6. You have renamed convert_ovflblkno_to_bitno to
_hash_ovflblkno_to_bitno. But unfortunately, you also moved the
function down, so the diff appears as if you have completely replaced it.
I think we should put it back at the old place.

7. I think it is not your bug, but probably a bug in Hash index
itself; page flag is set to 66 (for below test); So the page is both
LH_BUCKET_PAGE and LH_BITMAP_PAGE. Is not this a bug in hash index?

I have inserted 3000 records. Hash index is on integer column.
select hasho_flag FROM hash_page_stats(get_raw_page('i1', 1));
 hasho_flag
------------
         66

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Typo in hashsearch.c

2017-01-13 Thread Mithun Cy
There is a typo in the comments of the function _hash_first(); attaching a fix for the same.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


hashsearch_typo01.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache Hash Index meta page.

2017-01-12 Thread Mithun Cy
On Wed, Jan 11, 2017 at 12:46 AM, Robert Haas  wrote:
> Can we adapt the ad-hoc caching logic in hashbulkdelete() to work with
> this new logic?  Or at least update the comments?
I have introduced a new function _hash_getcachedmetap in patch 11 [1];
with this, hashbulkdelete() can use the metapage cache instead of saving
it locally.
[1] cache_hash_index_meta_page_11.patch

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Cache Hash Index meta page.

2017-01-12 Thread Mithun Cy
On Fri, Jan 6, 2017 at 11:43 AM, Amit Kapila  wrote:
> Few more comments:
> 1.
> no need to two extra lines, one is sufficient and matches the nearby
> coding pattern.

-- Fixed.

> 2.
> Do you see anywhere else in the code the pattern of using @ symbol in
> comments before function name?

-- Fixed.

> 3.
>
> after this change, i think you can directly use bucket in
> _hash_finish_split instead of pageopaque->hasho_bucket

-- Fixed.

> 4.
>
> Won't the check similar to the existing check (if (*bufp ==
> so->hashso_bucket_buf || *bufp == so->hashso_split_bucket_buf)) just
> above this code will suffice the need?  If so, then you can check it
> once and use it in both places.
>

-- Fixed.

> 5. The reader and insertion algorithm needs to be updated in README.

-- Added info in README.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


cache_hash_index_meta_page_11.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] rewrite HeapSatisfiesHOTAndKey

2017-01-06 Thread Mithun Cy
On Sat, Jan 7, 2017 at 11:27 AM, Mithun Cy <mithun...@enterprisedb.com> wrote:
Sorry, the auto plain-text setting has disturbed the table indentation.
Attaching the spreadsheet for the same.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


HeapSatisfiesHOTAndKey_perf_results.ods
Description: application/vnd.oasis.opendocument.spreadsheet

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] rewrite HeapSatisfiesHOTAndKey

2017-01-06 Thread Mithun Cy
On Thu, Jan 5, 2017 at 6:15 PM, Amit Kapila  wrote:
>
> Your test and results look good, what kind of m/c you have used to
> test this.  Let me see if I or one of my colleague can do this and
> similar test on some high-end m/c.

As discussed with Amit, I have tried to run the same tests with the below
modifications:
1. Increased the total rows to 10 million.
2. Set fsync off.
3. Changed the test as below, updating all rows at a time.

VACUUM FULL;
BEGIN;
UPDATE testtab SET col2 = md5(random()::text);
ROLLBACK;

I have run these tests on IBM power2, which has sufficient RAM. I have
set shared_buffers=32GB.

My results show that after this patch there is a slight increase in
response time (psql \timing was used) for the above update statement,
around a 5 to 10% delay.


Run   Response time (ms), base   Response time (ms), patched   %Inc
1     158863.501                 167443.767                     5.4010304104
2     151061.793                 168908.536                    11.8142004312
3     153750.577                 164071.994                     6.7130915548
4     153639.165                 168364.225                     9.5841838245
5     149607.139                 166498.44                     11.2904378179


Under the same conditions, running the original test (updating rows which
satisfy the condition col1 = :value1), I did not see any regression.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache Hash Index meta page.

2017-01-04 Thread Mithun Cy
On Wed, Jan 4, 2017 at 5:21 PM, Mithun Cy <mithun...@enterprisedb.com> wrote:
I have rebased the patch to fix one compilation warning in
_hash_doinsert, where the variable bucket was used only for an Assert
but was not declared as such.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


cache_hash_index_meta_page_10.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache Hash Index meta page.

2017-01-04 Thread Mithun Cy
On Wed, Jan 4, 2017 at 4:19 PM, Mithun Cy <mithun...@enterprisedb.com> wrote:
> As part of the performance/space analysis of hash index on the varchar data type
> with this patch, I have run some tests for the same with a modified pgbench.
I forgot to mention that all these tests were run on the power2 machine,
which has 192 hyperthreads.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache Hash Index meta page.

2017-01-04 Thread Mithun Cy
On Tue, Jan 3, 2017 at 12:05 PM, Mithun Cy <mithun...@enterprisedb.com> wrote:

As part of the performance/space analysis of hash index on the varchar data
type with this patch, I have run some tests for the same with a modified
pgbench. The modifications include:
1. Changed the aid column of the pgbench_accounts table from int to varchar(x).
2. To generate unique values as before, inserted the stringified integer
repeatedly to fill x.

I have mainly tested varchar(30) and varchar(90).
The document below has the detailed report on the captured data. The new hash
indexes show roughly 25% better performance than btree and, as expected, are
very space efficient.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


hash_index_sizes_experiment_01.ods
Description: application/vnd.oasis.opendocument.spreadsheet

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache Hash Index meta page.

2017-01-02 Thread Mithun Cy
Thanks, Amit, for the detailed review and for pointing out the various issues
in the patch. I have tried to address all of your comments as below.

On Mon, Jan 2, 2017 at 11:29 AM, Amit Kapila  wrote:

> 1.
> usage "into to .." in above comment seems to be wrong.usage "into to .." in 
> above comment seems to be wrong.usage "into to .." in above comment seems to 
> be wrong.usage "into to .." in above comment seems to be wrong.

-- Fixed.

> 2.
> In the above comment, a reference to HashMetaCache is confusing, are
> you referring to any structure here?  If you change this, consider to
> change the similar usage at other places in the patch as well.

-- Fixed. Removed HashMetaCache everywhere in the code; wherever
needed, added HashMetaPageData.

> Also, it is not clear to what do you mean by ".. to indicate bucket
> has been initialized .."?  assigning maxbucket as a special value to
> prevblkno is not related to initializing a bucket page.

-- To be consistent with our definition of prevblkno, I am setting
prevblkno to the current hashm_maxbucket when we initialize the bucket
page. I have tried to correct the comments accordingly.

> 3.
> In above comment, where you are saying ".. caching the some of the
> meta page data .." is slightly confusing, because it appears to me
> that you are caching whole of the metapage not some part of it.

-- Fixed. Changed to caching the HashMetaPageData.

> 4.
> +_hash_getbucketbuf_from_hashkey(uint32 hashkey, Relation rel,
>
> Generally, for _hash_* API's, we use rel as the first parameter, so I
> think it is better to maintain the same with this API as well.

-- Fixed.

> 5.
>   _hash_finish_split(rel, metabuf, buf, pageopaque->hasho_bucket,
> -   maxbucket, highmask, lowmask);
> +   usedmetap->hashm_maxbucket,
> +   usedmetap->hashm_highmask,
> +   usedmetap->hashm_lowmask);

> I think we should add an Assert for the validity of usedmetap before using it.

-- Fixed. Added Assert(usedmetap != NULL);

> 6. Getting few compilation errors in assert-enabled build.
>

-- Fixed. Sorry, I missed handling the bucket number which is needed in
the code below. I have fixed the same now.

> 7.
> I can understand what you want to say in above comment, but I think
> you can write it in somewhat shorter form.  There is no need to
> explain the exact check.

-- Fixed. I have tried to compress it into a few lines.

> 8.
> @@ -152,6 +152,11 @@ _hash_readprev(IndexScanDesc scan,
>   _hash_relbuf(rel, *bufp);
>
>   *bufp = InvalidBuffer;
> +
> + /* If it is a bucket page there will not be a prevblkno. */
> + if ((*opaquep)->hasho_flag & LH_BUCKET_PAGE)
> + return;
> +
>
> I don't think above check is right.  Even if it is a bucket page, we
> might need to scan bucket being populated, refer check else if
> (so->hashso_buc_populated && so->hashso_buc_split).  Apart from that,
> you can't access bucket page after releasing the lock on it.  Why have
> you added such a check?
>

-- Fixed. That was a mistake; now I have fixed it. Actually, if a bucket
page is passed to _hash_readprev then there will not be a prevblkno. But
from this patch onwards the prevblkno of a bucket page will store
hashm_maxbucket, so a check on BlockNumberIsValid(blkno) will not be
valid anymore. I have fixed it by adding the following:
+ /* If it is a bucket page there will not be a prevblkno. */
+ if (!((*opaquep)->hasho_flag & LH_BUCKET_PAGE))
  {
+ Assert(BlockNumberIsValid(blkno));

There are 2 other places which do the same, @_hash_freeovflpage and
@_hash_squeezebucket, but those will only be called for overflow pages,
so I did not make changes there. However, I think we should also change
them to keep things consistent.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


cache_hash_index_meta_page_09.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache Hash Index meta page.

2016-12-27 Thread Mithun Cy
On Tue, Dec 27, 2016 at 1:36 PM, Mithun Cy <mithun...@enterprisedb.com> wrote:
Oops, patch number should be 08 re-attaching same after renaming.

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 46df589..c45b3f0 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -32,19 +32,13 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	Buffer		bucket_buf;
 	Buffer		metabuf;
 	HashMetaPage metap;
-	BlockNumber blkno;
-	BlockNumber oldblkno;
-	bool		retry;
+	HashMetaPage usedmetap;
 	Page		metapage;
 	Page		page;
 	HashPageOpaque pageopaque;
 	Size		itemsz;
 	bool		do_expand;
 	uint32		hashkey;
-	Bucket		bucket;
-	uint32		maxbucket;
-	uint32		highmask;
-	uint32		lowmask;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -57,10 +51,15 @@ _hash_doinsert(Relation rel, IndexTuple itup)
  * need to be consistent */
 
 restart_insert:
-	/* Read the metapage */
-	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+
+	/*
+	 * Load the metapage. No need to lock as of now because we only access
+	 * page header element pd_pagesize_version in HashMaxItemSize(), this
+	 * element is constant and will not move while accessing. But we hold the
+	 * pin so we can use the metabuf while writing into to it below.
+	 */
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_NOLOCK, LH_META_PAGE);
 	metapage = BufferGetPage(metabuf);
-	metap = HashPageGetMeta(metapage);
 
 	/*
 	 * Check whether the item can fit on a hash page at all. (Eventually, we
@@ -76,59 +75,7 @@ restart_insert:
 		itemsz, HashMaxItemSize(metapage)),
 			errhint("Values larger than a buffer page cannot be indexed.")));
 
-	oldblkno = InvalidBlockNumber;
-	retry = false;
-
-	/*
-	 * Loop until we get a lock on the correct target bucket.
-	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-	  metap->hashm_maxbucket,
-	  metap->hashm_highmask,
-	  metap->hashm_lowmask);
-
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
-
-		/*
-		 * Copy bucket mapping info now; refer the comment in
-		 * _hash_expandtable where we copy this information before calling
-		 * _hash_splitbucket to see why this is okay.
-		 */
-		maxbucket = metap->hashm_maxbucket;
-		highmask = metap->hashm_highmask;
-		lowmask = metap->hashm_lowmask;
-
-		/* Release metapage lock, but keep pin. */
-		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
-
-		/*
-		 * If the previous iteration of this loop locked the primary page of
-		 * what is still the correct target bucket, we are done.  Otherwise,
-		 * drop any old lock before acquiring the new one.
-		 */
-		if (retry)
-		{
-			if (oldblkno == blkno)
-break;
-			_hash_relbuf(rel, buf);
-		}
-
-		/* Fetch and lock the primary bucket page for the target bucket */
-		buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
-
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		LockBuffer(metabuf, BUFFER_LOCK_SHARE);
-		oldblkno = blkno;
-		retry = true;
-	}
+	buf = _hash_getbucketbuf_from_hashkey(hashkey, rel, HASH_WRITE, &usedmetap);
 
 	/* remember the primary bucket buffer to release the pin on it at end. */
 	bucket_buf = buf;
@@ -152,7 +99,9 @@ restart_insert:
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 		_hash_finish_split(rel, metabuf, buf, pageopaque->hasho_bucket,
-		   maxbucket, highmask, lowmask);
+		   usedmetap->hashm_maxbucket,
+		   usedmetap->hashm_highmask,
+		   usedmetap->hashm_lowmask);
 
 		/* release the pin on old and meta buffer.  retry for insert. */
 		_hash_dropbuf(rel, buf);
@@ -225,6 +174,7 @@ restart_insert:
 	 */
 	LockBuffer(metabuf, BUFFER_LOCK_EXCLUSIVE);
 
+	metap = HashPageGetMeta(metapage);
 	metap->hashm_ntuples += 1;
 
 	/* Make sure this stays in sync with _hash_expandtable() */
diff --git a/src/backend/access/hash/hashpage.c b/src/backend/access/hash/hashpage.c
index 45e184c..c726909 100644
--- a/src/backend/access/hash/hashpage.c
+++ b/src/backend/access/hash/hashpage.c
@@ -434,7 +434,13 @@ _hash_metapinit(Relation rel, double num_tuples, ForkNumber forkNum)
 		buf = _hash_getnewbuf(rel, BUCKET_TO_BLKNO(metap, i), forkNum);
 		pg = BufferGetPage(buf);
 		pageopaque = (HashPageOpaque) PageGetSpecialPointer(pg);
-		pageopaque->hasho_prevblkno = InvalidBlockNumber;
+
+		/*
+		 * Setting hasho_prevblkno of bucket page with latest maxbucket number
+		 * to indicate bucket has been initialized and need to reconstruct
+		 * HashMetaCache if it is older.
+		 */
+		pageopaque->hasho_prevblkno = metap->hashm_maxbucket;
 		pageopaque->hasho_nextblkno = InvalidBlockNumber;
 		pageopaque->hasho_bucket = i;
 		pageopaque->hasho_flag = 

Re: [HACKERS] Cache Hash Index meta page.

2016-12-27 Thread Mithun Cy
On Thu, Dec 22, 2016 at 12:17 PM, Amit Kapila <amit.kapil...@gmail.com> wrote:
> On Wed, Dec 21, 2016 at 9:26 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Tue, Dec 20, 2016 at 2:25 PM, Mithun Cy <mithun...@enterprisedb.com> 
>> wrote:
>>> -- I think if it is okay, I can document same for each member of 
>>> HashMetaPageData whether to read from cached from meta page or directly 
>>> from current meta page. Below briefly I have commented for each member. If 
>>> you suggest I can go with that approach, I will produce a neat patch for 
>>> same.
>>
>> Plain text emails are preferred on this list.

Sorry, I have set the mail to plain text mode now.

> I think this will make metpage cache access somewhat
> similar to what we have in btree where we use cache to access
> rootpage.  Will something like that address your concern?

Thanks. Just like _bt_getroot, I am introducing a new function
_hash_getbucketbuf_from_hashkey(), which will give the target bucket
buffer for the given hashkey. This will use the cached metapage
contents instead of reading the meta page buffer if the cached data is
valid. There are 2 places which can use this service: 1. _hash_first and
2. _hash_doinsert.
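
To make the intended flow concrete, here is a rough outline of how such a
function can use the cached metapage. Names and signatures follow later
versions of the patches in this thread; treat this as a sketch rather than
the attached code:

Buffer
_hash_getbucketbuf_from_hashkey(uint32 hashkey, Relation rel, int access,
								HashMetaPage *cachedmetap)
{
	HashMetaPage metap;
	Buffer		buf = InvalidBuffer;
	Buffer		metabuf = InvalidBuffer;
	Bucket		bucket;
	BlockNumber blkno;
	HashPageOpaque opaque;

	metap = _hash_getcachedmetap(rel, &metabuf, false);

	for (;;)
	{
		/* Map the hash key to a bucket using the cached metapage fields. */
		bucket = _hash_hashkey2bucket(hashkey,
									  metap->hashm_maxbucket,
									  metap->hashm_highmask,
									  metap->hashm_lowmask);
		blkno = BUCKET_TO_BLKNO(metap, bucket);

		/* Fetch and lock the primary page of that bucket. */
		if (BufferIsValid(buf))
			_hash_relbuf(rel, buf);
		buf = _hash_getbuf(rel, blkno, access, LH_BUCKET_PAGE);
		opaque = (HashPageOpaque) PageGetSpecialPointer(BufferGetPage(buf));

		/*
		 * The bucket page remembers (in hasho_prevblkno) the maxbucket value
		 * at the time it was last initialized; if that is not newer than our
		 * cached maxbucket, the cache was valid and we have the right bucket.
		 */
		if (opaque->hasho_prevblkno == InvalidBlockNumber ||
			opaque->hasho_prevblkno <= metap->hashm_maxbucket)
			break;

		/* Otherwise a split happened after we cached; refresh and retry. */
		metap = _hash_getcachedmetap(rel, &metabuf, true);
	}

	if (BufferIsValid(metabuf))
		_hash_dropbuf(rel, metabuf);
	if (cachedmetap)
		*cachedmetap = metap;
	return buf;
}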

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com
diff --git a/src/backend/access/hash/hashinsert.c b/src/backend/access/hash/hashinsert.c
index 46df589..c45b3f0 100644
--- a/src/backend/access/hash/hashinsert.c
+++ b/src/backend/access/hash/hashinsert.c
@@ -32,19 +32,13 @@ _hash_doinsert(Relation rel, IndexTuple itup)
 	Buffer		bucket_buf;
 	Buffer		metabuf;
 	HashMetaPage metap;
-	BlockNumber blkno;
-	BlockNumber oldblkno;
-	bool		retry;
+	HashMetaPage usedmetap;
 	Page		metapage;
 	Page		page;
 	HashPageOpaque pageopaque;
 	Size		itemsz;
 	bool		do_expand;
 	uint32		hashkey;
-	Bucket		bucket;
-	uint32		maxbucket;
-	uint32		highmask;
-	uint32		lowmask;
 
 	/*
 	 * Get the hash key for the item (it's stored in the index tuple itself).
@@ -57,10 +51,15 @@ _hash_doinsert(Relation rel, IndexTuple itup)
  * need to be consistent */
 
 restart_insert:
-	/* Read the metapage */
-	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
+
+	/*
+	 * Load the metapage. No need to lock as of now because we only access
+	 * page header element pd_pagesize_version in HashMaxItemSize(), this
+	 * element is constant and will not move while accessing. But we hold the
+	 * pin so we can use the metabuf while writing into to it below.
+	 */
+	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_NOLOCK, LH_META_PAGE);
 	metapage = BufferGetPage(metabuf);
-	metap = HashPageGetMeta(metapage);
 
 	/*
 	 * Check whether the item can fit on a hash page at all. (Eventually, we
@@ -76,59 +75,7 @@ restart_insert:
 		itemsz, HashMaxItemSize(metapage)),
 			errhint("Values larger than a buffer page cannot be indexed.")));
 
-	oldblkno = InvalidBlockNumber;
-	retry = false;
-
-	/*
-	 * Loop until we get a lock on the correct target bucket.
-	 */
-	for (;;)
-	{
-		/*
-		 * Compute the target bucket number, and convert to block number.
-		 */
-		bucket = _hash_hashkey2bucket(hashkey,
-	  metap->hashm_maxbucket,
-	  metap->hashm_highmask,
-	  metap->hashm_lowmask);
-
-		blkno = BUCKET_TO_BLKNO(metap, bucket);
-
-		/*
-		 * Copy bucket mapping info now; refer the comment in
-		 * _hash_expandtable where we copy this information before calling
-		 * _hash_splitbucket to see why this is okay.
-		 */
-		maxbucket = metap->hashm_maxbucket;
-		highmask = metap->hashm_highmask;
-		lowmask = metap->hashm_lowmask;
-
-		/* Release metapage lock, but keep pin. */
-		LockBuffer(metabuf, BUFFER_LOCK_UNLOCK);
-
-		/*
-		 * If the previous iteration of this loop locked the primary page of
-		 * what is still the correct target bucket, we are done.  Otherwise,
-		 * drop any old lock before acquiring the new one.
-		 */
-		if (retry)
-		{
-			if (oldblkno == blkno)
-break;
-			_hash_relbuf(rel, buf);
-		}
-
-		/* Fetch and lock the primary bucket page for the target bucket */
-		buf = _hash_getbuf(rel, blkno, HASH_WRITE, LH_BUCKET_PAGE);
-
-		/*
-		 * Reacquire metapage lock and check that no bucket split has taken
-		 * place while we were awaiting the bucket lock.
-		 */
-		LockBuffer(metabuf, BUFFER_LOCK_SHARE);
-		oldblkno = blkno;
-		retry = true;
-	}
+	buf = _hash_getbucketbuf_from_hashkey(hashkey, rel, HASH_WRITE, &usedmetap);
 
 	/* remember the primary bucket buffer to release the pin on it at end. */
 	bucket_buf = buf;
@@ -152,7 +99,9 @@ restart_insert:
 		LockBuffer(buf, BUFFER_LOCK_UNLOCK);
 
 		_hash_finish_split(rel, metabuf, buf, pageopaque->hasho_bucket,
-		   maxbucket, highmask, lowmask);
+		   usedmetap->hashm_maxbucket,
+		   usedmetap->hashm_highmask,
+		   usedmetap->hashm_lowmask);
 
 		/* release the pin on old and meta buffer.  retry f

[HACKERS] Ilegal type cast in _hash_doinsert()

2016-12-21 Thread Mithun Cy
There seems to be a bug in the base code of _hash_doinsert().
	 * XXX this is useless code if we are only storing hash keys.
	 */
	if (itemsz > HashMaxItemSize((Page) metap))
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
				 errmsg("index row size %zu exceeds hash maximum %zu",
						itemsz, HashMaxItemSize((Page) metap)),
				 errhint("Values larger than a buffer page cannot be indexed.")));

 "metap" (HashMetaPage) is not a Page (they the entirely different
structure), so above casting is not legal. Added a patch to correct same by
passing actual Page which stores "HashMetaPage".
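
For illustration only, the fix essentially passes the metapage Page rather
than the HashMetaPage pointer, along the lines of the hunks that appear
elsewhere in this digest (an excerpt-style sketch of _hash_doinsert, not
the attached patch):

	/* Read and pin the metapage; keep the Page pointer around. */
	metabuf = _hash_getbuf(rel, HASH_METAPAGE, HASH_READ, LH_META_PAGE);
	metapage = BufferGetPage(metabuf);
	metap = HashPageGetMeta(metapage);

	/* Pass the Page, not the HashMetaPage struct, to HashMaxItemSize(). */
	if (itemsz > HashMaxItemSize(metapage))
		ereport(ERROR,
				(errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
				 errmsg("index row size %zu exceeds hash maximum %zu",
						itemsz, HashMaxItemSize(metapage)),
				 errhint("Values larger than a buffer page cannot be indexed.")));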

-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


hash_doinsert_illegal_cast_01.patch
Description: Binary data

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cache Hash Index meta page.

2016-12-20 Thread Mithun Cy
On Tue, Dec 20, 2016 at 3:51 AM, Robert Haas <robertmh...@gmail.com> wrote:

> On Fri, Dec 16, 2016 at 5:16 AM, Mithun Cy <mithun...@enterprisedb.com> wrote:
>>> Shouldn't _hash_doinsert() be using the cache, too?
>>
>> Yes, we have an opportunity there, added same in code. But one difference
>> is at the end at-least once we need to read the meta page to split and/or
>> write. Performance improvement might not be as much as read-only.
>
> Why would you need to read it at least once in one case but not the other?

For read-only: in _hash_first, if the target bucket is not split after
caching the meta page contents, we never need to read the metapage. But for
insert: in _hash_doinsert, at the end we modify the meta page to store the
number of tuples.

+  * Write-lock the metapage so we can increment the tuple count. After
+  * incrementing it, check to see if it's time for a split.
+ */
+ _hash_chgbufaccess(rel, metabuf, HASH_NOLOCK, HASH_WRITE);
+
+ metap->hashm_ntuples += 1;


> Well, I guess that makes it more appealing to cache the whole page at least
> in terms of raw number of bytes, but I suppose my real complaint here is
> that there don't seem to be any clear rules for where, whether, and to what
> extent we can rely on the cache to be valid.

--- Okay, I will revert back to caching the entire meta page.


> Without that, I think this patch is creating an extremely serious
> maintenance hazard for anyone who wants to try to modify this code in the
> future.  A future patch author needs to be able to tell what data they can
> use and what data they can't use, and why.
>


-- I think if it is okay, I can document, for each member of
HashMetaPageData, whether to read it from the cached meta page or
directly from the current meta page. Below I have briefly commented on
each member. If you suggest I go with that approach, I will produce a
neat patch for the same.

typedef struct HashMetaPageData
{

1. uint32 hashm_magic; /* magic no. for hash tables */

-- I think this should remain the same, so it can be read from the cached meta page.

2. uint32 hashm_version; /* version ID */

-- This is initialized once and never changed afterwards, so it can be read from the cached metapage.

3. double hashm_ntuples; /* number of tuples stored in the table */

-- This changes on every insert, so it should not be read from the cached data.

4. uint16 hashm_ffactor; /* target fill factor (tuples/bucket) */

-- This is initialized once and never changed afterwards, so it can be read from the cached metapage.

5. uint16 hashm_bsize; /* index page size (bytes) */

-- This is initialized once and never changed afterwards, so it can be read from the cached metapage.

6. uint16 hashm_bmsize; /* bitmap array size (bytes) - must be a power of 2 */

-- This is initialized once and never changed afterwards, so it can be read from the cached metapage.

7. uint16 hashm_bmshift; /* log2(bitmap array size in BITS) */

-- This is initialized once and never changed afterwards, so it can be read from the cached metapage.

8. If hashm_maxbucket, hashm_highmask and hashm_lowmask are all read and
cached at the same time while the metapage is locked, then the
key-to-bucket-number mapping function _hash_hashkey2bucket() will always
produce the same output. If a bucket is split after caching these elements
(which can be detected because old bucket pages never move once allocated
and we mark a bucket page's hasho_prevblkno with the incremented
hashm_maxbucket), we can invalidate them and re-read them from the meta
page. If the intention is not to save a metapage read while mapping the
key to a bucket page, then do not read them from the cached meta page.

 uint32 hashm_maxbucket; /* ID of maximum bucket in use */

 uint32 hashm_highmask; /* mask to modulo into entire table */

 uint32 hashm_lowmask; /* mask to modulo into lower half of table */

9.

 uint32 hashm_ovflpoint; /* splitpoint from which ovflpgs being
  * allocated */

-- Since this is used for allocation of overflow pages, we should get the
latest value directly from the meta page.

10.

 uint32 hashm_firstfree; /* lowest-number free ovflpage (bit#) */

-- Should always be read from the metapage directly.

11.

 uint32 hashm_nmaps; /* number of bitmap pages */

-- Should always be read from the metapage directly.

12.

 RegProcedure hashm_procid; /* hash procedure id from pg_proc */

-- Never used till now; when we do use it, we will need to decide its policy.

13.

 uint32 hashm_spares[HASH_MAX_SPLITPOINTS]; /* spare pages before
  * each splitpoint */

Spares before a given splitpoint never change, and bucket pages never move.
So, combined with the cached hashm_maxbucket, hashm_highmask and
hashm_lowmask (all read at the same time under lock), BUCKET_TO_BLKNO will
always produce the same output, pointing to the right physical block. It
should only be used to save a meta page read when we want to map a key to
a bucket block, as said above.

14.

 Bl

  1   2   >