Re: Speeding up GIST index creation for tsvectors

2021-08-01 Thread Amit Khandekar
On Sat, 20 Mar 2021 at 02:19, John Naylor  wrote:
> On Fri, Mar 19, 2021 at 8:57 AM Amit Khandekar  wrote:
> > Regarding the alignment changes... I have removed the code that
> > handled the leading identically unaligned bytes, for lack of evidence
> > that percentage of such cases is significant. Like I noted earlier,
> > for the tsearch data I used, identically unaligned cases were only 6%.
> > If I find scenarios where these cases can be significant after all and
> > if we cannot do anything in the gist index code, then we might have to
> > bring back the unaligned byte handling. I didn't get a chance to dig
> > deeper into the gist index implementation to see why they are not
> > always 8-byte aligned.
>
> I find it strange that something equivalent to char* is not randomly 
> misaligned, but rather seems to land only on 4-byte boundaries.
>
> [thinks] I'm guessing it's because of VARHDRSZ, but I'm not positive.
>
> FWIW, I anticipate some push back from the community because of the fact that 
> the optimization relies on statistical phenomena.

I dug into this issue for the tsvector type. It turns out that it's the
way the sign array elements are arranged that causes the pointers to
be misaligned:

Datum
gtsvector_picksplit(PG_FUNCTION_ARGS)
{
	..
	cache = (CACHESIGN *) palloc(sizeof(CACHESIGN) * (maxoff + 2));
	cache_sign = palloc(siglen * (maxoff + 2));

	for (j = 0; j < maxoff + 2; j++)
		cache[j].sign = &cache_sign[siglen * j];
	..
}

If siglen is not a multiple of 8 (say 700), cache[j].sign will in some
cases point to non-8-byte-aligned addresses, as you can see in the
above code snippet.

Replacing siglen with MAXALIGN64(siglen) in the above snippet gets rid
of the misalignment. This change applied over the 0001-v3 patch gives an
additional ~15% benefit. MAXALIGN64(siglen) will use a bit more
space, but for not-so-small siglens this looks worth doing. I haven't
yet checked into types other than tsvector.

I will get back to your other review comments; I thought, meanwhile, I
could post the above update first.




Re: speed up verifying UTF-8

2021-07-18 Thread Amit Khandekar
On Sat, 17 Jul 2021 at 04:48, John Naylor  wrote:
> v17-0001 is the same as v14. 0002 is a stripped-down implementation of Amit's
> chunk idea for multibyte, and it's pretty good on x86. On Power8, not so
> much. 0003 and 0004 are shot-in-the-dark guesses to improve it on Power8,
> with some success, but end up making x86 weirdly slow, so I'm afraid that
> could happen on other platforms as well.

Thanks for trying the chunk approach. I tested your v17 versions on
Arm64. For the chinese characters, v17-0002 gave some improvement over
v14. But for all the other character sets, there was around 10%
degradation w.r.t. v14. I thought maybe the hton64 call and memcpy()
for each mb character might be the culprit, so I tried iterating over
all the characters in the chunk within the same pg_utf8_verify_one()
function by left-shifting the bits. But that worsened the figures. So
I gave up that idea.

Here are the numbers on Arm64:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1781 |  1095 |   628 |     944 |   1151

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     852 |   484 |   144 |     584 |    971

v17-0001+2:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     731 |   520 |   152 |     645 |   1118


Haven't looked at your v18 patch set yet.




Re: speed up verifying UTF-8

2021-07-14 Thread Amit Khandekar
On Tue, 13 Jul 2021 at 01:15, John Naylor  wrote:
> > It seems like it would be easy to have pg_utf8_verify_one in my proposed 
> > pg_utf8.h header and replace the body of pg_utf8_verifychar with it.
>
> 0001: I went ahead and tried this for v15, and also attempted some clean-up:
>
> - Rename pg_utf8_verify_one to pg_utf8_verifychar_internal.
> - Have pg_utf8_verifychar_internal return -1 for invalid input to match other 
> functions in the file. We could also do this for check_ascii, but it's not 
> quite the same thing, because the string could still have valid bytes in it, 
> just not enough to advance the pointer by the stride length.
> - Remove hard-coded numbers (not wedded to this).
>
> - Use a call to pg_utf8_verifychar in the slow path.
> - Reduce pg_utf8_verifychar to thin wrapper around 
> pg_utf8_verifychar_internal.

- check_ascii() seems to be used only for 64-bit chunks, so why not
remove the len argument and the len <= sizeof(int64) checks inside the
function? We could rename it to check_ascii64() for clarity.

- I was thinking, why not have a pg_utf8_verify64() that processes
64-bit chunks (or a 32-bit version)? In check_ascii(), we anyway
extract a 64-bit chunk from the string. We can use the same chunk to
extract the required bits of a two-byte or a four-byte char. This way
we can avoid extracting separate bytes like b1 = *s; b2 = s[1], etc.
More importantly, we can avoid the separate continuation-char checks
for each individual byte. Additionally, we can try to simplify the
subsequent overlong or surrogate char checks. Something like this:

int pg_utf8_verifychar_32(uint32 chunk)
{
	int		len, l;

	for (len = sizeof(chunk); len > 0; len -= l, chunk = chunk << (l * 8))
	{
		/* Is 2-byte lead */
		if ((chunk & 0xF0000000) == 0xC0000000)
		{
			l = 2;
			/* ... */
		}
		/* Is 3-byte lead */
		else if ((chunk & 0xF0000000) == 0xE0000000)
		{
			l = 3;
			if (len < l)
				break;

			/* b2 and b3 should be continuation bytes */
			if ((chunk & 0x00C0C000) != 0x00808000)
				return sizeof(chunk) - len;

			switch (chunk & 0xFF200000)
			{
					/*
					 * Check 3-byte overlong: 1110.0000 1001.1111 10xx.xxxx,
					 * i.e. (b1 == 0xE0 && b2 < 0xA0). We already know b2 is
					 * of the form 10xx.xxxx since it's a continuation char.
					 * Additionally, the condition b2 <= 0x9F means it is of
					 * the form 100x.xxxx, i.e. either 1000.xxxx or
					 * 1001.xxxx. So just verify that it is xx0x.xxxx.
					 */
				case 0xE0000000:
					return sizeof(chunk) - len;

					/*
					 * Check surrogate: 1110.1101 101x.xxxx 10xx.xxxx,
					 * i.e. (b1 == 0xED && b2 > 0x9F). Here, > 0x9F means
					 * either 1010.xxxx, 1011.xxxx, 1100.xxxx, or 1110.xxxx.
					 * The last two are not possible because b2 is a
					 * continuation char, so it has to be one of the first
					 * two. So just verify that it is xx1x.xxxx.
					 */
				case 0xED200000:
					return sizeof(chunk) - len;
				default:
					;
			}
		}
		/* Is 4-byte lead */
		else if ((chunk & 0xF0000000) == 0xF0000000)
		{
			/* ... */
			l = 4;
		}
		else
			return sizeof(chunk) - len;
	}
	return sizeof(chunk) - len;
}




Relaxing the sub-transaction restriction in parallel query

2021-06-13 Thread Amit Khandekar
Hi,

Currently we don't allow a sub-transaction to be spawned from inside a
parallel worker (and also from a leader who is in parallel mode). This
imposes a restriction that pl/pgsql functions that use an exception block
can't be marked parallel safe, even when the exception block is there only
to catch trivial errors such as divide-by-zero. I tried to look at the
implications of removing this sub-transaction restriction, and came up with
the attached WIP patch after considering the below points. I may have
missed other points, or may have assumed something wrong. So comments are
welcome.

- TBLOCK_PARALLEL_INPROGRESS

Now that there can be an in-progress sub-transaction in a parallel worker,
the sub-transaction states need to be accounted for. Rather than having new
transaction states such as TBLOCK_PARALLEL_SUBINPROGRESS, I removed the
existing TBLOCK_PARALLEL_INPROGRESS from the code. At a couple of places
is_parallel_worker is set if state is TBLOCK_PARALLEL_INPROGRESS. Instead,
for now, I have used (ParallelCurrentXids != NULL) to identify if it's a
worker in a valid transaction state. Maybe we can improve on this.
In EndTransactionBlock(), there is a FATAL error thrown if a
parallel worker is in progress. This seems to be a can't-happen case, so I
have removed this check. We need to think further about how to retain it.


- IsInParallelMode()

On HEAD, the parallel worker cannot have any sub-transactions, so
CurrentTransactionState always points to the TopTransactionStateData. And
when ParallelWorkerMain() calls EnterParallelMode(), from that point
onwards IsInParallelMode() always returns true. But with the patch,
CurrentTransactionState can point to some nest level down below, and in
that TransactionState, parallelModeLevel would be 0, so IsInParallelMode()
will return false in a sub-transaction, unless some other function happens
to explicitly call EnterParallelMode().

One option for making IsInParallelMode() always return true for worker is
to just check whether the worker is in a transaction (ParallelCurrentXids
!= NULL). Or else, check only the TopTransactionData->parallelModeLevel.
Still another option is for the new TransactionState to inherit
parallelModeLevel from its parent. I chose this option. It avoids
additional conditions in IsInParallelMode() specifically for workers.

Does this inherit-parent-parallelModeLevel option affect the leader code?
The functions calling EnterParallelMode() are : begin_parallel_vacuum,
_bt_begin_parallel, ParallelWorkerMain, CommitTransaction, ExecutePlan().
After entering Parallel mode, it does not look like a subtransaction will
be spawned at these places. If at all it does, on HEAD, the
IsInParallelMode() will return false, which does not sound right. For all
the child transactions, this function should return true. So w.r.t. this,
in fact inheriting parent's parallelModeLevel looks better.

Operations that are not allowed to run in worker would continue to be
disallowed in a worker sub-transaction as well. E.g. assigning new xid,
heap_insert, etc. These places already are using IsInParallelMode() which
takes care of guarding against such operations.

Just for archival ...
ExecutePlan() is called with use_parallel_mode=true when there was a gather
plan created, in which case it enters Parallel mode. From here, if the
backend happens to start a new subtransaction for some reason, it does
sound right for the Parallel mode to be true for this sub-transaction,
although I am not sure if there can be such a case.
In worker, as far as I understand, ExecutePlan() always gets called with
use_parallel_mode=false, because there is no gather plan in the worker. So
it does not enter Parallel mode. But because the worker is already in
parallel mode, it does not matter.


- List of ParallelContexts (pcxt_list) :
ParallelContext is created only in backends. So there are no implications
of the pcxt_list w.r.t. parallel workers spawning a subtransaction, because
pcxt_list will always be empty in workers.


- ParallelCurrentXids  :

A parallel worker always maintains a global flat sorted list of xids
representing all the xids that are considered current (i.e. the ones for
which TransactionIdIsCurrentTransactionId() returns true in the leader). So
this global list should continue to work regardless of the
sub-transaction nest level, since no new xids are created in the
worker.

- Savepoints :

Haven't considered savepoints. The restriction is retained for savepoints.

Thanks
-Amit Khandekar
Huawei Technologies
diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 441445927e..c46e32ba37 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -160,7 +160,6 @@ typedef enum TBlockState
 	TBLOCK_BEGIN,/* starting transaction block */
 	TBLOCK_INPROGRESS,			/* live transaction */
 	TBLOCK_IMPLICIT_INPROGRESS, /* live transaction after implicit BEGIN */
-	TBLOCK_PARALLEL_INPROGRESS, /* live transaction inside parallel worker */

Re: Result Cache node shows per-worker info even for workers not launched

2021-04-28 Thread Amit Khandekar
On Wed, 28 Apr 2021 at 13:54, David Rowley  wrote:

> So, given that I removed the parallel test in partition_prune.sql, and
> don't have any EXPLAIN ANALYZE output for parallel tests in
> resultcache.sql, it should be safe enough to put that cache_misses ==
> 0 test back into explain.c
>
> I've attached a patch to do this. The explain.c part is pretty similar
> to your patch, I just took my original code and comment.

Sounds good. And thanks for the cleanup patch, and the brief history.
Patch looks ok to me.




Re: Result Cache node shows per-worker info even for workers not launched

2021-04-28 Thread Amit Khandekar
On Wed, 28 Apr 2021 at 16:14, David Rowley  wrote:
> However, I did add 1 test that sets work_mem down to 64kB to ensure
> the eviction code does get some exercise. You'll notice that I pass
> "true" to explain_resultcache() to hide the hits and misses there.  We
> can't test the exact number of hits/misses/evictions there, but we can
> at least tell apart the zero and non-zero by how I coded
> explain_resultcache() to replace with Zero or N depending on if the
> number was zero or above zero.

Thanks for the explanation. I did realize after replying to Bharat
upthread, that I was wrong in assuming that the cache misses and cache
hits are always stable for non-parallel scans.




Re: Result Cache node shows per-worker info even for workers not launched

2021-04-28 Thread Amit Khandekar
On Wed, 28 Apr 2021 at 15:08, Bharath Rupireddy
 wrote:
>
> On Wed, Apr 28, 2021 at 1:54 PM David Rowley  wrote:
> > I plan to push this in the next 24 hours or so.
>
> I happen to see explain_resultcache in resultcache.sql, seems like two
> of the tests still have numbers for cache hits and misses - Hits: 980
> Misses: 20, won't these make tests unstable? Will these numbers be
> same across machines? Or is it that no buildfarm had caught these? The
> comment below says that, the hits and misses are not same across
> machines:
> -- Ensure we get some evictions.  We're unable to validate the hits and misses
> -- here as the number of entries that fit in the cache at once will vary
> -- between different machines.
>
> Should we remove the hide_hitmiss parameter in explain_resultcache and
> always print N for non-zero and Zero for 0 hits and misses? This
> clearly shows that we have 0 or non-zero hits or misses.
>
> Am I missing something?

I believe the assumption here is that with no workers involved, the
cache misses and hits are guaranteed to be exactly the same everywhere.




Result Cache node shows per-worker info even for workers not launched

2021-04-27 Thread Amit Khandekar
Hi,

If planned parallel workers do not get launched, the Result Cache plan
node shows all-0 stats for each of those workers:

tpch=# set max_parallel_workers TO 0;
SET
tpch=# explain analyze
select avg(l_discount) from orders, lineitem
where
l_orderkey = o_orderkey
and o_orderdate < date '1995-03-09'
and l_shipdate > date '1995-03-09';


QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=315012.87..315012.88 rows=1 width=32)
(actual time=27533.482..27533.598 rows=1 loops=1)
   ->  Gather  (cost=315012.44..315012.85 rows=4 width=32) (actual
time=27533.471..27533.587 rows=1 loops=1)
 Workers Planned: 4
 Workers Launched: 0
 ->  Partial Aggregate  (cost=314012.44..314012.45 rows=1
width=32) (actual time=27533.177..27533.178 rows=1 loops=1)
   ->  Nested Loop  (cost=0.44..309046.68 rows=1986303
width=4) (actual time=0.400..27390.835 rows=748912 loops=1)
 ->  Parallel Seq Scan on lineitem
(cost=0.00..154513.66 rows=4120499 width=12) (actual
time=0.044..7910.399 rows=16243662 loops=1)
   Filter: (l_shipdate > '1995-03-09'::date)
   Rows Removed by Filter: 13756133
 ->  Result Cache  (cost=0.44..0.53 rows=1
width=4) (actual time=0.001..0.001 rows=0 loops=16243662)
   Cache Key: lineitem.l_orderkey
   Hits: 12085749  Misses: 4157913  Evictions:
3256424  Overflows: 0  Memory Usage: 65537kB
   Worker 0:  Hits: 0  Misses: 0  Evictions: 0
 Overflows: 0  Memory Usage: 0kB
   Worker 1:  Hits: 0  Misses: 0  Evictions: 0
 Overflows: 0  Memory Usage: 0kB
   Worker 2:  Hits: 0  Misses: 0  Evictions: 0
 Overflows: 0  Memory Usage: 0kB
   Worker 3:  Hits: 0  Misses: 0  Evictions: 0
 Overflows: 0  Memory Usage: 0kB
   ->  Index Scan using orders_pkey on orders
(cost=0.43..0.52 rows=1 width=4) (actual time=0.002..0.002 rows=0
loops=4157913)
 Index Cond: (o_orderkey = lineitem.l_orderkey)
 Filter: (o_orderdate < '1995-03-09'::date)
 Rows Removed by Filter: 1
 Planning Time: 0.211 ms
 Execution Time: 27553.477 ms
(22 rows)

Looking at other cases like show_sort_info() or the printing of
per-worker JIT info, the general policy seems to be to skip printing
info for workers that are not launched. Attached is a patch that does
the same for Result Cache.

I was earlier thinking about using (instrument[n].nloops == 0) to
check for not-launched workers. But we are already using "if
(rcstate->stats.cache_misses == 0)" for the leader process, so for
consistency I used the same method for workers.

-- 
Thanks,
-Amit Khandekar
Huawei Technologies


skip_notlaunched_workers.patch
Description: Binary data


Re: Speeding up GIST index creation for tsvectors

2021-03-19 Thread Amit Khandekar
On Thu, 11 Mar 2021 at 04:25, John Naylor  wrote:
> Okay, so it's hard-coded in various functions in contrib modules. In that
> case, let's just keep the existing coding for those. In fact, the comments
> that got removed by your patch specifically say: /* Using the popcount
> functions here isn't likely to win */ ...which your testing confirmed. The
> new function should be used only for Gist and possibly Intarray, whose
> default is 124. That means we can't get rid of hemdistsign(), but that's
> fine.

The comment is there for all types. Since I see better performance
for all the types, I have kept the pg_xorcount() call for all these
contrib modules.  I understand that since the default siglen is small
for some types, we won't see a benefit there. But I think we should call
pg_xorcount() for the benefit of the non-default siglen case.

I have replaced hemdistsign() with pg_xorcount() everywhere; but with
that, the call looks a bit clumsy because of having to cast the
parameters to const unsigned char *.  I noticed that the cast to
"unsigned char *" is required only when we use the value in the
pg_number_of_ones array. Elsewhere we can safely use 'a' and 'b' as
"char *". So I changed the pg_xorcount() parameters from unsigned char
* to char *.

> I think the way to go is a simplified version of your 0001 (not 0002), with
> only a single function, for gist and intarray only, and a style that better
> matches the surrounding code. If you look at my xor functions in the attached
> text file, you'll get an idea of what it should look like. Note that it got
> the above performance without ever trying to massage the pointer alignment.
> I'm a bit uncomfortable with the fact that we can't rely on alignment, but
> maybe there's a simple fix somewhere in the gist code.

In the attached 0001-v3 patch, I have updated the code to match it
with surrounding code; specifically the usage of *a++ rather than
a[i].

Regarding the alignment changes... I have removed the code that
handled the leading identically unaligned bytes, for lack of evidence
that percentage of such cases is significant. Like I noted earlier,
for the tsearch data I used, identically unaligned cases were only 6%.
If I find scenarios where these cases can be significant after all and
if we cannot do anything in the gist index code, then we might have to
bring back the unaligned byte handling. I didn't get a chance to dig
deeper into the gist index implementation to see why they are not
always 8-byte aligned.

I have kept the 0002 patch as-is. Due to significant *additional*
speedup, over and above the 0001 improvement, I think the code
re-arrangement done is worth it for non-x86 platforms.

-Amit Khandekar
From c6ae575dd483b85cec6748c2e014d0f32565b4eb Mon Sep 17 00:00:00 2001
From: Amit Khandekar 
Date: Fri, 19 Mar 2021 20:22:44 +0800
Subject: [PATCH 1/2] Speed up xor'ing of two gist index signatures for
 tsvectors

In hemdistsign(), rather than using xor operator on char values, use
it in 64-bit chunks. And since the chunks are 64-bit, use popcount64()
on each of the chunks. I have checked that the two bitvector pointer
arguments of hemdistsign() are not always 64-bit aligned. So do the
64-bit chunks only if both pointers are 8-byte aligned.

This results in speed-up in Gist index creation for tsvectors. With
default siglen (124), the speed up is 12-20%. With siglen=700, it is
30-50%. So with longer signature lengths, we get higher percentage
speed-up.

Similar results are seen in other types using gist index, such as
intarray, hstore, and ltree that are available in contrib.

With smaller siglens such as 10, 20 etc, there is a bit of a reduction
in speed by 1-7% if we use this optimization. It's probably because of
an extra function call for pg_xorcount(); and also might be due to
the extra logic in pg_xorcount(). So for siglen less than 32,
keep the existing method using byte-by-byte traversal.
---
 contrib/hstore/hstore_gist.c  | 17 ++--
 contrib/intarray/_intbig_gist.c   | 18 +---
 contrib/ltree/_ltree_gist.c   | 19 ++---
 src/backend/utils/adt/tsgistidx.c | 26 +--
 src/include/port/pg_bitutils.h| 17 
 src/port/pg_bitutils.c| 34 +++
 6 files changed, 61 insertions(+), 70 deletions(-)

diff --git a/contrib/hstore/hstore_gist.c b/contrib/hstore/hstore_gist.c
index 102c9cea72..4970a0a2f0 100644
--- a/contrib/hstore/hstore_gist.c
+++ b/contrib/hstore/hstore_gist.c
@@ -8,6 +8,7 @@
 #include "access/stratnum.h"
 #include "catalog/pg_type.h"
 #include "hstore.h"
+#include "port/pg_bitutils.h"
 #include "utils/pg_crc.h"
 
 /* gist_hstore_ops opclass options */
@@ -256,20 +257,6 @@ sizebitvec(BITVECP sign, int siglen)
 	return size;
 }
 
-static int
-hemdistsign(BITVECP a, BITVECP b, int siglen)
-{
-	int			i,
-				dist = 0;
-
-	L

Re: [POC] verifying UTF-8 using SIMD instructions

2021-03-12 Thread Amit Khandekar
On Tue, 9 Mar 2021 at 17:14, John Naylor  wrote:
> On Tue, Mar 9, 2021 at 5:00 AM Amit Khandekar  wrote:
> > Just a quick question before I move on to review the patch ... The
> > improvement looks like it is only meant for x86 platforms.
>
> Actually it's meant to be faster for all platforms, since the C fallback is 
> quite a bit different from HEAD. I've found it to be faster on ppc64le. An 
> earlier version of the patch was a loser on 32-bit Arm because of alignment 
> issues, but if you could run the test script attached to [1] on 64-bit Arm, 
> I'd be curious to see how it does on 0002, and whether 0003 and 0004 make 
> things better or worse. If there is trouble building on non-x86 platforms, 
> I'd want to fix that also.

On my Arm64 VM :

HEAD :
 mixed | ascii
---+---
  1091 |   628
(1 row)

PATCHED :
 mixed | ascii
---+---
   681 |   119

So the fallback function does show improvements on Arm64.

I guess that if we use the equivalent Arm NEON intrinsics, the
"mixed" figures will come close to the "ascii" figures, going by your
figures on x86.

> > Can this be
> > done in a portable way by arranging for auto-vectorization ? Something
> > like commit 88709176236caf. This way it would benefit other platforms
> > as well.
>
> I'm fairly certain that the author of a compiler capable of doing that in 
> this case would be eligible for some kind of AI prize. :-)

:)

I was not thinking about auto-vectorizing the code in
pg_validate_utf8_sse42(). Rather, I was considering auto-vectorization
inside the individual helper functions that you wrote, such as
_mm_setr_epi8(), shift_right(), bitwise_and(), prev1(), splat(),
saturating_sub() etc. I myself am not sure whether it is feasible to
write code that auto-vectorizes all these function definitions.
saturating_sub() seems hard, but I could see the gcc docs mentioning
support for generating such instructions for a particular code loop.
But for the index lookup function it seems impossible to generate
the needed index-lookup intrinsics. We can have platform-specific
function definitions for such exceptional cases.

I am considering this only because that would make the exact code work
on other platforms like arm64 and ppc, and won't have to have
platform-specific files. But I understand that it is easier said than
done. We will have to process the loop in pg_validate_utf8_sse42() in
128-bit chunks, and pass each chunk to individual functions, which
could mean extra work and extra copy in extracting the chunk data and
passing it around, which may make things drastically slow. You are
passing around the chunks using __m128i type, so perhaps it means
passing around just a reference to the simd registers. Not sure.


-- 
Thanks,
-Amit Khandekar
Huawei Technologies




Re: [POC] verifying UTF-8 using SIMD instructions

2021-03-09 Thread Amit Khandekar
Hi,

Just a quick question before I move on to review the patch ... The
improvement looks like it is only meant for x86 platforms. Can this be
done in a portable way by arranging for auto-vectorization ? Something
like commit 88709176236caf. This way it would benefit other platforms
as well.

I tried to compile the following code using -O3, and the assembly does
have vectorized instructions.

#include <stdio.h>

int main()
{
	int i;
	char s1[200] = "abcdewhruerhetr";
	char s2[200] = "oweurietiureuhtrethre";
	char s3[200] = {0};

	for (i = 0; i < sizeof(s1); i++)
	{
		s3[i] = s1[i] ^ s2[i];
	}

	printf("%s\n", s3);
}




Re: Speeding up GIST index creation for tsvectors

2021-03-08 Thread Amit Khandekar
On Wed, 3 Mar 2021 at 23:32, John Naylor  wrote:
> Your performance numbers look like this is a fruitful area to improve. I have 
> not yet tested performance, but I will do so at a later date.

Thanks for reviewing the patch !

> I did some
> microbenchmarking of our popcount implementation, since I wasn't quite sure
> it's optimal, and indeed, there is room for improvement there [1]. I'd be
> curious to hear your thoughts on those findings. I think it'd be worth it to
> test a version of this patch using those idioms here as well, so at some
> point I plan to post something.

I am not yet clear about the implications of that work on this patch
set here, but I am still going over it, and will reply to that.

>
> Now for the patches:
>
> 0001:
>
> + /*
> + * We can process 64-bit chunks only if both are mis-aligned by the same
> + * number of bytes.
> + */
> + if (b_aligned - b == a_aligned - a)
>
> The obvious question here is: how often are they identically misaligned? You
> don't indicate that your measurements differ in a bimodal fashion, so does
> that mean they happen to be always (mis)aligned?

I ran CREATE INDEX on tsvector columns using the tsearch.data that I
had attached upthread, with some instrumentation; here are the
proportions:
1. In 15% of the cases, only one among a and b was aligned. The other
was offset from the 8-byte boundary by 4 bytes.
2. In 6% of the cases, both were offset by 4 bytes, i.e. identically
misaligned.
3. In the rest of the cases, both were aligned.

With other types, and with different sets of data, I believe I can get
totally different proportions.

> Is that an accident of the gist coding and could change unexpectedly?
> And how hard would it be to
> allocate those values upstream so that the pointers are always aligned on
> 8-byte boundaries? (I imagine pretty hard, since there are multiple callers,
> and such tight coupling is not good style)

That I am not sure about; I haven't clearly understood the structure of
gist indexes yet. I believe it would depend on the individual gist
implementation, and we can't make assumptions about that.


> + /* For smaller lengths, do simple byte-by-byte traversal */
> + if (bytes <= 32)
>
> You noted upthread:
>
> > Since for the gist index creation of some of these types the default
> > value for siglen is small (8-20), I tested with small siglens. For
> > siglens <= 20, particularly for values that are not multiples of 8
> > (e.g. 10, 13, etc.), I see a 1-7% reduction in speed of index
> > creation. It's probably because of
> > an extra function call for pg_xorcount(); and also might be due to the
> > extra logic in pg_xorcount() which becomes prominent for shorter
> > traversals. So for siglen less than 32, I kept the existing method
> > using byte-by-byte traversal.
>
> I wonder if that can be addressed directly, while cleaning up the loops and
> alignment checks in pg_xorcount_long() a little. For example, have a look at
> pg_crc32c_armv8.c -- it might be worth testing a similar approach.

Yeah, we can put the bytes <= 32 condition inside pg_xorcount_long().
I avoided that to not hamper the <= 32 scenarios. Details explained
below for "why inline pg_xorcount is calling global function"

> Also, pardon my ignorance, but where can I find the default siglen for 
> various types?
Check SIGLEN_DEFAULT.

>
> +/* Count the number of 1-bits in the result of xor operation */
> +extern uint64 pg_xorcount_long(const unsigned char *a, const unsigned char 
> *b,
> +   int bytes);
> +static inline uint64 pg_xorcount(const unsigned char *a, const unsigned char 
> *b,
> + int bytes)
>
> I don't think it makes sense to have a static inline function call a global 
> function.

As you may note, the global function will be called only in the subset
of cases where siglen > 32. Yes, we could put the bytes <= 32
condition inside pg_xorcount_long(), but I avoided that so as not to
hamper the <= 32 scenarios: it would add a function call for these
small-siglen cases. The idea was: use the function call only for
cases where we know that the function call overhead will be masked by
the popcount() optimization.


>
> -static int
> +static inline int
>  hemdistsign(BITVECP a, BITVECP b, int siglen)
>
>  Not sure what the reason is for inlining all these callers.
> Come to think of it, I don't see a reason to have hemdistsign()
> at all anymore. All it does is call pg_xorcount(). I suspect that's
> what Andrey Borodin meant when he said upthread:
>
>  > > > Meanwhile there are at least 4 incarnation of hemdistsign()
> > > > functions that are quite similar. I'd propose to refactor them 
> > > > somehow...

I had something in mind when I finally decided to not remove
hemdistsign(). Now I think you are right, we can remove hemdistsign()
altogether. Let me check again.






> 0002:
>
> I'm not really happy with this patch. I like the idea of keeping indirect
> calls out of non-x86 platforms, but I believe it could be done more simply.

I am open for 

Re: Speeding up GIST index creation for tsvectors

2021-02-28 Thread Amit Khandekar
I have added this one in the March commitfest.
https://commitfest.postgresql.org/32/3023/




Re: Speeding up GIST index creation for tsvectors

2021-01-27 Thread Amit Khandekar
On Tue, 15 Dec 2020 at 20:34, Amit Khandekar  wrote:
>
> On Sun, 13 Dec 2020 at 9:28 PM, Andrey Borodin  wrote:
> > +1
> > This will make all INSERTs and UPDATEs for tsvector's GiSTs faster.
>
> Oh, I didn't realize that this code is getting used in GIST index
> insertion and creation too. Will check there.

I ran some insert and update tests; they show only marginal
improvement. So it looks like the patch mainly improves index builds.

> > Meanwhile there are at least 4 incarnation of hemdistsign() functions that 
> > are quite similar. I'd propose to refactor them somehow...
>
> Yes, I hope we get the benefit there also. Before that, I thought I
> should post the first use-case to get some early comments. Thanks for
> your encouraging comments :)

The attached v2 version of 0001 patch extends the hemdistsign()
changes to the other use cases like  intarray, ltree and hstore. I see
the same index build improvement for all these types.

Since for the gist index creation of some of these types the default
value for siglen is small (8-20), I tested with small siglens. For
siglens <= 20, particularly for values that are not multiples of 8
(e.g. 10, 13, etc.), I see a 1-7% reduction in speed of index
creation. It's probably because of
an extra function call for pg_xorcount(); and also might be due to the
extra logic in pg_xorcount() which becomes prominent for shorter
traversals. So for siglen less than 32, I kept the existing method
using byte-by-byte traversal.

--
Thanks,
-Amit Khandekar
Huawei Technologies
From b8905b47f7e0af8d4558fdac73cc2283c263cbcc Mon Sep 17 00:00:00 2001
From: Amit Khandekar 
Date: Thu, 10 Dec 2020 21:45:49 +0800
Subject: [PATCH 2/2] Avoid function pointer dereferencing for
 pg_popcount32/64() call.

With USE_POPCNT_ASM set (which is true only on x86), we decide at
runtime with the help of __get_cpuid() whether the CPU supports the
popcnt instruction, and accordingly choose the corresponding function
dynamically; thus pg_popcount32/64 is a function pointer. But on
platforms where USE_POPCNT_ASM is not set, we don't need dynamic
assignment of functions, yet we were still declaring pg_popcount64 as
a function pointer. So arrange for a direct function call, to get rid
of the function pointer dereference each time pg_popcount32/64 is
called.

To do this, when USE_POPCNT_ASM is not defined, define pg_popcount64
to another function name (pg_popcount64_nonasm) rather than a function
pointer. And make pg_popcount64_nonasm() a static inline function, so
that whenever pg_popcount64() is called, __builtin_popcount() gets
called directly. For platforms not supporting __builtin_popcount(),
continue using the slow version, as is the current behaviour.
pg_popcount64_slow() was a misnomer, because it was calling
__builtin_popcount(), which is not slow. Now with this commit,
pg_popcount64_slow() indeed contains only the slow code.

Tested this on ARM64: gist index creation for tsvectors speeds up by
~6% for the default siglen (=124), and by 12% with siglen=700.
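The two dispatch styles this commit message contrasts can be sketched
standalone as follows. The CPUID probe is stubbed out, and the names
mirror but are not the actual PostgreSQL symbols; this is an
illustration of the call styles only, not the patch itself.

```c
#include <stdint.h>
#include <assert.h>

static int pg_popcount64_choose(uint64_t w);    /* forward declaration */

/* Style 1: indirect call through a pointer. The first call resolves
 * the best implementation and rewrites the pointer, so every later
 * call pays an indirect-call penalty. */
static int (*pg_popcount64_indirect) (uint64_t) = pg_popcount64_choose;

static int
pg_popcount64_builtin(uint64_t w)
{
    return __builtin_popcountll(w);
}

static int
pg_popcount64_choose(uint64_t w)
{
    /* A real x86 build would probe CPUID for popcnt support here;
     * this sketch just installs the portable implementation. */
    pg_popcount64_indirect = pg_popcount64_builtin;
    return pg_popcount64_builtin(w);
}

/* Style 2: direct, inlinable call -- what the patch arranges on
 * platforms that never need runtime CPU detection. */
static inline int
pg_popcount64_nonasm(uint64_t w)
{
    return __builtin_popcountll(w);
}

#define pg_popcount64(w) pg_popcount64_nonasm(w)
```

With the macro form, the compiler can inline __builtin_popcountll() at
every call site, which is where the ~6-12% above comes from.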
---
 src/include/port/pg_bitutils.h | 59 ++
 src/port/pg_bitutils.c | 47 +--
 2 files changed, 67 insertions(+), 39 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 174df28e66..b039f94ee5 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -23,6 +23,19 @@ extern const uint8 pg_rightmost_one_pos[256];
 extern const uint8 pg_number_of_ones[256];
 #endif
 
+/*
+ * On x86_64, we can use the hardware popcount instruction, but only if
+ * we can verify that the CPU supports it via the cpuid instruction.
+ *
+ * Otherwise, we fall back to __builtin_popcount if the compiler has that,
+ * or a hand-rolled implementation if not.
+ */
+#ifdef HAVE_X86_64_POPCNTQ
+#if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID)
+#define USE_POPCNT_ASM 1
+#endif
+#endif
+
 /*
  * pg_leftmost_one_pos32
  *		Returns the position of the most significant set bit in "word",
@@ -208,8 +221,54 @@ pg_ceil_log2_64(uint64 num)
 }
 
 /* Count the number of one-bits in a uint32 or uint64 */
+
+/*
+ * With USE_POPCNT_ASM set (which is true only on x86), we decide at runtime
+ * whether the CPU supports popcnt instruction, and hence we dynamically choose
+ * the corresponding function; thus pg_popcount32/64 is a function pointer. But
+ * for platforms with USE_POPCNT_ASM not set, we don't have to use dynamic
+ * assignment of functions, so we arrange for direct function call so as to get
+ * rid of function pointer dereferencing each time pg_popcount32/64 is called.
+ */
+#ifdef USE_POPCNT_ASM
 extern int	(*pg_popcount32) (uint32 word);
 extern int	(*pg_popcount64) (uint64 word);
+#else
+#define pg_popcount32 pg_popcount32_nonasm
+#define pg_popcount64 pg_popcount64_nonasm
+#endif
+
+/* Slow versions of pg_popcount */
+#ifndef HAVE__BUILTIN_P

Re: Speeding up GIST index creation for tsvectors

2020-12-15 Thread Amit Khandekar
On Sun, 13 Dec 2020 at 9:28 PM, Andrey Borodin  wrote:
> +1
> This will make all INSERTs and UPDATES for tsvector's GiSTs.

Oh, I didn't realize that this code is getting used in GIST index
insertion and creation too. Will check there.

> Also I really like idea of taking advantage of hardware capabilities like 
> __builtin_* etc wherever possible.

Yes. Also, __builtin_popcount() uses SIMD vectorization (on arm64:
"cnt v0.8b, v0.8b"), so there's all the more reason to use it.
Over and above that, I had thought that if we could auto-vectorize the
byte-by-byte XOR operation and the popcount() call using compiler
optimizations, we would benefit further, but I didn't see any more
improvement. I hoped for a benefit because that would have allowed us
to process in 128-bit or 256-bit chunks, since the vector registers
are at least that long. Maybe gcc is not smart enough to translate
__builtin_popcount() into 128/256-bit vectorized instructions. For the
XOR operator, though, it does translate to 128-bit vectorized
instructions (on arm64: "eor v2.16b, v2.16b, v18.16b")
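The experiment above, in minimal form: with gcc -O3 (or
-ftree-vectorize) the XOR loop gets vectorized into 128-bit operations
("eor v.16b" on arm64), while the __builtin_popcount() loop typically
stays scalar, which is consistent with seeing no further improvement.
This is an illustrative sketch, not the gist code; compile it with -O3
and inspect the generated assembly to see the difference.

```c
#include <stddef.h>
#include <assert.h>

/* Two-pass version of the signature distance: the first loop is
 * auto-vectorizable, the second usually is not. */
int
xor_then_popcount(const unsigned char *a, const unsigned char *b,
                  unsigned char *tmp, size_t n)
{
    int     cnt = 0;
    size_t  i;

    for (i = 0; i < n; i++)     /* compilers vectorize this XOR pass */
        tmp[i] = a[i] ^ b[i];
    for (i = 0; i < n; i++)     /* popcount pass typically stays scalar */
        cnt += __builtin_popcount((unsigned int) tmp[i]);
    return cnt;
}
```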

> Meanwhile there are at least 4 incarnation of hemdistsign() functions that 
> are quite similar. I'd propose to refactor them somehow...

Yes, I hope we get the benefit there also. Before that, I thought I
should post the first use-case to get some early comments. Thanks for
your encouraging comments :)




Re: Speeding up GIST index creation for tsvectors

2020-12-13 Thread Amit Khandekar
On Thu, 10 Dec 2020 at 20:43, Pavel Borisov  wrote:
>
> Hi, Amit!
> It's really cool to hear about another GiST improvement proposal. I'd like to 
> connect recently committed GiST ordered build discussion here [1] and further 
> improvement proposed [2]
>
> I've tested feature [1] and got 2.5-3 times speed improvement which is much 
> better I believe.

Yeah, I am completely new to the GIST stuff, but I had taken a quick
look at the sortsupport feature for GIST, and found it very
interesting. I believe it's an additional option for making the gist
index builds much faster. But then I thought that my small patch would
still be worthwhile because for tsvector types the non-sort method for
index build would continue to be used by users, and in general we can
extend this small optimization for other gist types also.

> There is an ongoing activity [2] to build support for different data types 
> for GiST. Maybe you will consider it interesting to join.
>
> BTW you may have heard about Gin and Rum [3] indexes which suit text search 
> much, much better (and faster) than GiST. The idea to process data in bigger 
> chunks is good. Still optimize index structure, minimizing disc pages access, 
> etc. seems better in many cases.

Sure. Thanks for the pointers.

-- 
Thanks,
-Amit Khandekar
Huawei Technologies




Re: Improving spin-lock implementation on ARM.

2020-12-06 Thread Amit Khandekar
On Sat, 5 Dec 2020 at 02:55, Alexander Korotkov  wrote:
>
> On Wed, Dec 2, 2020 at 6:58 AM Krunal Bauskar  wrote:
> > Let me know what do you think about this analysis and any specific 
> > direction that we should consider to help move forward.
>
> BTW, it would be also nice to benchmark my lwlock patch on the
> Kunpeng.  I'm very optimistic about this patch, but it wouldn't be
> fair to completely throw it away.  It still might be useful for
> LSE-disabled builds.

Coincidentally I was also looking at some hotspot locations around
LWLockAcquire() and LWLockAttempt() for read-only work-loads on both
arm64 and x86, and had narrowed down to the place where
pg_atomic_compare_exchange_u32() is called. So it's likely we are
working on the same hotspot area. When I get a chance, I do plan to
look at your patch while I myself am trying to see if we can do some
optimizations. Although, this is unrelated to the optimization of this
mail thread, so this will need a different mail thread.



-- 
Thanks,
-Amit Khandekar
Huawei Technologies




Re: Improving spin-lock implementation on ARM.

2020-12-01 Thread Amit Khandekar
On Tue, 1 Dec 2020 at 15:33, Krunal Bauskar  wrote:
> What I meant was outline-atomics support was added in GCC-9.4 and was made 
> default in gcc-10.
> LSE support is present for quite some time.

FWIW, here is an earlier discussion on the same (also added the
proposal author here) :

https://www.postgresql.org/message-id/flat/099F69EE-51D3-4214-934A-1F28C0A1A7A7%40amazon.com




Re: Improving spin-lock implementation on ARM.

2020-11-25 Thread Amit Khandekar
On Thu, 26 Nov 2020 at 10:55, Krunal Bauskar  wrote:
> Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for 
> server and 8 for client) [2 numa nodes]
> Storage: 3.2 TB NVMe SSD
> OS: CentOS Linux release 7.6
> PGSQL: baseline = Release Tag 13.1
> Invocation suite: 
> https://github.com/mysqlonarm/benchmark-suites/tree/master/pgsql-pbench (Uses 
> pgbench)

Using the same hardware, attached are my improvement figures, which
are pretty much in line with yours. Except that I did not run with
more than 400 clients. And I am getting some improvement even for
select-only workloads, in the case of 200-400 clients. For the
read-write load, I had seen that the s_lock() contention was caused
when XLogFlush() uses the spinlock. But for the read-only case, I have
not analyzed where the improvement comes from.

The .png files in the attached tar have the graphs for head versus patch.

The GUCs that I changed :

work_mem=64MB
shared_buffers=128GB
maintenance_work_mem = 1GB
min_wal_size = 20GB
max_wal_size = 100GB
checkpoint_timeout = 60min
checkpoint_completion_target = 0.9
full_page_writes = on
synchronous_commit = on
effective_io_concurrency = 200
log_checkpoints = on

For backends, 64 CPUs were allotted (covering 2 NUMA nodes), and for
pgbench clients a separate set of 28 CPUs was allotted on a different
socket. The server was pre-warmed using pg_prewarm().


results.tar.gz
Description: application/gzip


Re: Redundant tuple copy in tqueueReceiveSlot()

2020-09-18 Thread Amit Khandekar
On Thu, 17 Sep 2020 at 08:55, Andres Freund  wrote:
>
> Hi,
>
> On 2020-09-17 14:20:50 +1200, Thomas Munro wrote:
> > I wonder if there is a way we could make "Minimal Tuples but with the
> > length travelling separately (and perhaps chopped off?)" into a
> > first-class concept...  It's also a shame to be schlepping a bunch of
> > padding bytes around.

Yeah, I think we can pass the "length" separately, but since the
receiver end already assumes that the received data is a minimal
tuple, I thought why not skip passing this redundant component. But
anyway, if you and Andres are suggesting that being able to skip the
copy is important for virtual tuples as well, then I think the
approach you suggested (supplying allocated memory to the tuple API
for conversion) would be one of the better options available, if not
the only good one. Maybe I will look into the shm_mq internals to see
if we can come up with a good solution.

>
> There really is no justification for having MinimalTuples, as we have
> them today at least, anymore. We used to rely on being able to construct
> pointers to MinimalTuples that are mostly compatible with HeapTuple. But
> I think there's none of those left since PG 12.

Ah ok.

>
> I think it'd make a bit more sense to do some steps towards having a
> more suitable "minimal" tuple representation, rather than doing this
> local, pretty ugly, hacks. A good way would be to just starting to
> remove the padding, unnecessary fields etc from MinimalTuple.

So there are two things we wish to do:
1. Prevent an extra tuple-forming step before sending minimal tuple
data. Possibly devise a shm_mq API to get memory to write a tuple of a
given length, and devise something like
FormMinimalTupleDataInHere(memory_allocated_by_shm_mq) which will
write the minimal tuple data.
2. Shrink the MinimalTupleData structure, because it no longer needs
the current padding etc., and substitute this new MinimalTuple
structure for the current one wherever it is used in the code.

If we remove the unnecessary fields from the tuple data being sent to
the Gather node, then we need to form a MinimalTuple again at the
receiving end, which adds back an extra tuple-forming step. So I
understand that's the reason why you are saying we should shrink the
MinimalTupleData structure itself, in which case we will continue to
use the received new MinimalTupleData as an already-formed tuple, like
we are doing now.

Now, the above two things (1 and 2) look independent to me. Suppose we
first do 1, i.e. we come up with a good way to form an in-place
MinimalTuple at the sender's end, without any change to
MinimalTupleData. Then when we do 2, i.e. shrink MinimalTupleData, we
won't require any change in the in-place tuple-forming API we wrote in
1; just the existing underlying function heap_form_minimal_tuple() or
something similar might need to be changed. At least that's what I
feel right now.
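A rough standalone sketch of what the step-1 contract could look like,
with a plain buffer standing in for an shm_mq slot. The function name
and the success-or-needed-bytes protocol are assumptions taken from
this thread, not actual PostgreSQL APIs; the point is that step 2
(shrinking MinimalTupleData) would change only the body of such a
function, not its contract.

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Form the tuple directly into caller-supplied queue memory.
 * Returns 1 on success; returns 0 and reports the required size via
 * *needed when the buffer is too small ("hey, that buffer's too
 * small, I need N bytes"). */
static int
form_minimal_tuple_in_here(char *dst, uint32_t dstlen,
                           const char *tupdata, uint32_t tuplen,
                           uint32_t *needed)
{
    *needed = tuplen;
    if (tuplen > dstlen)
        return 0;               /* caller must retry with a bigger buffer */

    /* One write straight into the destination: no intermediate
     * palloc-copy-free cycle, which is the copy this thread wants
     * to eliminate. */
    memcpy(dst, tupdata, tuplen);
    return 1;
}
```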

>
> I also think that it'd be good to look at a few of the other places that
> are heavily bottlenecked by MinimalTuple overhead before designing new
> API around this. IIRC it's e.g. very easy to see hash joins spending a
> lot of time doing MinimalTuple copies & conversions.

Yeah, makes sense. The above FormMinimalTupleDataInHere() should be
usable for those other places as well. Will keep that in mind.

>
> > > Thomas, I guess you had a different approach in mind when you said
> > > about "returning either success or
> > > hey-that-buffer's-too-small-I-need-N-bytes".  But what looks clear to
> >
> > Yeah I tried some things like that, but I wasn't satisfied with any of
> > them; basically the extra work involved in negotiating the size was a
> > bit too high.

Hmm, ok. Let me see if there is any way around this.

>> On the other hand, because my interface was "please
> > write a MinimalTuple here!", it had the option to *form* a
> > MinimalTuple directly in place, whereas your approach can only avoid
> > creating and destroying a temporary tuple when the source is a heap
> > tuple.
True.
>
> There's a lot of cases where the source is a virtual slot (since we'll
> often project stuff below Gather). So it'd be quite advantageous to
> avoid building an unnecessary HeapTuple in those cases.

Yeah right.




Re: calling procedures is slow and consumes extra much memory against calling function

2020-09-17 Thread Amit Khandekar
On Thu, 17 Sep 2020 at 11:07, Michael Paquier  wrote:
>
> On Thu, Jul 16, 2020 at 09:08:09PM +0200, Pavel Stehule wrote:
> > I am sending another patch that tries to allow CachedPlans for CALL
> > statements. I think this patch is very accurate, but it is not nice,
> > because it is smudging very precious reference counting for CachedPlans.
>
> Amit, you are registered as a reviewer of this patch for two months
> now.  Are you planning to look at it more?  If you are not planning to
> do so, that's fine, but it may be better to remove your name as
> reviewer then.

Thanks Michael for the reminder. I *had* actually planned to do some
more review, but I think I might end up not getting bandwidth for this
one during this month, so I have removed my name. But I have kept my
name as reviewer for "bitmaps and correlation"
(https://commitfest.postgresql.org/29/2310/) since I do plan to do
some review on that one.

Thanks,
-Amit Khandekar
Huawei Technologies




Redundant tuple copy in tqueueReceiveSlot()

2020-09-08 Thread Amit Khandekar
Hi,

I am starting a new thread that continues with the following point
that was discussed in [1] 

On Fri, 17 Jul 2020 at 09:05, Thomas Munro  wrote:
>
> On Sun, Jul 12, 2020 at 7:25 AM Soumyadeep Chakraborty
>  wrote:
> > Do you mean that we should have an implementation for
> > get_minimal_tuple() for the heap AM and have it return a pointer to the
> > minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as
> > tqueueReceiveSlot() will ensure that the heap tuple from which it wants
> > to extract the minimal tuple was allocated in the tuple queue in the
> > first place? If we consider that the node directly below a gather is a
> > SeqScan, then we could possibly, in ExecInitSeqScan() set-up the
> > ss_ScanTupleSlot to point to memory in the shared tuple queue?
> > Similarly, other ExecInit*() methods can do the same for other executor
> > nodes that involve parallelism? Of course, things would be slightly
> > different for
> > the other use cases you mentioned (such as hash table population)
>
> What I mean is that where ExecHashTableInsert() and
> tqueueReceiveSlot() do ExecFetchSlotMinimalTuple(), you usually get a
> freshly allocated copy, and then you copy that again, and free it.
> There may be something similar going on in tuplestore and sort code.
> Perhaps we could have something like
> ExecFetchSlotMinimalTupleInPlace(slot, output_buffer,
> output_buffer_size) that returns a value that indicates either success
> or hey-that-buffer's-too-small-I-need-N-bytes, or something like that.
> That silly extra copy is something Andres pointed out to me in some
> perf results involving TPCH hash joins, a couple of years ago.

I went ahead and tried doing this. I chose an approach where we can
return the pointer to the in-place minimal tuple data if it's a
heap/buffer/minimal tuple slot. A new function
ExecFetchSlotMinimalTupleData() returns in-place minimal tuple data.
If it's neither a heap, buffer, nor minimal tuple slot, it returns a
copy as usual. The receiver should not assume the data is directly
taken from a MinimalTupleData, so it should set its t_len to the
number of bytes returned. Patch attached
(0001-Avoid-redundant-tuple-copy-while-sending-tuples-to-G.patch)

Thomas, I guess you had a different approach in mind when you said
about "returning either success or
hey-that-buffer's-too-small-I-need-N-bytes".  But what looks clear to
me is that avoiding the copy shows consistent improvement of 4 to 10%
for simple parallel table scans. I tried my patch on both x86_64 and
arm64, and observed this speedup on both.

create table tab as select generate_series(1, 2000) id,
'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz' v;
select pg_prewarm('tab'::regclass);
explain analyze select * from tab where id %2 = 0;
Times in milli-secs :
HEAD : 1833.119  1816.522  1875.648  1872.153  1834.383
Patch'ed : 1763.786  1721.160  1714.665  1719.738  1704.478
This was with the default 2 parallel workers. With 3 or 4 workers, for
the above testcase I didn't see a noticeable difference. I think, if I
double the number of rows, the difference will be noticeable. In any
case, the gain would go on reducing with the number of workers,
because the tuple copy also gets parallelized. In some scenarios,
parallel_leader_participation=off causes the difference to amplify.

Haven't had a chance to see if this helps any of the TPC-H queries.

Also attached is a patch, guc_for_testing.patch, that I used for
testing the gain. This patch is only for testing. Without it,
comparing the performance figures requires a server restart, and the
figures anyway shift back and forth by 5-15 percent after each
restart, which creates a lot of noise when comparing figures with and
without the fix. Use the GUC enable_fix to enable/disable the fix.

[1] 
https://www.postgresql.org/message-id/CA%2BhUKGLrN2M18-hACEJbNoj2sn_WoUj9rkkBeoPK7SY427pAnA%40mail.gmail.com

-- 
Thanks,
-Amit Khandekar
Huawei Technologies
From 5d19626e35e50a5630e5f1a042f7ecee6acb7c70 Mon Sep 17 00:00:00 2001
From: Amit Khandekar 
Date: Wed, 9 Sep 2020 12:03:01 +0800
Subject: [PATCH 2/2] Add guc for ease of testing speedup.

This is only for testing the performance gain. Otherwise, comparing
the performance figures requires a server restart, and the figures
anyway shift back and forth by 5-15 percent after each restart, which
creates a lot of noise when comparing figures with and without the
fix.

With this, we can easily see at least 4-10% difference in execution
times by setting/unsetting the GUC enable_fix.
---
 src/backend/executor/tqueue.c | 12 
 src/backend/utils/misc/guc.c  | 12 
 2 files changed, 24 insertions(+)

diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c
index c70ee0f39a..7dd60b5c7e 100644
--- a/src/backend/executor/tqueue.c
+++ b/src/backend/executor/tqueue.c
@@

Re: Auto-vectorization speeds up multiplication of large-precision numerics

2020-09-07 Thread Amit Khandekar
On Tue, 8 Sep 2020 at 02:19, Tom Lane  wrote:
>
> I wrote:
> > I experimented with a few different ideas such as adding restrict
> > decoration to the pointers, and eventually found that what works
> > is to write the loop termination condition as "i2 < limit"
> > rather than "i2 <= limit".  It took me a long time to think of
> > trying that, because it seemed ridiculously stupid.  But it works.

Ah ok.

I checked the "Auto-Vectorization in LLVM" link that you shared. All
the examples use "< n" or "> n". None of them use "<= n". Looks like a
hidden restriction.
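A minimal way to see this is to compile the two loop forms below and
compare the vectorizer reports (gcc -O3 -fopt-info-vec, or clang
-Rpass=loop-vectorize). These are illustrative loops, not the
numeric.c code; both compute the same sums.

```c
#include <assert.h>

/* "<" termination: the trip count is trivially provable, so the loop
 * vectorizes under -O3 / -ftree-vectorize. */
void
acc_lt(int *dig, const int *a, int limit)
{
    for (int i2 = 0; i2 < limit; i2++)
        dig[i2] += a[i2];
}

/* "<=" termination: semantically equivalent (last == limit - 1), but
 * the compiler may fail to determine the number of iterations, as in
 * the clang remark quoted above. */
void
acc_le(int *dig, const int *a, int last)
{
    for (int i2 = 0; i2 <= last; i2++)
        dig[i2] += a[i2];
}
```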

>
> I've done more testing and confirmed that both gcc and clang can
> vectorize the improved loop on aarch64 as well as x86_64.  (clang's
> results can be confusing because -ftree-vectorize doesn't seem to
> have any effect: its vectorizer is on by default.  But if you use
> -fno-vectorize it'll go back to the old, slower code.)
>
> The only buildfarm effect I've noticed is that locust and
> prairiedog, which are using nearly the same ancient gcc version,
> complain
>
> c1: warning: -ftree-vectorize enables strict aliasing. -fno-strict-aliasing 
> is ignored when Auto Vectorization is used.
>
> which is expected (they say the same for checksum.c), but then
> there are a bunch of
>
> warning: dereferencing type-punned pointer will break strict-aliasing rules
>
> which seems worrisome.  (This sort of thing is the reason I'm
> hesitant to apply higher optimization levels across the board.)
> Both animals pass the regression tests anyway, but if any other
> compilers treat -ftree-vectorize as an excuse to apply stricter
> optimization assumptions, we could be in for trouble.
>
> I looked closer and saw that all of those warnings are about
> init_var(), and this change makes them go away:
>
> -#define init_var(v)MemSetAligned(v, 0, sizeof(NumericVar))
> +#define init_var(v)memset(v, 0, sizeof(NumericVar))
>
> I'm a little inclined to commit that as future-proofing.  It's
> essentially reversing out a micro-optimization I made in d72f6c750.
> I doubt I had hard evidence that it made any noticeable difference;
> and even if it did back then, modern compilers probably prefer the
> memset approach.

Thanks. I must admit it did not occur to me that I could very well
have installed clang on my Linux machine and tried compiling this
file, or tested with some older gcc versions. I think I was using gcc
8. Do you know which gcc version gave these warnings?

-- 
Thanks,
-Amit Khandekar
Huawei Technologies




Re: Auto-vectorization speeds up multiplication of large-precision numerics

2020-09-07 Thread Amit Khandekar
On Mon, 7 Sep 2020 at 11:23, Tom Lane  wrote:
>
> I wrote:
> > I made some cosmetic changes to this and committed it.

Thanks!

>
> BTW, poking at this further, it seems that the patch only really
> works for gcc.  clang accepts the -ftree-vectorize switch, but
> looking at the generated asm shows that it does nothing useful.
> Which is odd, because clang does do loop vectorization.
>
> I tried adding -Rpass-analysis=loop-vectorize and got
>
> numeric.c:8341:3: remark: loop not vectorized: could not determine number of 
> loop iterations [-Rpass-analysis=loop-vectorize]
> for (i2 = 0; i2 <= i; i2++)

Hmm, yeah that's unfortunate. My guess is that the compiler would do
vectorization only if 'i' is a constant, which is not true for our
case.

-- 
Thanks,
-Amit Khandekar
Huawei Technologies




Re: Re: [HACKERS] Custom compression methods

2020-08-30 Thread Amit Khandekar
On Thu, 13 Aug 2020 at 17:18, Dilip Kumar  wrote:
> I have rebased the patch on the latest head and currently, broken into 3 
> parts.
>
> v1-0001: As suggested by Robert, it provides the syntax support for
> setting the compression method for a column while creating a table and
> adding columns.  However, we don't support changing the compression
> method for the existing column.  As part of this patch, there is only
> one built-in compression method that can be set (pglz).  In this, we
> have one in-build am (pglz) and the compressed attributes will directly
> store the oid of the AM.  In this patch, I have removed the
> pg_attr_compresion as we don't support changing the compression
> for the existing column so we don't need to preserve the old
> compressions.
> v1-0002: Add another built-in compression method (zlib)
> v1:0003: Remaining patch set (nothing is changed except rebase on the
> current head, stabilizing check-world  and  0001 and 0002 are pulled
> out of this)
>
> Next, I will be working on separating out the remaining patches as per
> the suggestion by Robert.

Thanks for this new feature. Looks promising and very useful, with so
many good compression libraries already available.

I see that with the patch-set, I would be able to create an extension
that defines a PostgreSQL C handler function which assigns all the
required hook function implementations for compressing, decompressing
and validating, etc. In short, I would be able to use a completely
different compression algorithm to compress toast data if I write such
an extension. Correct me if I am wrong with my interpretation.

Just a quick, superficial set of review comments:

A minor re-base is required due to a conflict in a regression test

-

In heap_toast_insert_or_update() and in other places, the comments for
new parameter preserved_am_info are missing.

-

+toast_compress_datum(Datum value, Oid acoid)
 {
struct varlena *tmp = NULL;
int32 valsize;
-   CompressionAmOptions cmoptions;
+   CompressionAmOptions *cmoptions = NULL;

I think tmp and cmoptions need not be initialized to NULL

-

- TOAST_COMPRESS_SET_RAWSIZE(tmp, valsize);
- SET_VARSIZE_COMPRESSED(tmp, len + TOAST_COMPRESS_HDRSZ);
  /* successful compression */
+ toast_set_compressed_datum_info(tmp, amoid, valsize);
  return PointerGetDatum(tmp);

Any particular reason why this code is put in a new extern function?
Is there a plan to re-use it? Otherwise, it doesn't seem necessary.



Also, not sure why "HTAB *amoptions_cache" and "MemoryContext
amoptions_cache_mcxt" aren't static declarations. They are being used
only in toast_internals.c

---

The tab-completion doesn't show COMPRESSION :
postgres=# create access method my_method TYPE
INDEX  TABLE
postgres=# create access method my_method TYPE

Also, the syntax below would better be tab-completed as well, so as to
display all the installed compression methods, in line with how we
show all the storage methods like plain, extended, etc.:
postgres=# ALTER TABLE lztab ALTER COLUMN t SET COMPRESSION



I could see differences in compression ratio, and in compression and
decompression speed, when I use pglz versus zlib:

CREATE TABLE zlibtab(t TEXT COMPRESSION zlib WITH (level '4'));
create table lztab(t text);
ALTER TABLE lztab ALTER COLUMN t SET COMPRESSION pglz;

pgg:s2:pg$ time psql -c "\copy zlibtab from text.data"
COPY 13050

real0m1.344s
user0m0.031s
sys 0m0.026s

pgg:s2:pg$ time psql -c "\copy lztab from text.data"
COPY 13050

real0m2.088s
user0m0.008s
sys 0m0.050s


pgg:s2:pg$ time psql -c "select pg_table_size('zlibtab'::regclass),
pg_table_size('lztab'::regclass)"
 pg_table_size | pg_table_size
---+---
   1261568 |   1687552

pgg:s2:pg$ time psql -c "select NULL from zlibtab where t like ''"
 > /dev/null

real0m0.127s
user0m0.000s
sys 0m0.002s

pgg:s2:pg$ time psql -c "select NULL from lztab where t like ''"
> /dev/null

real0m0.050s
user0m0.002s
sys 0m0.000s


-- 
Thanks,
-Amit Khandekar
Huawei Technologies




Re: Auto-vectorization speeds up multiplication of large-precision numerics

2020-07-21 Thread Amit Khandekar
On Mon, 13 Jul 2020 at 14:27, Amit Khandekar  wrote:
> I tried this in utils/adt/Makefile :
> +
> +numeric.o: CFLAGS += ${CFLAGS_VECTOR}
> +
> and it works.
>
> CFLAGS_VECTOR also includes the -funroll-loops option, which I
> believe, had showed improvements in the checksum.c runs ( [1] ). This
> option makes the object file a bit bigger. For numeric.o, its size
> increased by 15K, from 116672 to 131360 bytes. I ran the
> multiplication test, and didn't see any additional speed-up with this
> option. Also, it does not seem to be related to vectorization. So I
> was thinking of splitting the CFLAGS_VECTOR into CFLAGS_VECTOR and
> CFLAGS_UNROLL_LOOPS. Checksum.c can use both these flags, and
> numeric.c can use only CFLAGS_VECTOR.

I did as above. Attached is the v2 patch.

As with the existing CFLAGS_VECTOR, an environment variable of that
name can be set when running configure. I did the same for
CFLAGS_UNROLL_LOOPS.

Now, developers who are already setting the CFLAGS_VECTOR environment
variable when running configure are presumably doing so because their
compilers don't have these compiler options, and they must be using
some equivalent options. numeric.c will now be compiled with
CFLAGS_VECTOR, so for them it will be compiled with their equivalent
of the vectorize and unroll-loops options, which is OK, I think. Just
that the numeric.o size will increase, that's all.

>
> [1] 
> https://www.postgresql.org/message-id/flat/CA%2BU5nML8JYeGqM-k4eEwNJi5H%3DU57oPLBsBDoZUv4cfcmdnpUA%40mail.gmail.com#2ec419817ff429588dd1229fb663080e




-- 
Thanks,
-Amit Khandekar
Huawei Technologies
From 029373ab2e9d2e6802326871f248ba2f21ffb204 Mon Sep 17 00:00:00 2001
From: Amit Khandekar 
Date: Tue, 21 Jul 2020 16:42:51 +0800
Subject: [PATCH] Auto-vectorize loop to speedup large-precision numeric
 product

A 'for' loop in mul_var() runs backwards by decrementing two
variables. This prevents the gcc compiler from auto-vectorizing the
for loop. So make it a forward loop with a single variable. This gives
performance benefits for product of numeric types with large
precision, with speedups becoming noticeable from values
with precisions starting from 20-40. A typical pattern of benefit is:
precision 50: 5%; precision 60: 11%; 120: 50%; 240: 2.2x; and so on.
On some CPU architectures, the speedup starts from 20 precision
onwards.
With the precisions used in the numeric_big regression test, the
multiplication speeds up by 2.5 to 2.7 times.

Auto-vectorization happens with the gcc -O3 flag or -ftree-loop-vectorize.
So arrange for the -ftree-loop-vectorize flag specifically when
compiling numeric.c. CFLAGS_VECTOR was already present for similar
functionality in checksum.c, but CFLAGS_VECTOR also includes
-funroll-loops, which unnecessarily makes numeric.o larger. So split
CFLAGS_VECTOR into CFLAGS_UNROLL_LOOPS and CFLAGS_VECTOR.
---
 configure | 17 +++--
 configure.in  |  9 +++--
 src/Makefile.global.in|  1 +
 src/backend/storage/page/Makefile |  2 +-
 src/backend/utils/adt/Makefile|  3 +++
 src/backend/utils/adt/numeric.c   | 11 ---
 6 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/configure b/configure
index cb8fbe1051..36ca734ad5 100755
--- a/configure
+++ b/configure
@@ -734,6 +734,7 @@ CPP
 CFLAGS_SL
 BITCODE_CXXFLAGS
 BITCODE_CFLAGS
+CFLAGS_UNROLL_LOOPS
 CFLAGS_VECTOR
 PERMIT_DECLARATION_AFTER_STATEMENT
 LLVM_BINPATH
@@ -5266,10 +5267,13 @@ BITCODE_CFLAGS=""
 user_BITCODE_CXXFLAGS=$BITCODE_CXXFLAGS
 BITCODE_CXXFLAGS=""
 
-# set CFLAGS_VECTOR from the environment, if available
+# set CFLAGS_VECTOR and CFLAGS_UNROLL_LOOPS from the environment, if available
 if test "$ac_env_CFLAGS_VECTOR_set" = set; then
   CFLAGS_VECTOR=$ac_env_CFLAGS_VECTOR_value
 fi
+if test "$ac_env_CFLAGS_UNROLL_LOOPS_set" = set; then
+  CFLAGS_UNROLL_LOOPS=$ac_env_CFLAGS_UNROLL_LOOPS_value
+fi
 
 # Some versions of GCC support some additional useful warning flags.
 # Check whether they are supported, and add them to CFLAGS if so.
@@ -6102,16 +6106,16 @@ if test x"$pgac_cv_prog_CXX_cxxflags__fexcess_precision_standard" = x"yes"; then
 fi
 
 
-  # Optimization flags for specific files that benefit from vectorization
-  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -funroll-loops, for CFLAGS_VECTOR" >&5
-$as_echo_n "checking whether ${CC} supports -funroll-loops, for CFLAGS_VECTOR... " >&6; }
+  # Optimization flags for specific files that benefit from loop unrolling
+  { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -funroll-loops, for CFLAGS_UNROLL_LOOPS" >&5
+$as_echo_n "checking whether ${CC} supports -funroll-loops, for CFLAGS_UNROLL_LOOPS... " >&6; }
 if ${pgac_cv_prog_CC_cflags__funroll_loops+:} false; then :
   $as_echo_n "(cached) " >

Re: Auto-vectorization speeds up multiplication of large-precision numerics

2020-07-13 Thread Amit Khandekar
On Fri, 10 Jul 2020 at 19:02, Tom Lane  wrote:
>
> Peter Eisentraut  writes:
> > We normally don't compile with -O3, so very few users would get the
> > benefit of this.
>
> Yeah.  I don't think changing that baseline globally would be a wise move.
>
> > We have CFLAGS_VECTOR for the checksum code.  I
> > suppose if we are making the numeric code vectorizable as well, we
> > should apply this there also.
>
> > I think we need a bit of a policy decision from the group here.
>
> I'd vote in favor of applying CFLAGS_VECTOR to specific source files
> that can benefit.  We already have experience with that and we've not
> detected any destabilization potential.

I tried this in utils/adt/Makefile :
+
+numeric.o: CFLAGS += ${CFLAGS_VECTOR}
+
and it works.

CFLAGS_VECTOR also includes the -funroll-loops option, which I
believe, had showed improvements in the checksum.c runs ( [1] ). This
option makes the object file a bit bigger. For numeric.o, it's size
increased by 15K; from 116672 to 131360 bytes. I ran the
multiplication test, and didn't see any additional speed-up with this
option. Also, it does not seem to be related to vectorization. So I
was thinking of splitting the CFLAGS_VECTOR into CFLAGS_VECTOR and
CFLAGS_UNROLL_LOOPS. Checksum.c can use both these flags, and
numeric.c can use only CFLAGS_VECTOR.

I was also wondering if it's worth to extract only the code that we
think can be optimized and keep it in a separate file (say
numeric_vectorize.c or adt_vectorize.c, which can have mul_var() to
start with), and use this file as a collection of all such code in the
adt module, and then we can add such files into other modules as and
when needed. For numeric.c, there can be already some scope for
auto-vectorizations in other similar regions in that file, so we don't
require a separate numeric_vectorize.c and just pass the CFLAGS_VECTOR
flag for this file itself.


[1] 
https://www.postgresql.org/message-id/flat/CA%2BU5nML8JYeGqM-k4eEwNJi5H%3DU57oPLBsBDoZUv4cfcmdnpUA%40mail.gmail.com#2ec419817ff429588dd1229fb663080e

-- 
Thanks,
-Amit Khandekar
Huawei Technologies




Re: calling procedures is slow and consumes extra much memory against calling function

2020-07-09 Thread Amit Khandekar
On Wed, 17 Jun 2020 at 13:54, Pavel Stehule  wrote:
>
>
>
> st 17. 6. 2020 v 7:52 odesílatel Amit Khandekar  
> napsal:
>>
>> On Wed, 10 Jun 2020 at 17:12, Pavel Stehule  wrote:
>> > st 10. 6. 2020 v 12:26 odesílatel Amit Khandekar  
>> > napsal:
>> >> Could you show an example testcase that tests this recursive scenario,
>> >> with which your earlier patch fails the test, and this v2 patch passes
>> >> it ? I am trying to understand the recursive scenario and the re-use
>> >> of expr->plan.
>> >
>> >
>> > it hangs on plpgsql tests. So you can apply first version of patch
>> >
>> > and "make check"
>>
>> I could not reproduce the make check hang with the v1 patch. But I
>> could see a crash with the below testcase. So I understand the purpose
>> of the plan_owner variable that you introduced in v2.
>>
>> Consider this recursive test :
>>
>> create or replace procedure p1(in r int) as $$
>> begin
>>RAISE INFO 'r : % ', r;
>>if r < 3 then
>>   call p1(r+1);
>>end if;
>> end
>> $$ language plpgsql;
>>
>> do $$
>> declare r int default 1;
>> begin
>> call p1(r);
>> end;
>> $$;
>>
>> In p1() with r=2, when the stmt "call p1(r+1)" is being executed,
>> consider this code of exec_stmt_call() with your v2 patch applied:
>> if (expr->plan && !expr->plan->saved)
>> {
>>if (plan_owner)
>>   SPI_freeplan(expr->plan);
>>expr->plan = NULL;
>> }
>>
>> Here, plan_owner is false. So SPI_freeplan() will not be called, and
>> expr->plan is set to NULL. Now I have observed that the stmt pointer
>> and expr pointer is shared between the p1() execution at this r=2
>> level and the p1() execution at r=1 level. So after the above code is
>> executed at r=2, when the upper level (r=1) exec_stmt_call() lands to
>> the same above code snippet, it gets the same expr pointer, but it's
>> expr->plan is already set to NULL without being freed. From this
>> logic, it looks like the plan won't get freed whenever the expr/stmt
>> pointers are shared across recursive levels, since expr->plan is set
>> to NULL at the lowermost level ? Basically, the handle to the plan is
>> lost so no one at the upper recursion level can explicitly free it
>> using SPI_freeplan(), right ? This looks the same as the main issue
>> where the plan does not get freed for non-recursive calls. I haven't
>> got a chance to check if we can develop a testcase for this, similar
>> to your testcase where the memory keeps on increasing.
>
>
> This is a good consideration.
>
> I am sending updated patch

Checked the latest patch. Looks like using a local plan rather than
expr->plan pointer for doing the checks does seem to resolve the issue
I raised. That made me think of another scenario :

Now we are checking for plan value and then null'ifying the expr->plan
value. What if expr->plan is different from plan ? Is it possible ? I
was thinking of such scenarios. But couldn't find one. As long as a
plan is always created with saved=true for all levels, or with
saved=false for all levels, we are ok. If we can have a mix of saved
and unsaved plans at different recursion levels, then expr->plan can
be different from the outer local plan because then the expr->plan
will not be set to NULL in the inner level, while the outer level may
have created its own plan. But I think a mix of saved and unsaved
plans are not possible. If you agree, then I think we should at least
have an assert that looks like :

if (plan && !plan->saved)
{
    if (plan_owner)
        SPI_freeplan(plan);

    /* If expr->plan is present, it must be the same plan that we allocated */
    Assert(!expr->plan || plan == expr->plan);

    expr->plan = NULL;
}

Other than this, I have no other issues. I understand that we have to
do this special handling only for this exec_stmt_call() because it is
only here that we call exec_prepare_plan() with keep_plan = false, so
doing special handling for freeing the plan seems to make sense.




Re: Auto-vectorization speeds up multiplication of large-precision numerics

2020-07-08 Thread Amit Khandekar
FYI : this one is present in the July commitfest.




Re: Inlining of couple of functions in pl_exec.c improves performance

2020-07-06 Thread Amit Khandekar
On Sat, 4 Jul 2020 at 01:19, Tom Lane  wrote:
>
> I did some performance testing on 0001+0002 here, and found that
> for me, there's basically no change on x86_64 but a win of 2 to 3
> percent on aarch64, using Pavel's pi_est_1() as a representative
> case for simple plpgsql statements.  That squares with your original
> results I believe.  It's not clear to me whether any of the later
> tests in this thread measured these changes in isolation, or only
> with 0003 added.

Yeah I had the same observation. 0001+0002 seems to benefit mostly on
aarch64. And 0003 (exec_cast_value) benefited on both amd64 and
aarch64.

>
> Anyway, that's good enough for me, so I pushed 0001+0002 after a
> little bit of additional cosmetic tweaking.
>
> I attach your original 0003 here (it still applies, with some line
> offsets).  That's just so the cfbot doesn't get confused about what
> it's supposed to test now.

Thanks for pushing all three !

Thanks,
-Amit Khandekar
Huawei Technologies




Re: Inlining of couple of functions in pl_exec.c improves performance

2020-07-03 Thread Amit Khandekar
On Thu, 2 Jul 2020 at 03:47, Tom Lane  wrote:
>
> Amit Khandekar  writes:
> > There are a couple of function call overheads I observed in pl/pgsql
> > code : exec_stmt() and exec_cast_value(). Removing these overheads
> > resulted in some performance gains.
>
> I took a look at the 0001/0002 patches (not 0003 as yet).  I do not
> like 0001 too much.  The most concrete problem with it is that
> you broke translatability of the error messages, cf the first
> translatability guideline at [1].

Yeah, I thought we could safely use %s for proper nouns such as "trigger
procedure" or "function", as those would not be translated. But it looks
like, even if they aren't translated, the difference in word order among
languages might still create problems with this.

> While that could be fixed by passing
> the entire message not just part of it, I don't see anything that we're
> gaining by moving that stuff into exec_toplevel_block in the first place.
> Certainly, passing a string that describes what will happen *after*
> exec_toplevel_block is just weird.  I think what you've got here is
> a very arbitrary chopping-up of the existing code based on chance
> similarities of the existing callers.  I think we're better off to make
> exec_toplevel_block be as nearly as possible a match for exec_stmts'
> semantics.

I thought some of the things I kept in exec_toplevel_block() did look
like they belong in a top-level function. But what you are saying
also makes sense : better to keep it similar to exec_stmts.

>
> Hence, I propose the attached 0001 to replace 0001/0002.  This should
> be basically indistinguishable performance-wise, though I have not
> tried to benchmark.

Thanks for the patches. Yeah, performance-wise it does look similar;
but anyways I tried running, and got similar performance numbers.

> Note that for reviewability's sake, I did not
> reindent the former body of exec_stmt, though we'd want to do that
> before committing.

Right.

>
> Also, 0002 is a small patch on top of that to avoid redundant saves
> and restores of estate->err_stmt within the loop in exec_stmts.  This
> may well not be a measurable improvement, but it's a pretty obvious
> inefficiency in exec_stmts now that it's refactored this way.

0002 also makes sense.




Re: calling procedures is slow and consumes extra much memory against calling function

2020-06-16 Thread Amit Khandekar
On Wed, 10 Jun 2020 at 17:12, Pavel Stehule  wrote:
> st 10. 6. 2020 v 12:26 odesílatel Amit Khandekar  
> napsal:
>> Could you show an example testcase that tests this recursive scenario,
>> with which your earlier patch fails the test, and this v2 patch passes
>> it ? I am trying to understand the recursive scenario and the re-use
>> of expr->plan.
>
>
> it hangs on plpgsql tests. So you can apply first version of patch
>
> and "make check"

I could not reproduce the make check hang with the v1 patch. But I
could see a crash with the below testcase. So I understand the purpose
of the plan_owner variable that you introduced in v2.

Consider this recursive test :

create or replace procedure p1(in r int) as $$
begin
   RAISE INFO 'r : % ', r;
   if r < 3 then
  call p1(r+1);
   end if;
end
$$ language plpgsql;

do $$
declare r int default 1;
begin
call p1(r);
end;
$$;

In p1() with r=2, when the stmt "call p1(r+1)" is being executed,
consider this code of exec_stmt_call() with your v2 patch applied:
if (expr->plan && !expr->plan->saved)
{
   if (plan_owner)
  SPI_freeplan(expr->plan);
   expr->plan = NULL;
}

Here, plan_owner is false. So SPI_freeplan() will not be called, and
expr->plan is set to NULL. Now I have observed that the stmt pointer
and expr pointer is shared between the p1() execution at this r=2
level and the p1() execution at r=1 level. So after the above code is
executed at r=2, when the upper level (r=1) exec_stmt_call() lands to
the same above code snippet, it gets the same expr pointer, but it's
expr->plan is already set to NULL without being freed. From this
logic, it looks like the plan won't get freed whenever the expr/stmt
pointers are shared across recursive levels, since expr->plan is set
to NULL at the lowermost level ? Basically, the handle to the plan is
lost so no one at the upper recursion level can explicitly free it
using SPI_freeplan(), right ? This looks the same as the main issue
where the plan does not get freed for non-recursive calls. I haven't
got a chance to check if we can develop a testcase for this, similar
to your testcase where the memory keeps on increasing.

-Amit




Re: Auto-vectorization speeds up multiplication of large-precision numerics

2020-06-10 Thread Amit Khandekar
On Wed, 10 Jun 2020 at 04:20, Peter Eisentraut
 wrote:
>
> On 2020-06-09 13:50, Amit Khandekar wrote:
> > Also, the regress/sql/numeric_big test itself speeds up by 80%
>
> That's nice.  I can confirm the speedup:
>
> -O3 without the patch:
>
>   numeric  ... ok  737 ms
> test numeric_big  ... ok 1014 ms
>
> -O3 with the patch:
>
>   numeric  ... ok  680 ms
> test numeric_big  ... ok  580 ms
>
> Also:
>
> -O2 without the patch:
>
>   numeric  ... ok  693 ms
> test numeric_big  ... ok 1160 ms
>
> -O2 with the patch:
>
>   numeric  ... ok  677 ms
> test numeric_big  ... ok  917 ms
>
> So the patch helps either way.

Oh, I hadn't observed that the patch helps numeric_big.sql speed up
to some extent even with -O2. For me, it takes 805 ms on head and 732 ms
with the patch. One possible reason I can think of is : because of
the forward loop traversal, pre-fetching might be helping. But this is
just a wild guess.

-O3 : HEAD
test numeric  ... ok  102 ms
test numeric_big  ... ok  803 ms

-O3 : patched :
test numeric  ... ok  100 ms
test numeric_big  ... ok  450 ms


-O2 : HEAD
test numeric  ... ok  100 ms
test numeric_big  ... ok  805 ms

-O2 patched :
test numeric  ... ok  103 ms
test numeric_big  ... ok  732 ms

> But it also seems that without the patch, -O3 might
> be a bit slower in some cases. This might need more testing.

For me, there is no observed change in the times with -O2 versus -O3
on head. Are you getting consistently slower numeric*.sql tests with
-O3 on the current code ? I am not sure what the reason might be.
But this is not related to the patch. Is it in the context of the
patch that you are suggesting it might need more testing ? There might
be existing cases that run a bit slower with -O3, but that's strange
actually. Probably the optimization in those cases does not work out
the way the compiler expected, and in fact works negatively. That's
probably one of the reasons -O2 is the default choice.


>
> > For the for loop to be auto-vectorized, I had to simplify it to
> > something like this :
>
> Well, how do we make sure we keep it that way?  How do we prevent some
> random rearranging of the code or some random compiler change to break
> this again?

I believe the compiler rearranges code segments w.r.t. one another
only when they are independent of each other, and I guess it is able
to identify that. In our case, it's the for loop : there are some
variables used inside it, and those would prevent the compiler from
moving the loop. Even if the compiler finds it safe to move the loop
relative to surrounding code, it would not split the loop contents
themselves, so the for loop will remain intact, and so will the
vectorization, although the compiler does seem to keep an unrolled
version of the loop in case it is called with smaller iteration
counts. But yes, if someone in the future changes the for loop, it
could break the auto-vectorization. So we should have appropriate
comments (the patch has those). Let me know if you find any possible
reasons why the compiler would stop vectorizing in the future even
when there is no change in the for loop.

Would it be safer to take the for loop out into an inline function,
just to play it safe ?




Re: Inlining of couple of functions in pl_exec.c improves performance

2020-06-10 Thread Amit Khandekar
On Tue, 9 Jun 2020 at 21:49, Pavel Stehule  wrote:
> Is your patch in commitfest in commitfest application?

Thanks for reminding me. Just added.
https://commitfest.postgresql.org/28/2590/




Re: calling procedures is slow and consumes extra much memory against calling function

2020-06-10 Thread Amit Khandekar
On Sat, 16 May 2020 at 00:07, Pavel Stehule  wrote:
>
> Hi
>
>>>
>>> The problem is in plpgsql implementation of CALL statement
>>>
>>> In non atomic case -  case of using procedures from DO block, the 
>>> expression plan is not cached, and plan is generating any time. This is 
>>> reason why it is slow.
>>>
>>> Unfortunately, generated plans are not released until SPI_finish. Attached 
>>> patch fixed this issue.
>>
>>
>> But now, recursive calling doesn't work :-(. So this patch is not enough
>
>
> Attached patch is working - all tests passed

Could you show an example testcase that tests this recursive scenario,
with which your earlier patch fails the test, and this v2 patch passes
it ? I am trying to understand the recursive scenario and the re-use
of expr->plan.

>
> It doesn't solve performance, and doesn't solve all memory problems, but 
> significantly reduce memory requirements from 5007 bytes to 439 bytes per one 
> CALL

So now this patch's intention is to reduce memory consumption, and it
doesn't target the slowness improvement, right ?

-- 
Thanks,
-Amit Khandekar
Huawei Technologies




Auto-vectorization speeds up multiplication of large-precision numerics

2020-06-09 Thread Amit Khandekar
to-vectorization is
-ftree-loop-vectorize. But for -O3 it is always true. What we can do
in the future is to have a separate file that has such optimized code
that is proven to work with such optimization flags, and enable the
required compiler flags only for such files, if the build is done with
-O2.

[1] https://gcc.gnu.org/projects/tree-ssa/vectorization.html#using


-- 
Thanks,
-Amit Khandekar
Huawei Technologies


create_schema.sql
Description: application/sql
From 91aa5ef1034b763e279b4a8970101355e4a79600 Mon Sep 17 00:00:00 2001
From: Amit Khandekar 
Date: Tue, 9 Jun 2020 16:35:23 +0530
Subject: [PATCH] Auto-vectorize loop to speedup large-precision numeric
 product

A 'for' loop in mul_var() runs backwards by decrementing two
variables. This prevents the gcc compiler from auto-vectorizing the
for loop. So make it a forward loop with a single variable. This gives
performance benefits for products of numeric types with large
precision, with speedups becoming noticeable at precisions
starting from 20-40. The typical pattern of benefit is :
precision 50: 5%; precision 60: 11%; precision 120: 50%;
precision 240: 2.2x; and so on.
On some CPU architectures, the speedup starts from precision 20
onwards.
With the precisions used in the numeric_big regression test, the
multiplication speeds up by 2.5 to 2.7 times.

Auto-vectorization happens with the -O3 flag or -ftree-loop-vectorize. So
this benefit will be seen when built with gcc -O3.
---
 src/backend/utils/adt/numeric.c | 11 ---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index f3a725271e..4243242ad9 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -7226,6 +7226,7 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 	int			res_weight;
 	int			maxdigits;
 	int		   *dig;
+	int		   *digptr;
 	int			carry;
 	int			maxdig;
 	int			newdig;
@@ -7362,10 +7363,14 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 		 *
 		 * As above, digits of var2 can be ignored if they don't contribute,
 		 * so we only include digits for which i1+i2+2 <= res_ndigits - 1.
+		 *
+		 * For large precisions, this can become a bottleneck; so keep this for
+		 * loop simple so that it can be auto-vectorized.
 		 */
-		for (i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3), i = i1 + i2 + 2;
-			 i2 >= 0; i2--)
-			dig[i--] += var1digit * var2digits[i2];
+		i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
+		digptr = &dig[i1 + 2];
+		for (i = 0; i <= i2; i++)
+			digptr[i] += var1digit * var2digits[i];
 	}
 
 	/*
-- 
2.17.1



Re: Inlining of couple of functions in pl_exec.c improves performance

2020-06-01 Thread Amit Khandekar
On Mon, 1 Jun 2020 at 12:27, Pavel Stehule  wrote:
> po 1. 6. 2020 v 8:15 odesílatel Amit Khandekar  
> napsal:
>>
>> On Sat, 30 May 2020 at 11:11, Pavel Stehule  wrote:
>> > I think so the effect of these patches strongly depends on CPU and compile
>>
>> I quickly tried pi() with gcc 10 as well, and saw more or less the
>> same benefit. I think, we are bound to see some differences in the
>> benefits across architectures, kernels and compilers; but looks like
>> some benefit is always there.
>>
>> > but it is micro optimization, and when I look to profiler, the bottle neck 
>> > is elsewhere.
>>
>> Please check the perf numbers in my reply to Michael. I suppose you
>> meant CachedPlanIsSimplyValid() when you say the bottle neck is
>> elsewhere ? Yeah, this function is always the hottest spot, which I
>> recall is being discussed elsewhere. But I always see exec_stmt(),
>> exec_assign_value as the next functions.
>
>
> It is hard to read the profile result, because these functions are nested 
> together. For your example
>
> 18.22%  postgres  postgres   [.] CachedPlanIsSimplyValid
>
> Is little bit strange, and probably this is real bottleneck in your simple 
> example, and maybe some work can be done there, because you assign just 
> constant.

I had earlier had a quick look on this one. CachedPlanIsSimplyValid()
was, I recall, hitting a hotspot when it tries to access
plansource->search_path (possibly cacheline miss). But didn't get a
chance to dig further into that. For now, I am focusing on these other
functions for which the patches were submitted.


>
> On second hand, your example is pretty unrealistic - and against any 
> developer's best practices for writing cycles.
>
> I think so we can look on PostGIS, where is some computing heavy routines in 
> PLpgSQL, and we can look on real profiles.
>
> Probably the most people will have benefit from these optimization.

I understand it's not a real world example. For generating perf
figures, I had to use an example which amplifies the benefits, so that
the effect of the patches on the perf figures also becomes visible.
Hence, used that example. I had shown the benefits up-thread using a
practical function avg_space(). But the perf figures for that example
were varying a lot.

So below, what I did was : Run the avg_space() ~150 times, and took
perf report. This stabilized the results a bit :

HEAD :
+   16.10%  17.29%  16.82%  postgres  postgres[.]
ExecInterpExpr
+   13.80%  13.56%  14.49%  postgres  plpgsql.so  [.]
exec_assign_value
+   12.64%  12.10%  12.09%  postgres  plpgsql.so  [.]
plpgsql_param_eval_var
+   12.15%  11.28%  11.05%  postgres  plpgsql.so  [.]
exec_stmt
+   10.81%  10.24%  10.55%  postgres  plpgsql.so  [.]
exec_eval_expr
+9.50%   9.35%   9.37%  postgres  plpgsql.so  [.]
exec_cast_value
.
+1.19%   1.06%   1.21%  postgres  plpgsql.so  [.]
exec_stmts


0001+0002 patches applied (i.e. inline exec_stmt) :
+   16.90%  17.20%  16.54%  postgres  postgres[.]
ExecInterpExpr
+   16.42%  15.37%  15.28%  postgres  plpgsql.so  [.]
exec_assign_value
+   11.34%  11.92%  11.93%  postgres  plpgsql.so  [.]
plpgsql_param_eval_var
+   11.18%  11.86%  10.99%  postgres  plpgsql.so  [.] exec_stmts.part.0
+   10.51%   9.52%  10.61%  postgres  plpgsql.so  [.]
exec_eval_expr
+9.39%   9.48%   9.30%  postgres  plpgsql.so  [.]
exec_cast_value

HEAD : exec_stmts + exec_stmt = ~12.7 %
Patched (0001+0002): exec_stmts = 11.3 %

Just 0003 patch applied (i.e. inline exec_cast_value) :
+   17.00%  16.77%  17.09% postgres  postgres   [.] ExecInterpExpr
+   15.21%  15.64%  15.09% postgres  plpgsql.so [.] exec_assign_value
+   14.48%  14.06%  13.94% postgres  plpgsql.so [.] exec_stmt
+   13.26%  13.30%  13.14% postgres  plpgsql.so [.]
plpgsql_param_eval_var
+   11.48%  11.64%  12.66% postgres  plpgsql.so [.] exec_eval_expr

+1.03%   0.85%   0.87% postgres  plpgsql.so [.] exec_stmts

HEAD : exec_assign_value + exec_cast_value = ~23.4 %
Patched (0003) : exec_assign_value = 15.3 %


Time in milliseconds after calling avg_space() 150 times :
HEAD  : 7210
Patch 0001+0002 : 6925
Patch 0003 : 6670
Patch 0001+0002+0003 : 6346




Re: Inlining of couple of functions in pl_exec.c improves performance

2020-06-01 Thread Amit Khandekar
On Sat, 30 May 2020 at 11:11, Pavel Stehule  wrote:
> I think so the effect of these patches strongly depends on CPU and compile

I quickly tried pi() with gcc 10 as well, and saw more or less the
same benefit. I think, we are bound to see some differences in the
benefits across architectures, kernels and compilers; but looks like
some benefit is always there.

> but it is micro optimization, and when I look to profiler, the bottle neck is 
> elsewhere.

Please check the perf numbers in my reply to Michael. I suppose you
meant CachedPlanIsSimplyValid() when you say the bottle neck is
elsewhere ? Yeah, this function is always the hottest spot, which I
recall is being discussed elsewhere. But I always see exec_stmt(),
exec_assign_value as the next functions.

>> > patch 0002 it is little bit against simplicity, but for PLpgSQL with 
>> > blocks structure it is correct.
>>
>> Here, I moved the exec_stmt code into exec_stmts() function because
>> exec_stmts() was the only caller, and that function is not that big. I
>> am assuming you were referring to this point when you said it is a bit
>> against simplicity. But I didn't get what you implied by "but for
>> PLpgSQL with blocks structure it is correct"
>
>
> Nested statement in PLpgSQL is always a list of statements. It is not single 
> statement ever. So is not too strange don't have a function execute_stmt.

Right.




Re: Inlining of couple of functions in pl_exec.c improves performance

2020-06-01 Thread Amit Khandekar
On Sun, 31 May 2020 at 08:04, Michael Paquier  wrote:
> This stuff is interesting.  Do you have some perf profiles to share?
> I am wondering what's the effect of the inlining with your test
> cases.

Below are the perf numbers for asignmany.sql :

HEAD :

+   16.88%  postgres  postgres   [.] CachedPlanIsSimplyValid
+   16.64%  postgres  plpgsql.so [.] exec_stmt
+   15.56%  postgres  plpgsql.so [.] exec_eval_expr
+   13.58%  postgres  plpgsql.so [.] exec_assign_value
+7.49%  postgres  plpgsql.so [.] exec_cast_value
+7.17%  postgres  plpgsql.so [.] exec_assign_expr
+5.39%  postgres  postgres   [.] MemoryContextReset
+3.91%  postgres  postgres   [.] ExecJustConst
+3.33%  postgres  postgres   [.] recomputeNamespacePath
+2.88%  postgres  postgres   [.] OverrideSearchPathMatchesCurrent
+2.18%  postgres  plpgsql.so [.] exec_eval_cleanup.isra.17
+2.15%  postgres  plpgsql.so [.] exec_stmts
+1.32%  postgres  plpgsql.so [.] MemoryContextReset@plt
+0.57%  postgres  plpgsql.so [.] CachedPlanIsSimplyValid@plt
+0.57%  postgres  postgres   [.] GetUserId
 0.30%  postgres  plpgsql.so [.] assign_simple_var.isra.13
 0.05%  postgres  [kernel.kallsyms]  [k] unmap_page_range

Patched :

+   18.22%  postgres  postgres   [.] CachedPlanIsSimplyValid
+   17.25%  postgres  plpgsql.so [.] exec_eval_expr
+   16.31%  postgres  plpgsql.so [.] exec_stmts
+   15.00%  postgres  plpgsql.so [.] exec_assign_value
+7.56%  postgres  plpgsql.so [.] exec_assign_expr
+5.64%  postgres  postgres   [.] MemoryContextReset
+5.16%  postgres  postgres   [.] ExecJustConst
+4.86%  postgres  postgres   [.] recomputeNamespacePath
+4.54%  postgres  postgres   [.] OverrideSearchPathMatchesCurrent
+2.33%  postgres  plpgsql.so [.] exec_eval_cleanup.isra.17
+1.26%  postgres  plpgsql.so [.] MemoryContextReset@plt
+0.81%  postgres  postgres   [.] GetUserId
+0.71%  postgres  plpgsql.so [.] CachedPlanIsSimplyValid@plt
 0.26%  postgres  plpgsql.so [.] assign_simple_var.isra.13
 0.03%  postgres  [kernel.kallsyms]  [k] unmap_page_range
 0.02%  postgres  [kernel.kallsyms]  [k] mark_page_accessed

Notice the reduction in percentages :
HEAD : exec_stmts + exec_stmt = 18.79
Patched : exec_stmts = 16.31

HEAD : exec_assign_value + exec_cast_value : 21.07
Patched : exec_assign_value = 15.00

As expected, the reduction of percentage in these two functions caused
other functions like CachedPlanIsSimplyValid() and exec_eval_expr() to
show a rise in their percentages.




Re: Inlining of couple of functions in pl_exec.c improves performance

2020-05-29 Thread Amit Khandekar
On Thu, 28 May 2020 at 14:39, Pavel Stehule  wrote:
> I don't see 17% anywhere, but 3-5% is not bad.

Did you see 3-5% only for the pi function, or did you see the same
improvement also for the functions that I wrote ? I was getting a
consistent result of 14-18 % on both of the VMs. Also, is your test
machine running on Windows ? All the machines I tested were running
the Linux kernel (Ubuntu).

Below are my results for your pi_est_1() function. For this function,
I am consistently getting 5-9 % improvement. I tested on 3 machines :

gcc : 8.4.0. -O2 option
OS : Ubuntu Bionic

explain analyze select pi_est_1(1000)

1. x86_64 laptop VM (Intel Core i7-8665U)
HEAD :2666 2617 2600 2630
Patched : 2502 2409 2460 2444


2. x86_64 VM (Xeon Gold 6151)
HEAD :1664 1662 1721 1660
Patched : 1541 1548 1537 1526

3. ARM64 VM (Kunpeng)
HEAD :2873 2864 2860 2861
Patched : 2568 2513 2501 2538


>
> patch 0001 has sense and can help with code structure
> patch 0002 it is little bit against simplicity, but for PLpgSQL with blocks 
> structure it is correct.

Here, I moved the exec_stmt code into exec_stmts() function because
exec_stmts() was the only caller, and that function is not that big. I
am assuming you were referring to this point when you said it is a bit
against simplicity. But I didn't get what you implied by "but for
PLpgSQL with blocks structure it is correct"

-- 
Thanks,
-Amit Khandekar
Huawei Technologies




Re: Inlining of couple of functions in pl_exec.c improves performance

2020-05-27 Thread Amit Khandekar
On Tue, 26 May 2020 at 09:06, Amit Khandekar  wrote:
>
> On Sat, 23 May 2020 at 23:24, Pavel Stehule  wrote:
> >
> >FOR counter IN 1..180 LOOP
> >   id = 0; id = 0; id1 = 0;
> >   id2 = 0; id3 = 0; id1 = 0; id2 = 0;
> >   id3 = 0; id = 0; id = 0; id1 = 0;
> >   id2 = 0; id3 = 0; id1 = 0; id2 = 0;
> >   id3 = 0;
> >END LOOP;
> >
> > This is not too much typical PLpgSQL code. All expressions are not 
> > parametrized - so this test is little bit obscure.
> >
> > Last strange performance plpgsql benchmark did calculation of pi value. It 
> > does something real
>
> Yeah, basically I wanted to have many statements, and that too with
> many assignments where casts are not required. Let me check if I can
> come up with a real-enough testcase. Thanks.

create table tab (id int[]);
insert into tab select array((select ((random() * 10)::bigint) id
from generate_series(1, 3) order by 1));
insert into tab select array((select ((random() * 60)::bigint) id
from generate_series(1, 3) order by 1));
insert into tab select array((select ((random() * 100)::bigint) id
from generate_series(1, 3) order by 1));
insert into tab select array((select ((random() * 10)::bigint) id
from generate_series(1, 3) order by 1));
insert into tab select array((select ((random() * 60)::bigint) id
from generate_series(1, 3) order by 1));
insert into tab select array((select ((random() * 100)::bigint) id
from generate_series(1, 3) order by 1));
insert into tab select array((select ((random() * 10)::bigint) id
from generate_series(1, 3) order by 1));
insert into tab select array((select ((random() * 60)::bigint) id
from generate_series(1, 3) order by 1));
insert into tab select array((select ((random() * 100)::bigint) id
from generate_series(1, 3) order by 1));


-- Return how much two consecutive array elements are apart from each
-- other, on average; i.e. how much the numbers are spaced out.
-- Input is an ordered array of integers.
CREATE OR REPLACE FUNCTION avg_space(int[]) RETURNS bigint AS $$
DECLARE
  diff int = 0;
  num int;
  prevnum int = 1;
BEGIN
  FOREACH num IN ARRAY $1
  LOOP
diff = diff + num - prevnum;
prevnum = num;
  END LOOP;
  RETURN diff/array_length($1, 1);
END;
$$ LANGUAGE plpgsql;

explain analyze select avg_space(id) from tab;

Like earlier figures, these are execution times in milliseconds, taken
from explain-analyze.
ARM VM:
   HEAD                 : 49.8
   patch 0001+0002      : 47.8 => 4.2%
   patch 0001+0002+0003 : 42.9 => 16.1%
x86 VM:
   HEAD                 : 32.8
   patch 0001+0002      : 32.7 => 0%
   patch 0001+0002+0003 : 28.0 => 17.1%




Re: Inlining of couple of functions in pl_exec.c improves performance

2020-05-25 Thread Amit Khandekar
On Sat, 23 May 2020 at 23:24, Pavel Stehule  wrote:
>
>    FOR counter IN 1..180 LOOP
>       id = 0; id = 0; id1 = 0;
>       id2 = 0; id3 = 0; id1 = 0; id2 = 0;
>       id3 = 0; id = 0; id = 0; id1 = 0;
>       id2 = 0; id3 = 0; id1 = 0; id2 = 0;
>       id3 = 0;
>    END LOOP;
>
> This is not too much typical PLpgSQL code. All expressions are not 
> parametrized - so this test is little bit obscure.
>
> Last strange performance plpgsql benchmark did calculation of pi value. It 
> does something real

Yeah, basically I wanted to have many statements, and that too with
many assignments where casts are not required. Let me check if I can
come up with a real-enough testcase. Thanks.




Inlining of couple of functions in pl_exec.c improves performance

2020-05-23 Thread Amit Khandekar
There are a couple of function-call overheads I observed in pl/pgsql
code: exec_stmt() and exec_cast_value(). Removing these overheads
resulted in some performance gains.

exec_stmt() :

plpgsql_exec_function() and other toplevel block executors currently
call exec_stmt(). But actually they don't need to do everything that
exec_stmt() does. So they can call a new function instead of
exec_stmt(), and all the exec_stmt() code can be moved to
exec_stmts(). The things that exec_stmt() does, but which are not
necessary for a top-level block statement, are:

1. save_estmt = estate->err_stmt; estate->err_stmt = stmt;
For top-level blocks, saving estate->err_stmt is not necessary,
because there is no statement after this block statement. Anyway,
plpgsql_exec_function() assigns estate.err_stmt just before calling
exec_stmt(), so there is really no point in exec_stmt() setting it again.

2. CHECK_FOR_INTERRUPTS()
This is not necessary for toplevel block callers.

3. exec_stmt_block() can be directly called rather than exec_stmt()
because func->action is a block statement. So the switch statement is
not necessary.

But this one might be necessary for a toplevel block statement:
  if (*plpgsql_plugin_ptr && (*plpgsql_plugin_ptr)->stmt_beg)
     ((*plpgsql_plugin_ptr)->stmt_beg) (estate, stmt);

There was already repetitive code in plpgsql_exec_function() and
other functions around the exec_stmt() call. So in a separate patch,
0001*.patch, I moved that code into a common function,
exec_toplevel_block(). In the main patch
0002-Get-rid-of-exec_stmt-function-call.patch, I additionally called
plpgsql_plugin_ptr->stmt_beg() inside exec_toplevel_block(). And moved
exec_stmt() code into exec_stmts().



exec_cast_value() :

This function does not do the casting if it is not required. So I moved
the code that actually does the cast into a separate function, so as to
reduce the exec_cast_value() code size and make it inline. Attached is
the 0003-Inline-exec_cast_value.patch.


Testing
--

I used two available VMs (one x86_64 and the other arm64), and the
benefit showed up on both of these machines. Attached patches 0001,
0002, 0003 are to be applied in that order. 0001 is just a preparatory
patch.

First I tried with a simple for loop with a single assignment
(attached forcounter.sql).

By inlining the two functions, I found a noticeable reduction in
execution time, as shown below (figures are in milliseconds, averaged
over multiple runs; taken from 'explain analyze' execution times):
ARM VM :
   HEAD : 100 ; Patched : 88 => 13.6% improvement
x86 VM :
   HEAD :  71 ; Patched : 66 => 7.63% improvement.

Then I included many assignment statements, as shown in attachment
assignmany.sql. This showed further benefit:
ARM VM :
   HEAD : 1820 ; Patched : 1549 => 17.5% improvement
x86 VM :
   HEAD : 1020 ; Patched :  869 => 17.4% improvement

Inlining just exec_stmt() showed the improvement mainly on the arm64
VM (7.4%); for x86, it was 2.7%.
But inlining exec_stmt() and exec_cast_value() together showed
benefits on both machines, as can be seen above.

-- 
Thanks,
-Amit Khandekar
Huawei Technologies
From 66c607ef6f0b7b655819b4b19383e024c5f8788c Mon Sep 17 00:00:00 2001
From: Amit Khandekar 
Date: Sat, 23 May 2020 21:39:41 +0800
Subject: [PATCH 1/3] Modularize code in toplevel pl/pgsql block callers

Functions that call exec_stmt() for executing the toplevel block have
repetitive code that is now moved into a common function,
exec_toplevel_block(). This is in preparation for further changes in
this part of the code.
---
 src/pl/plpgsql/src/pl_exec.c | 86 
 1 file changed, 38 insertions(+), 48 deletions(-)

diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 9a87cd70f1..0a70ceddbb 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -267,6 +267,9 @@ static MemoryContext get_stmt_mcontext(PLpgSQL_execstate *estate);
 static void push_stmt_mcontext(PLpgSQL_execstate *estate);
 static void pop_stmt_mcontext(PLpgSQL_execstate *estate);
 
+static void exec_toplevel_block(PLpgSQL_execstate *estate,
+PLpgSQL_stmt_block *block,
+char *object_type, char *err_text);
 static int	exec_stmt_block(PLpgSQL_execstate *estate,
 			PLpgSQL_stmt_block *block);
 static int	exec_stmts(PLpgSQL_execstate *estate,
@@ -475,7 +478,6 @@ plpgsql_exec_function(PLpgSQL_function *func, FunctionCallInfo fcinfo,
 	PLpgSQL_execstate estate;
 	ErrorContextCallback plerrcontext;
 	int			i;
-	int			rc;
 
 	/*
 	 * Setup the execution state
@@ -605,23 +607,8 @@ plpgsql_exec_function(PLpgSQL_function *func, FunctionCallInfo fcinfo,
 	/*
 	 * Now call the toplevel block of statements
 	 */
-	estate.err_text = NULL;
-	estate.err_stmt = (PLpgSQL_stmt *) (func->action);
-	rc = exec_stmt(, (PLpgSQL_stmt *) func->action);
-	if (rc != PLPGSQL_RC_RETURN)
-	{
-		estate.err_stmt = NULL;
-		estate.err_text = NULL;
-		ereport(ERROR,
-(errcode(ERRC

pgbench testing with contention scenarios

2020-04-20 Thread Amit Khandekar
Subject changed. Earlier it was : spin_delay() for ARM

On Fri, 17 Apr 2020 at 22:54, Robert Haas  wrote:
>
> On Thu, Apr 16, 2020 at 3:18 AM Amit Khandekar  wrote:
> > Not relevant to the PAUSE stuff ... Note that when the parallel
> > clients reach from 24 to 32 (which equals the machine CPUs), the TPS
> > shoots from 454189 to 1097592 which is more than double speed gain
> > with just a 30% increase in parallel sessions.
For reference, the TPS numbers can be seen here:
https://www.postgresql.org/message-id/CAJ3gD9e86GY%3DQfyfZQkb11Z%2BCVWowDiGgGThzKKwHDGU9uA2yA%40mail.gmail.com
>
> I've seen stuff like this too. For instance, check out the graph from
> this 2012 blog post:
>
> http://rhaas.blogspot.com/2012/04/did-i-say-32-cores-how-about-64.html
>
> You can see that the performance growth is basically on a straight
> line up to about 16 cores, but then it kinks downward until about 28,
> after which it kinks sharply upward until about 36 cores.
>
> I think this has something to do with the process scheduling behavior
> of Linux, because I vaguely recall some discussion where somebody did
> benchmarking on the same hardware on both Linux and one of the BSD
> systems, and the effect didn't appear on BSD. They had other problems,
> like a huge drop-off at higher core counts, but they didn't have that
> effect.

Ah I see.

By the way, I have observed this behaviour on both x86 and ARM,
regardless of whether the CPU count is as low as 8 or as high as 32.

But for me, I suspect it's a combination of the Linux scheduler and
interactions between backends and pgbench clients.

So I used a custom script. I used the same point query that is used
with the -S option, but the query is run again and again on the
server side without the client having to send it repeatedly, so the
pgbench clients are idle most of the time.

Query used :
select foo(30);
where foo(int) is defined as :
create or replace function foo(iterations int) returns int as $$
declare
   id int; ret int; counter int = 0;
begin
   WHILE counter < iterations
   LOOP
      counter = counter + 1;
      id = random() * 300;
      select into ret aid from pgbench_accounts where aid = id;
   END LOOP;
   return ret;
end $$ language plpgsql;

Below are results for scale factor 30, with 8 CPUs :

Clients  TPS
2        1.255327
4        2.414139
6        3.532937
8        4.586583
10       4.557575
12       4.517226
14       4.551455
18       4.593271

You can see that the TPS rises almost linearly in proportion to the
increase in clients, with no deviation in between, up to 8 clients,
beyond which it does not rise because the CPUs are fully utilized.

In this custom case as well, the behaviour is the same on both x86 and
ARM, regardless of 8 CPUs or 32 CPUs.


--
Thanks,
-Amit Khandekar
Huawei Technologies




Re: spin_delay() for ARM

2020-04-20 Thread Amit Khandekar
On Sat, 18 Apr 2020 at 03:30, Thomas Munro  wrote:
>
> On Sat, Apr 18, 2020 at 2:00 AM Ants Aasma  wrote:
> > On Thu, 16 Apr 2020 at 10:33, Pavel Stehule  wrote:
> > > what I know, pgbench cannot be used for testing spinlocks problems.
> > >
> > > Maybe you can see this issue when a) use higher number clients - 
> > > hundreds, thousands. Decrease share memory, so there will be press on 
> > > related spin lock.
> >
> > There really aren't many spinlocks left that could be tickled by a
> > normal workload. I looked for a way to trigger spinlock contention
> > when I prototyped a patch to replace spinlocks with futexes. The only
> > one that I could figure out a way to make contended was the lock
> > protecting parallel btree scan. A highly parallel index only scan on a
> > fully cached index should create at least some spinlock contention.
>
> I suspect the snapshot-too-old "mutex_threshold" spinlock can become
> contended under workloads that generate a high rate of
> heap_page_prune_opt() calls with old_snapshot_threshold enabled.  One
> way to do that is with a bunch of concurrent index scans that hit the
> heap in random order.  Some notes about that:
>
> https://www.postgresql.org/message-id/flat/CA%2BhUKGKT8oTkp5jw_U4p0S-7UG9zsvtw_M47Y285bER6a2gD%2Bg%40mail.gmail.com

Thanks all for the inputs. Will keep these two particular scenarios in
mind, and try to get some bandwidth on this soon.


-- 
Thanks,
-Amit Khandekar
Huawei Technologies




Re: spin_delay() for ARM

2020-04-16 Thread Amit Khandekar
On Mon, 13 Apr 2020 at 20:16, Amit Khandekar  wrote:
> On Sat, 11 Apr 2020 at 04:18, Tom Lane  wrote:
> >
> > I wrote:
> > > A more useful test would be to directly experiment with contended
> > > spinlocks.  As I recall, we had some test cases laying about when
> > > we were fooling with the spin delay stuff on Intel --- maybe
> > > resurrecting one of those would be useful?
> >
> > The last really significant performance testing we did in this area
> > seems to have been in this thread:
> >
> > https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
> >
> > A relevant point from that is Haas' comment
> >
> > I think optimizing spinlocks for machines with only a few CPUs is
> > probably pointless.  Based on what I've seen so far, spinlock
> > contention even at 16 CPUs is negligible pretty much no matter what
> > you do.  Whether your implementation is fast or slow isn't going to
> > matter, because even an inefficient implementation will account for
> > only a negligible percentage of the total CPU time - much less than 1%
> > - as opposed to a 64-core machine, where it's not that hard to find
> > cases where spin-waits consume the *majority* of available CPU time
> > (recall previous discussion of lseek).
>
> Yeah, will check if I find some machines with large cores.

I got hold of a 32-CPU VM (actually a 16-core machine, but being
hyperthreaded, it presents 32 CPUs).
It was an Intel Xeon, 3 GHz CPU, with 15 GB of available memory.
Hypervisor: KVM. Single NUMA node.
PG parameters changed: shared_buffers = 8GB; max_connections = 1000

I compared pgbench results with HEAD versus PAUSE removed like this :
 perform_spin_delay(SpinDelayStatus *status)
 {
-   /* CPU-specific delay each time through the loop */
-   SPIN_DELAY();

Ran with increasing number of parallel clients :
pgbench -S -c $num -j $num -T 60 -M prepared
But couldn't find any significant change in the TPS numbers with or
without PAUSE:

Clients   HEAD      Without_PAUSE
8         26        247264
16        399939    399549
24        454189    453244
32        1097592   1098844
40        1090424   1087984
48        1068645   1075173
64        1035035   1039973
96        976578    970699

Maybe it will indeed show some difference only with around 64 cores,
or perhaps a bare-metal machine would help; but as of now I didn't get
such a machine. Anyway, I thought why not archive the results with
whatever I have.

Not relevant to the PAUSE stuff ... Note that when the parallel
clients go from 24 to 32 (which equals the machine's CPU count), the
TPS shoots from 454189 to 1097592, which is more than double, with
just a 30% increase in parallel sessions. I was not expecting this
much speed gain, because in the contended scenario the pgbench
processes were already taking around 20% of the total CPU time of the
pgbench run. Maybe later on I will get a chance to run with a
customized pgbench script that runs a server function which keeps
running an index scan on pgbench_accounts, so as to make the pgbench
clients almost idle.

Thanks
-Amit Khandekar




Re: spin_delay() for ARM

2020-04-13 Thread Amit Khandekar
On Sat, 11 Apr 2020 at 04:18, Tom Lane  wrote:
>
> I wrote:
> > A more useful test would be to directly experiment with contended
> > spinlocks.  As I recall, we had some test cases laying about when
> > we were fooling with the spin delay stuff on Intel --- maybe
> > resurrecting one of those would be useful?
>
> The last really significant performance testing we did in this area
> seems to have been in this thread:
>
>
> https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
>
> A relevant point from that is Haas' comment
>
> I think optimizing spinlocks for machines with only a few CPUs is
> probably pointless.  Based on what I've seen so far, spinlock
> contention even at 16 CPUs is negligible pretty much no matter what
> you do.  Whether your implementation is fast or slow isn't going to
> matter, because even an inefficient implementation will account for
> only a negligible percentage of the total CPU time - much less than 1%
> - as opposed to a 64-core machine, where it's not that hard to find
> cases where spin-waits consume the *majority* of available CPU time
> (recall previous discussion of lseek).

Yeah, will check if I find some machines with large cores.


> So I wonder whether this patch is getting ahead of the game.  It does
> seem that ARM systems with a couple dozen cores exist, but are they
> common enough to optimize for yet?  Can we even find *one* to test on
> and verify that this is a win and not a loss?  (Also, seeing that
> there are so many different ARM vendors, results from just one
> chipset might not be too trustworthy ...)

Ok. Yes, it would be worth waiting to see if there are others in the
community with ARM systems that have implemented YIELD. Maybe after
that we might gain some confidence. I myself also hope that I will get
one soon to test, but right now I have one that does not implement it,
so it would be just a no-op.

-- 
Thanks,
-Amit Khandekar
Huawei Technologies


Re: spin_delay() for ARM

2020-04-13 Thread Amit Khandekar
On Sat, 11 Apr 2020 at 00:47, Andres Freund  wrote:
>
> Hi,
>
> On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
> > On my Intel Xeon machine with 8 cores, I tried to test PAUSE also
> > using a sample C program (attached spin.c). Here, many child processes
> > (much more than CPUs) wait in a tight loop for a shared variable to
> > become 0, while the parent process continuously increments a sequence
> > number for a fixed amount of time, after which, it sets the shared
> > variable to 0. The child's tight loop calls PAUSE in each iteration.
> > What I hoped was that because of PAUSE in children, the parent process
> > would get more share of the CPU, due to which, in a given time, the
> > sequence number will reach a higher value. Also, I expected the CPU
> > cycles spent by child processes to drop down, thanks to PAUSE. None of
> > these happened. There was no change.
>
> > Possibly, this testcase is not right. Probably the process preemption
> > occurs only within the set of hyperthreads attached to a single core.
> > And in my testcase, the parent process is the only one who is ready to
> > run. Still, I have anyway attached the program (spin.c) for archival;
> > in case somebody with a YIELD-supporting ARM machine wants to use it
> > to test YIELD.
>
> PAUSE doesn't operate on the level of the CPU scheduler. So the OS won't
> just schedule another process - you won't see different CPU usage if you
> measure it purely as the time running.

Yeah, I thought that the OS scheduling would be an *indirect*
consequence of the PAUSE, because of its slowing down the CPU, but it
looks like that does not happen.


> You should be able to see a
> difference if you measure with a profiler that shows you data from the
> CPUs performance monitoring unit.
Hmm, I had tried with perf and could see the PAUSE itself consuming 5%
CPU. But I haven't yet played with the per-process figures.



-- 
Thanks,
-Amit Khandekar
Huawei Technologies


spin_delay() for ARM

2020-04-10 Thread Amit Khandekar
Hi,

We use (an equivalent of) the PAUSE instruction in spin_delay() for
Intel architectures. The goal is to slow down the spinlock tight loop
and thus prevent it from eating CPU and causing CPU starvation, so
that other processes get their fair share of the CPU time. Intel
documentation [1] clearly mentions this, along with other benefits of
PAUSE, like, low power consumption, and avoidance of memory order
violation while exiting the loop.

Similar to PAUSE, the ARM architecture has YIELD instruction, which is
also clearly documented [2]. It explicitly says that it is a way to
hint the CPU that it is being called in a spinlock loop and this
process can be preempted out.  But for ARM, we are not using any kind
of spin delay.

For PG spinlocks, the goal of both of these instructions is the same,
and both architectures recommend using them in spinlock loops.
Also, I found multiple places where YIELD is already used in similar
situations: the Linux kernel [3]; OpenJDK [4],[5]

Now, for ARM implementations that don't implement YIELD, it runs as a
no-op. Unfortunately the ARM machine I have does not implement YIELD.
But recently there have been some ARM implementations that are
hyperthreaded, so they are expected to actually act on the YIELD,
although the docs do not explicitly say that YIELD has to be
implemented only by hyperthreaded implementations.

I ran some pgbench tests to test PAUSE/YIELD on the respective
architectures, once with the instruction present and once with it
removed. I didn't see a change in the TPS numbers; they were more or
less the same. For ARM, this was expected because my ARM machine does
not implement YIELD.

On my Intel Xeon machine with 8 cores, I tried to test PAUSE also
using a sample C program (attached spin.c). Here, many child processes
(much more than CPUs) wait in a tight loop for a shared variable to
become 0, while the parent process continuously increments a sequence
number for a fixed amount of time, after which, it sets the shared
variable to 0. The child's tight loop calls PAUSE in each iteration.
What I hoped was that because of PAUSE in children, the parent process
would get more share of the CPU, due to which, in a given time, the
sequence number will reach a higher value. Also, I expected the CPU
cycles spent by child processes to drop down, thanks to PAUSE. None of
these happened. There was no change.

Possibly, this testcase is not right. Probably the process preemption
occurs only within the set of hyperthreads attached to a single core.
And in my testcase, the parent process is the only one who is ready to
run. Still, I have anyway attached the program (spin.c) for archival;
in case somebody with a YIELD-supporting ARM machine wants to use it
to test YIELD.

Nevertheless, I think that because we have clear documentation that
strongly recommends using it, and because it has been used in other
places such as the Linux kernel and the JDK, we should start using
YIELD for spin_delay() on ARM.

Attached is the trivial patch (spin_delay_for_arm.patch). To start
with, it contains changes only for aarch64. I haven't yet added
changes in configure[.in] for making sure yield compiles successfully
(YIELD is present in manuals from ARMv6 onwards). Before that I
thought of getting some comments; so didn't do configure changes yet.


[1] https://c9x.me/x86/html/file_module_x86_id_232.html
[2] https://developer.arm.com/docs/100076/0100/instruction-set-reference/a64-general-instructions/yield
[3] https://elixir.bootlin.com/linux/latest/source/arch/arm64/include/asm/processor.h#L259
[4] http://cr.openjdk.java.net/~dchuyko/8186670/yield/spinwait.html
[5] http://mail.openjdk.java.net/pipermail/aarch64-port-dev/2017-August/004880.html


--
Thanks,
-Amit Khandekar
Huawei Technologies
/*
 * Sample program to test the effect of PAUSE/YIELD instruction in a highly
 * contended scenario.  The Intel and ARM docs recommend the use of PAUSE and
 * YIELD respectively, in spinlock tight loops.
 *
 * This program can be run with :
 * gcc -O3 -o spin spin.c -lrt ; ./spin [number_of_processes]
 * By default, 4 processes are spawned.
 *
 * Child processes wait in a tight loop for a shared variable to become 0,
 * while the parent process continuously increments a sequence number for a
 * fixed amount of time, after which, it sets the shared variable to 0. The
 * child tight loop calls YIELD/PAUSE in each iteration.
 *
 * The intention is to create a number of processes much larger than the
 * available CPUs, so that the scheduler hopefully pre-empts the processes
 * because of the PAUSE, and the main process gets more CPU share because of
 * which it will increment its sequence number more number of times. So the
 * expectation is that with PAUSE, the program will end up with a much higher
 * sequence number than without PAUSE. Similarly, the child processes should
 * have lesser CPU cycles with PAUSE than without PAUSE.
 *
 * Author: Amit Khandekar
 */

#include 
#include 
#include 
#include

Re: Minimal logical decoding on standbys

2020-01-20 Thread Amit Khandekar
On Fri, 17 Jan 2020 at 13:20, Andreas Joseph Krogh  wrote:
>
> On Thursday, 16 January 2020 at 05:42:24, Amit Khandekar wrote:
>
> On Fri, 10 Jan 2020 at 17:50, Rahila Syed  wrote:
> >
> > Hi Amit,
> >
> > Can you please rebase the patches as they don't apply on latest master?
>
> Thanks for notifying. Attached is the rebased version.
>
>
> Will this patch enable logical replication from a standby-server?

Sorry for the late reply.
This patch only supports logical decoding from a standby; it is just
the infrastructure for supporting logical replication from a standby.
We don't support creating a publication on a standby, but a
publication on the master is replicated to the standby, so we might be
able to create subscription nodes that connect to existing
publications on a standby; basically, though, we haven't tested
whether the publication/subscription model works with a publication on
a physical standby. This patch is focussed on providing a way to
continue logical replication *after* the standby is promoted to
master.


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Minimal logical decoding on standbys

2020-01-15 Thread Amit Khandekar
On Fri, 10 Jan 2020 at 17:50, Rahila Syed  wrote:
>
> Hi Amit,
>
> Can you please rebase the patches as they don't apply on latest master?

Thanks for notifying. Attached is the rebased version.


logicaldecodng_standby_v5_rebased.tar.gz
Description: GNU Zip compressed data


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2020-01-08 Thread Amit Khandekar
On Sun, 5 Jan 2020 at 00:21, Noah Misch  wrote:
> The buildfarm client can capture stack traces, but it currently doesn't do so
> for TAP test suites (search the client code for get_stack_trace).  If someone
> feels like writing a fix for that, it would be a nice improvement.  Perhaps,
> rather than having the client code know all the locations where core files
> might appear, failed runs should walk the test directory tree for core files?

I think this might end up with the same directory-walking code spread
across multiple files. Instead, I think that in the build script, in
get_stack_trace(), we can do an equivalent of "find  -name
"*core*"", as against the current way in which it looks for core files
only in the specific data directory. So get_stack_trace(bindir,
datadir) would change to get_stack_trace(bindir, input_dir), where
input_dir can be any directory that can contain multiple data
directories. E.g. a recovery test can create multiple instances, so
there would be multiple data directories inside the test directory.

Noah, is it possible to run a patched build script once I submit a
patch, so that we can quickly get the stack trace? I mean, can we do
this before getting the patch committed? I guess we can run the
build script with a single branch specified, right?




--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2020-01-03 Thread Amit Khandekar
On Fri, 3 Jan 2020 at 10:19, Amit Kapila  wrote:
>
> On Fri, Jan 3, 2020 at 8:29 AM Amit Kapila  wrote:
>>
>> On Thu, Jan 2, 2020 at 5:44 PM Amit Kapila  wrote:
>>>
>>> On Tue, Dec 24, 2019 at 2:31 PM Amit Khandekar  
>>> wrote:
>>> >
>>> >
>>> > Ok. I tested pg94_95_use_vfd_for_logrep.patch for 9.6 branch, and it
>>> > works there. So please use this patch for all the three branches.
>>> >
>>>
>>> Pushed!
>>
>>
>>
>> I see one failure in REL_10_STABLE [1] which seems to be due to this commit:
>>
>
> I tried this test on my CentOS and Power8 machine more than 50 times, but 
> couldn't reproduce it.  So, adding Noah to see if he can try this test [1] on 
> his machine (tern) and get a stack trace or some other information?
>
> [1] - make -C src/test/recovery/ check PROVE_TESTS=t/006_logical_decoding.pl

I also tested multiple times using PG 10 branch; also tried to inject
an error so that PG_CATCH related code also gets covered, but
unfortunately didn't get the crash on my machine. I guess, we will
have to somehow get the stacktrace.

>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Minimal logical decoding on standbys

2019-12-26 Thread Amit Khandekar
On Tue, 24 Dec 2019 at 14:02, Amit Khandekar  wrote:
>
> On Thu, 19 Dec 2019 at 01:02, Rahila Syed  wrote:
> >
> > Hi,
> >
> >> Hi, do you consistently get this failure on your machine ? I am not
> >> able to get this failure, but I am going to analyze when/how this can
> >> fail. Thanks
> >>
> > Yes, I am getting it each time I run make -C src/test/recovery/ check 
> > PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl
> > Also, there aren't any errors in logs indicating the cause.
>
> Thanks for the reproduction. Finally I could reproduce the behaviour.
> It occurs once in 7-8 runs of the test on my machine. The issue is :
> on master, the catalog_xmin does not immediately get updated. It
> happens only after the hot standby feedback reaches on master. And I
> haven't used wait_for_xmins() for these failing cases. I should use
> that. Working on the same ...

As mentioned above, I have used wait_for_xmins() so that we can wait
for the xmins to be updated after the hot standby feedback is
processed. In one of the 3 scenarios where it failed for you, I
removed the check at the second place because it was redundant. At the
3rd place, I made some appropriate changes with detailed comments.
Please check. Basically, we are checking that the master's physical
slot catalog_xmin has advanced, but not beyond the standby's logical
catalog_xmin. And to make sure the master's xmins are updated, I call
txid_current() and then wait for the master's xmin to advance after
hot standby feedback. This way I make sure the xmin/catalog_xmins are
up-to-date thanks to hot standby feedback, so that we can check
whether the master's physical slot catalog_xmin has reached the value
of the standby's catalog_xmin but not gone past it.

I have also moved the "wal_receiver_status_interval = 1" setting from
master to standby. It was wrongly kept in master. This now reduces the
test time by half, on my machine.

Attached patch set v5 has only the test changes. Please check if now
the test fails for you.

>
> --
> Thanks,
> -Amit Khandekar
> EnterpriseDB Corporation
> The Postgres Database Company



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


logicaldecodng_standby_v5.tar.gz
Description: GNU Zip compressed data


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-12-24 Thread Amit Khandekar
On Tue, 24 Dec 2019 at 10:41, Amit Kapila  wrote:
>
> On Fri, Dec 20, 2019 at 9:31 AM Amit Khandekar  wrote:
> > Attached are the patches from master back up to 94 branch.
> >
> > PG 9.4 and 9.5 have a common patch to be applied :
> > pg94_95_use_vfd_for_logrep.patch
> > From PG 9.6 onwards, each version has a separate patch.
> >
> > For PG 9.6, there is no logical decoding perl test file. So I have
> > made a new file 006_logical_decoding_spill.pl that has only the
> > specific testcase. Also, for building the test_decoding.so, I had to
> > add the EXTRA_INSTALL=contrib/test_decoding line in the
> > src/test/recovery/Makefile, because this is the first time we are
> > using the plugin in the 9.6 tap test.
> >
>
> I am not sure if we need to go that far for 9.6 branch.  If the other
> tests for logical decoding are not present, then I don't see why we
> need to create a new test file for this test only.  Also, I think this
> will make the patch the same for 9.4,9.5 and 9.6.

Ok. I tested pg94_95_use_vfd_for_logrep.patch for 9.6 branch, and it
works there. So please use this patch for all the three branches.

>
> > From PG 10 onwards, pgstat_report_*() calls around read() are removed
> > in the patch, because FileRead() itself reports the wait events.
> >
>
> Why there are different patches for 10 and 11?
For PG 10, OpenTransientFile() and PathNameOpenFile() each have an
extra parameter for specifying file creation modes such as S_IRUSR or
S_IWUSR. In 11, these functions don't accept the mode flags; rather,
the file is always opened with PG_FILE_MODE_OWNER. Because of these
differences in the calls, the PG 10 patch does not apply on 11.


>
> We should try to minimize the difference between patches in different 
> branches.
Ok.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Minimal logical decoding on standbys

2019-12-24 Thread Amit Khandekar
On Thu, 19 Dec 2019 at 01:02, Rahila Syed  wrote:
>
> Hi,
>
>> Hi, do you consistently get this failure on your machine ? I am not
>> able to get this failure, but I am going to analyze when/how this can
>> fail. Thanks
>>
> Yes, I am getting it each time I run make -C src/test/recovery/ check 
> PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl
> Also, there aren't any errors in logs indicating the cause.

Thanks for the reproduction. Finally I could reproduce the behaviour;
it occurs once in 7-8 runs of the test on my machine. The issue is: on
the master, the catalog_xmin does not immediately get updated. It
happens only after the hot standby feedback reaches the master. And I
haven't used wait_for_xmins() for these failing cases. I should use
that. Working on the same ...

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-12-19 Thread Amit Khandekar
On Thu, 19 Dec 2019 at 11:59, Amit Kapila  wrote:
>
> On Wed, Dec 18, 2019 at 12:34 PM Amit Khandekar  
> wrote:
> >
> > On Tue, 17 Dec 2019 at 17:40, Amit Khandekar  wrote:
> > > By the way, the backport patch is turning out to be simpler. It's
> > > because in pre-12 versions, the file offset is part of the Vfd
> > > structure, so all the offset handling is not required.
> >
> > Please have a look at the attached backport patch for the PG 11 branch.
> > Once you are ok with the patch, I will port it on other branches.
> > Note that in the patch, wherever applicable I have renamed the fd
> > variable to vfd to signify that it is a vfd, and not the kernel fd. If
> > we don't do the renaming, the patch would be still smaller, but I
> > think the renaming makes sense.
> >
>
> The other usage of PathNameOpenFile in md.c is already using 'fd' as a
> variable name (also, if you see example in fd.h, that also uses fd as
> variable name), so I don't see any problem with using fd especially if
> that leads to lesser changes.

Ok. I have retained fd name.

> Apart from that, your patch LGTM.
Attached are the patches from master back to the 9.4 branch.

PG 9.4 and 9.5 share a common patch: pg94_95_use_vfd_for_logrep.patch.
From PG 9.6 onwards, each version has a separate patch.

For PG 9.6, there is no logical decoding perl test file. So I have
made a new file 006_logical_decoding_spill.pl that has only the
specific testcase. Also, for building the test_decoding.so, I had to
add the EXTRA_INSTALL=contrib/test_decoding line in the
src/test/recovery/Makefile, because this is the first time we are
using the plugin in the 9.6 tap test.

From PG 10 onwards, pgstat_report_*() calls around read() are removed
in the patch, because FileRead() itself reports the wait events.

From PG 12 onwards, the vfd offset handling had to be added, because
the offset is not present in the Vfd structure.

In master, logical_decoding_work_mem is used in the test file.


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


use_vfd_for_logrep_patches.tar.gz
Description: GNU Zip compressed data


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-12-17 Thread Amit Khandekar
On Tue, 17 Dec 2019 at 17:40, Amit Khandekar  wrote:
> By the way, the backport patch is turning out to be simpler. It's
> because in pre-12 versions, the file offset is part of the Vfd
> structure, so all the offset handling is not required.

Please have a look at the attached backport patch for the PG 11 branch.
Once you are ok with the patch, I will port it on other branches.
Note that in the patch, wherever applicable I have renamed the fd
variable to vfd to signify that it is a vfd, and not the kernel fd. If
we don't do the renaming, the patch would be still smaller, but I
think the renaming makes sense.

The recovery TAP tests don't seem to exist on the 9.4 and 9.5
branches, so I think it's OK not to include any tests with the patches
on those branches.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


pg11_use_vfd_for_logrep.patch
Description: Binary data


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-12-17 Thread Amit Khandekar
On Mon, 16 Dec 2019 at 16:52, Amit Kapila  wrote:
>
> On Mon, Dec 16, 2019 at 3:26 PM Amit Khandekar  wrote:
> >
> > On Sat, 14 Dec 2019 at 11:59, Amit Kapila  wrote:
> > >
> > > I have also made minor changes related to below code in patch:
> > > - else if (readBytes != sizeof(ReorderBufferDiskChange))
> > > +
> > > + file->curOffset += readBytes;
> > > +
> > > + if (readBytes !=
> > > sizeof(ReorderBufferDiskChange))
> > >
> > > Why the size is added before the error check?
> > The logic was : even though it's an error that the readBytes does not
> > match the expected size, the file read is successful so update the vfd
> > offset as early as possible. In our case, this might not matter much,
> > but who knows, in the future, in the exception block (say, in
> > ReorderBufferIterTXNFinish, someone assumes that the file offset is
> > correct and does something with that, then we will get in trouble,
> > although I agree that it's very unlikely.
> >
>
> I am not sure if there is any such need, but even if it is there, I
> think updating after a *short* read (read less than expected) doesn't
> seem like a good idea because there is clearly some problem with the
> read call.  Also, in the case below that case where we read the actual
> change data, the offset is updated after the check of *short* read.  I
> don't see any advantage in such an inconsistency.  I still feel it is
> better to update the offset after all error checks.
Ok, no problem; I don't see any harm in doing the updates after the size checks.

By the way, the backport patch is turning out to be simpler. It's
because in pre-12 versions, the file offset is part of the Vfd
structure, so all the offset handling is not required.

>
> >
> > > and see if you can run perltidy for the test file.
> > Hmm, I tried perltidy, and it seems to mostly add a space after ( and
> > a space before ) if there's already; so "('postgres'," is replaced by
> > "( 'postgres',". And this is going to be inconsistent with
> > other places. And it replaces tab with spaces. Do you think we should
> > try perltidy, or have we before been using this tool for the tap tests
> > ?
> >
>
> See text in src/test/perl/README (Note that all tests and test tools
> should have perltidy run on them before patches are submitted, using
> perltidy - profile=src/tools/pgindent/perltidyrc).  It is recommended
> to use perltidy.
>
> Now, if it is making the added code inconsistent with nearby code,
> then I suggest to leave it.
In many places it becomes inconsistent, but I will see if there are
places where it does make sense and does not break consistency.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-12-16 Thread Amit Khandekar
On Sat, 14 Dec 2019 at 11:59, Amit Kapila  wrote:
>
> On Thu, Dec 12, 2019 at 9:50 PM Amit Khandekar  wrote:
> >
> > Attached is a v4 patch that also addresses your code comments so far.
> > I have included the test case in 006_logical_decoding.pl. I observed
> > that the test case just adds only about 0.5 to 1 sec time. Please
> > verify on your env also, and also whether the test reproduces the
> > issue without the code changes.
> >
>
> It takes roughly the same time on my machine as well.  I have checked
> on Windows as well, it increases the time from 14 to 16 (17) seconds
> for this test.  I don't think this is any big increase considering the
> timing of other tests and it would be good to have a test for such
> boundary conditions.  I have slightly change the comments in the patch
> and ran pgindent.  Attached, find the patch with a proposed commit
> message.
>
> I have also made minor changes related to below code in patch:
> - else if (readBytes != sizeof(ReorderBufferDiskChange))
> +
> + file->curOffset += readBytes;
> +
> + if (readBytes !=
> sizeof(ReorderBufferDiskChange))
>
> Why the size is added before the error check?
The logic was: even though it's an error that readBytes does not
match the expected size, the file read itself is successful, so update
the vfd offset as early as possible. In our case, this might not matter
much, but who knows: in the future, in the exception block (say, in
ReorderBufferIterTXNFinish), someone might assume that the file offset
is correct and do something with it, and then we would get into
trouble, although I agree that it's very unlikely. But IMO, because we
want to simulate file offset support in the vfd, we should update the
file offset immediately after a file read is known to have succeeded.

>  I think it should be
> after that check, so changed accordingly.  Similarly, I don't see why
> we need to change 'else if' to 'if' in this code, so changed back.
Since adding the size before the error check required removing the
else-if, I removed the else-if at the surrounding places as well, for
consistency.

>
> I think we need to change/tweak the test for back branches as there we
> don't have logical_decoding_work_mem.  Can you please look into that
Yeah, I believe we need to backport back to PG 9.4, where logical
decoding was introduced, so I am first trying it out on the 9.4 branch.

> and see if you can run perltidy for the test file.
Hmm, I tried perltidy, and it seems to mostly add a space after '(' and
a space before ')' where there isn't one already; so "('postgres'," is
replaced by "( 'postgres',". This is going to be inconsistent with
other places. It also replaces tabs with spaces. Do you think we should
try perltidy? Have we used this tool for the tap tests before?


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Minimal logical decoding on standbys

2019-12-15 Thread Amit Khandekar
On Thu, 12 Dec 2019 at 15:28, Rahila Syed  wrote:
>
> Hi Amit,
>
>
> 2.  Following test fails with error.
> make -C src/test/recovery/ check 
> PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl
> #   Failed test 'physical catalog_xmin not null'
> #   at t/018_standby_logical_decoding_xmins.pl line 120.
> #  got: ''
> # expected: anything else
>
> #   Failed test 'physical catalog_xmin not null'
> #   at t/018_standby_logical_decoding_xmins.pl line 141.
> #  got: ''
> # expected: anything else
>
> #   Failed test 'physical catalog_xmin not null'
> #   at t/018_standby_logical_decoding_xmins.pl line 159.
> #  got: ''
> # expected: anything else
> t/018_standby_logical_decoding_xmins.pl .. 20/27 # poll_query_until timed out 
> executing this query:
> #
>
> Physical catalog_xmin is NULL on master after logical slot creation on 
> standby .

Hi, do you consistently get this failure on your machine ? I am not
able to get this failure, but I am going to analyze when/how this can
fail. Thanks





--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-12-12 Thread Amit Khandekar
On Thu, 12 Dec 2019 at 14:18, Amit Kapila  wrote:
>
> On Thu, Dec 12, 2019 at 11:53 AM Amit Khandekar  
> wrote:
> >
> > On Thu, 12 Dec 2019 at 11:34, Amit Khandekar  wrote:
> >
> > > So max_changes_in_memory is the one
> > > that allows us to reduce the number of transactions required, so we
> > > can cut down on the outer loop iterations and make the test finish
> > > much earlier.
> >
> > >
> > > But also note that, we can't use the test suite in
> > > contrib/test_decoding, because max_changes_in_memory needs server
> > > restart. So we need to shift this test to src/test/recovery. And
> > > there, I guess it is not that critical for the testcase to be very
> > > quick because the tests in general are much slower than the ones in
> > > contrib/test_decoding, although it would be nice to make it fast. What
> > > I propose is to modify  max_changes_in_memory, do a server restart
> > > (which takes hardly a sec), run the testcase (3.5 sec) and then
> > > restart after resetting the guc. So totally it will be around 4-5
> > > seconds.
> >
> > Sorry I meant max_files_per_process.
> >
>
> Okay, what time other individual tests take in that directory on your
> machine?
For the src/test/recovery directory, a test takes about 4-5 seconds on average.

> How about providing a separate test patch for this case so
> that I can also test it?
Attached is a v4 patch that also addresses your code comments so far.
I have included the test case in 006_logical_decoding.pl. I observed
that the test case just adds only about 0.5 to 1 sec time. Please
verify on your env also, and also whether the test reproduces the
issue without the code changes.

> I think we can take the opinion of others as
> well if they are fine with adding this test, otherwise, we can go
> ahead with the main patch.
Sure.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


use_vfd_for_logrep_v4.patch
Description: Binary data


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-12-11 Thread Amit Khandekar
On Thu, 12 Dec 2019 at 11:34, Amit Khandekar  wrote:

> So max_changes_in_memory is the one
> that allows us to reduce the number of transactions required, so we
> can cut down on the outer loop iterations and make the test finish
> much earlier.

>
> But also note that, we can't use the test suite in
> contrib/test_decoding, because max_changes_in_memory needs server
> restart. So we need to shift this test to src/test/recovery. And
> there, I guess it is not that critical for the testcase to be very
> quick because the tests in general are much slower than the ones in
> contrib/test_decoding, although it would be nice to make it fast. What
> I propose is to modify  max_changes_in_memory, do a server restart
> (which takes hardly a sec), run the testcase (3.5 sec) and then
> restart after resetting the guc. So totally it will be around 4-5
> seconds.

Sorry, I meant max_files_per_process. We need to reduce
max_files_per_process so that max_safe_fds is reduced; then only a few
transactions are sufficient to reproduce the problem, because
reserveAllocatedDesc() returns false much sooner with a low
max_safe_fds.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-12-11 Thread Amit Khandekar
On Thu, 12 Dec 2019 at 09:49, Amit Kapila  wrote:
>
> On Wed, Dec 11, 2019 at 4:17 PM Amit Khandekar  wrote:
> >
> > On Sat, 7 Dec 2019 at 11:37, Amit Kapila  wrote:
> > >
> > > On Fri, Dec 6, 2019 at 5:00 PM Amit Khandekar  
> > > wrote:
> > > >
> > > > On Fri, 6 Dec 2019 at 15:40, Amit Kapila  
> > > > wrote:
> > > > >
> > > > > 1.
> > > > > + /* Now that the state fields are initialized, it is safe to return 
> > > > > it. */
> > > > > + *iter_state = state;
> > > > > +
> > > > >   /* allocate heap */
> > > > >   state->heap =
> > > > > binaryheap_allocate(state->nr_txns,
> > > > > ReorderBufferIterCompare,
> > > > >
> > > > > Is there a reason for not initializing iter_state after
> > > > > binaryheap_allocate?  If we do so, then we don't need additional check
> > > > > you have added in ReorderBufferIterTXNFinish.
> > > >
> > > > If iter_state is initialized *after* binaryheap_allocate, then we
> > > > won't be able to close the vfds if binaryheap_allocate() ereports().
> > > >
> > >
> > > Is it possible to have vfds opened before binaryheap_allocate(), if so 
> > > how?
> > No it does not look possible for the vfds to be opened before
> > binaryheap_allocate(). But actually, the idea behind placing the
> > iter_state at the place where I put is that, we should return back the
> > iter_state at the *earliest* place in the code where it is safe to
> > return.
> >
>
> Sure, I get that point, but it seems it is equally good to do this
> after binaryheap_allocate().  It will be sligthly better because if
> there is any error in binaryheap_allocate, then we don't need to even
> call ReorderBufferIterTXNFinish().

All right. Will do that.

>
> >
> > >> 4. I think we should also check how much time increase will happen for
> > >> test_decoding regression test after the test added by this patch?
> > Amit Khandekar  wrote:
> > > Yeah, it currently takes noticeably longer compared to the others.
> > > Let's see if setting logical_decoding_work_mem to a min value allows
> > > us to reproduce the test with much lesser number of inserts.
> >
> > The test in the patch takes around 20 seconds, as compared to the max
> > time of 2 seconds any of the other tests take in that test suite.
> >
> > But if we set the max_files_per_process to a very low value (say 26)
> > as against the default 1000, we can reproduce the issue with as low as
> > 20 sub-transactions as against the 600 that I used in spill.sql test.
> > And with this, the test runs in around 4 seconds, so this is good.
> >
>
> Do you get 4s even after setting the minimum value of
> logical_decoding_work_mem?  I think 4s is also too much for this test
> which is going to test one extreme scenario. Can you try with some
> bigger rows or something else to reduce this time?  I think if we can
> get it close to 2s or whatever maximum time taken by any other logical
> decoding tests, then good, otherwise, it doesn't seem like a good idea
> to add this test.

>
> > But
> > the problem is : max_files_per_process needs server restart. So either
> > we have to shift this test to src/test/recovery in one of the
> > logical_decoding test, or retain it in contrib/test_decoding and let
> > it run for 20 seconds. Let me know if you figure out any other
> > approach.
> >
>
> I don't think 20s for one test is acceptable.  So, we should just give
> up that path, you can try what I suggested above, if we can reduce the
> test time, then good, otherwise, I suggest to drop the test from the
> patch.  Having said that, we should try our best to reduce the test
> time as it will be good if we can have such a test.

I tried; it is actually roughly 3.4 seconds. Note that reducing
logical_decoding_work_mem does not reduce the test time; it causes the
serialization to start early, and so increases the chance of
reproducing the problem. During restore of the serialized data, we
still use max_changes_in_memory. So max_changes_in_memory is the one
that allows us to reduce the number of transactions required, letting
us cut down on the outer-loop iterations and make the test finish much
earlier.

But also note that we can't use the test suite in
contrib/test_decoding, because max_changes_in_memory needs a server
restart. So we need to shift this test to src/test/recovery. And
there, I guess it is not that critical for the testcase to be very
quick, because those tests are in general much slower than the ones in
contrib/test_decoding, although it would be nice to make it fast. What
I propose is to modify max_changes_in_memory, do a server restart
(which takes hardly a second), run the testcase (3.5 sec), and then
restart again after resetting the GUC. So in total it will be around
4-5 seconds.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-12-11 Thread Amit Khandekar
On Sat, 7 Dec 2019 at 11:37, Amit Kapila  wrote:
>
> On Fri, Dec 6, 2019 at 5:00 PM Amit Khandekar  wrote:
> >
> > On Fri, 6 Dec 2019 at 15:40, Amit Kapila  wrote:
> > >
> > >
> > > Few comments:
> > > --
> > >
> > > 1.
> > > + /* Now that the state fields are initialized, it is safe to return it. 
> > > */
> > > + *iter_state = state;
> > > +
> > >   /* allocate heap */
> > >   state->heap =
> > > binaryheap_allocate(state->nr_txns,
> > > ReorderBufferIterCompare,
> > >
> > > Is there a reason for not initializing iter_state after
> > > binaryheap_allocate?  If we do so, then we don't need additional check
> > > you have added in ReorderBufferIterTXNFinish.
> >
> > If iter_state is initialized *after* binaryheap_allocate, then we
> > won't be able to close the vfds if binaryheap_allocate() ereports().
> >
>
> Is it possible to have vfds opened before binaryheap_allocate(), if so how?
No, it does not look possible for the vfds to be opened before
binaryheap_allocate(). But the idea behind placing the iter_state
assignment where I did is that we should hand back the iter_state at
the *earliest* place in the code where it is safe to return it.


> > I couldn't reproduce the original problem (on HEAD) reported with the
> > test case in the patch.  So, I can't verify the fix.  I think it is
> > because of recent commits cec2edfa7859279f36d2374770ca920c59c73dd8 and
> > 9290ad198b15d6b986b855d2a58d087a54777e87.  It seems you need to either
> > change the value of logical_decoding_work_mem or change the test in
> > some way so that the original problem can be reproduced.
Amit Khandekar  wrote:
> Yeah, it does seem like the commit cec2edfa78592 must have caused the
> test to not reproduce on your env, although the test does fail for me
> still. Setting logical_decoding_work_mem to a low value does sound
> like a good idea. Will work on it.

I checked that setting logical_decoding_work_mem to its min value
(64KB) causes early serialization. So I think if you set this, you
should be able to reproduce with the spill.sql test that has the new
testcase. I will anyway set this value in the test. Also check below
...

>> 4. I think we should also check how much time increase will happen for
>> test_decoding regression test after the test added by this patch?
Amit Khandekar  wrote:
> Yeah, it currently takes noticeably longer compared to the others.
> Let's see if setting logical_decoding_work_mem to a min value allows
> us to reproduce the test with much lesser number of inserts.

The test in the patch takes around 20 seconds, as compared to the max
time of 2 seconds any of the other tests take in that test suite.

But if we set the max_files_per_process to a very low value (say 26)
as against the default 1000, we can reproduce the issue with as low as
20 sub-transactions as against the 600 that I used in spill.sql test.
And with this, the test runs in around 4 seconds, so this is good. But
the problem is : max_files_per_process needs server restart. So either
we have to shift this test to src/test/recovery in one of the
logical_decoding test, or retain it in contrib/test_decoding and let
it run for 20 seconds. Let me know if you figure out any other
approach.


>
> > >
> > > 5. One naive question about the usage of PathNameOpenFile().  When it
> > > reaches the max limit, it will automatically close one of the files,
> > > but how will that be reflected in the data structure (TXNEntryFile)
> > > you are managing.  Basically, when PathNameOpenFile closes some file,
> > > how will the corresponding vfd in TXNEntryFile be changed. Because if
> > > it is not changed, then won't it start pointing to some wrong
> > > filehandle?
> >
> > In PathNameOpenFile(), excess kernel fds could be closed
> > (ReleaseLruFiles). But with that, the vfds themselves don't get
> > invalidated. Only the underlying kernel fd gets closed, and the
> > vfd->fd is marked VFD_CLOSED. The vfd array element remains valid (a
> > non-null vfd->fileName means the vfd slot is valid; check
> > FileIsValid). So later, when FileRead(vfd1) is called and that vfd1
> > happens to be the one that had got it's kernel fd closed, it gets
> > opened again through FileAccess().
> >
>
> I was under impression that once the fd is closed due to excess kernel
> fds that are opened, the slot in VfdCache array could be resued by
> someone else, but on closer inspection that is not true.  It will be
> only available for reuse after we explicitly call FileClose, right?

Yes, that's right.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-12-06 Thread Amit Khandekar
On Fri, 6 Dec 2019 at 15:40, Amit Kapila  wrote:
>
> On Thu, Dec 5, 2019 at 4:20 PM Amit Kapila  wrote:
> >
> > On Tue, Dec 3, 2019 at 11:10 AM Amit Khandekar  
> > wrote:
> > >
> > >
> > > Done as stated above; attached v3 patch. I have verified that the file
> > > handles do get closed in PG_CATCH block via
> > > ReorderBufferIterTXNFinish().
> > >
> >
> > I couldn't reproduce the original problem (on HEAD) reported with the
> > test case in the patch.  So, I can't verify the fix.  I think it is
> > because of recent commits cec2edfa7859279f36d2374770ca920c59c73dd8 and
> > 9290ad198b15d6b986b855d2a58d087a54777e87.  It seems you need to either
> > change the value of logical_decoding_work_mem or change the test in
> > some way so that the original problem can be reproduced.

Yeah, it does seem like the commit cec2edfa78592 must have caused the
test to not reproduce on your env, although the test does fail for me
still. Setting logical_decoding_work_mem to a low value does sound
like a good idea. Will work on it.

> >
>
> Few comments:
> --
>
> 1.
> + /* Now that the state fields are initialized, it is safe to return it. */
> + *iter_state = state;
> +
>   /* allocate heap */
>   state->heap =
> binaryheap_allocate(state->nr_txns,
> ReorderBufferIterCompare,
>
> Is there a reason for not initializing iter_state after
> binaryheap_allocate?  If we do so, then we don't need additional check
> you have added in ReorderBufferIterTXNFinish.

If iter_state is initialized *after* binaryheap_allocate, then we
won't be able to close the vfds if binaryheap_allocate() ereports().

>
> 2.
> /* No harm in resetting the offset even in case of failure */
> file->curOffset = 0;
>
> The above comment is not clear because you are not setting it in case
> of error rather this is a success path.

I meant that even if PathNameOpenFile() fails, it is OK to set
file->curOffset to 0, so we need not set it only in the *fd >= 0 case.
In most cases fd will be valid, so we just set file->curOffset to 0
unconditionally.

>
> 3.
> + *
> + * Note: The iterator state is returned through iter_state parameter rather
> + * than the function's return value. This is because the state gets cleaned 
> up
> + * in a PG_CATCH block, so we want to make sure the caller gets back the 
> state
> + * even if this function throws an exception, so that the state resources can
> + * be cleaned up.
>
> How about changing it slightly as follows to make it more clear.
> "Note: The iterator state is returned through iter_state parameter
> rather than the function's return value.  This is because the state
> gets cleaned up in a PG_CATCH block in the caller, so we want to make
> sure the caller gets back the state even if this function throws an
> exception."

Agreed. Will do that in the next patch version.

>
> 4. I think we should also check how much time increase will happen for
> test_decoding regression test after the test added by this patch?
Yeah, it currently takes noticeably longer compared to the others.
Let's see if setting logical_decoding_work_mem to its minimum value
allows us to reproduce the issue with a much smaller number of inserts.

>
> 5. One naive question about the usage of PathNameOpenFile().  When it
> reaches the max limit, it will automatically close one of the files,
> but how will that be reflected in the data structure (TXNEntryFile)
> you are managing.  Basically, when PathNameOpenFile closes some file,
> how will the corresponding vfd in TXNEntryFile be changed. Because if
> it is not changed, then won't it start pointing to some wrong
> filehandle?

In PathNameOpenFile(), excess kernel fds can be closed
(ReleaseLruFiles). But with that, the vfds themselves don't get
invalidated. Only the underlying kernel fd gets closed, and vfd->fd is
marked VFD_CLOSED. The vfd array element remains valid (a non-null
vfd->fileName means the vfd slot is valid; check FileIsValid). So
later, when FileRead(vfd1) is called and that vfd1 happens to be the
one whose kernel fd was closed, the file gets opened again through
FileAccess().

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-12-02 Thread Amit Khandekar
On Wed, 27 Nov 2019 at 14:16, Amit Khandekar  wrote:
> What I found was : We do attempt to close the opened vfds in the
> PG_CATCH block. In ReorderBufferCommit(), ReorderBufferIterTXNFinish
> is called both in PG_TRY and PG_CATCH. This closes all the opened
> vfds. But the issue is : if the ereport() occurs inside
> ReorderBufferIterTXNInit(), then iterstate is still NULL. So in
> PG_CATCH, ReorderBufferIterTXNFinish() is not called, so the vfds in
> state->entries[] remain open.
>
> We can have iterstate passed to ReorderBufferIterTXNInit() as another
> argument, and initialize it first thing inside the function. This way,
> it will never be NULL. But we need to be careful about the possibility
> of having an iterstate in a half-cooked state, where cleanup might use
> some uninitialized handles. Will work on it. At least, we can make sure
> the iterstate->entries handles don't have junk values.

Done as stated above; attached v3 patch. I have verified that the file
handles do get closed in PG_CATCH block via
ReorderBufferIterTXNFinish().

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


use_vfd_for_logrep_v3.patch
Description: Binary data


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-27 Thread Amit Khandekar
On Tue, 26 Nov 2019 at 12:10, Amit Kapila  wrote:
>
> On Tue, Nov 26, 2019 at 11:19 AM Amit Khandekar  
> wrote:
> >
> > On Tue, 26 Nov 2019 at 10:49, Amit Kapila  wrote:
> > >
> > >
> > > So, what is the next step here?  How about if we somehow check whether
> > > the file exists before doing unlink, say by using stat?
> > But the thing is, the behaviour is so much in a grey area, that we
> > cannot reliably say for instance that when stat() says there is no
> > such file, there is indeed no such file,
> >
>
> Why so?
This was just an example. What I meant was, we are really not sure of
the behaviour of file operations when the file is in this state.
unlink() returning "Permission denied" when called twice is weird
enough in itself.

>
> > and if we re-create  the same
> > file when it is still open, it is always going to open a new file,
> > etc.
> >
>
> Yeah, or maybe even if we don't create with the same name, there is
> always be some dangling file which again doesn't sound like a good
> thing.

Right.

>
> > > If that doesn't work, I think we might need to go in the direction of 
> > > tracking
> > > file handles in some way, so that they can be closed during an abort.
> > Yeah, that is one way. I am still working on different approaches.
> > WIll get back with proposals.
> >
>
> Fair enough.  See, if you can also consider an approach that is local
> to ReorderBuffer module wherein we can track those handles in
> ReorderBufferTxn or some other place local to that module.

What I found was : We do attempt to close the opened vfds in the
PG_CATCH block. In ReorderBufferCommit(), ReorderBufferIterTXNFinish
is called both in PG_TRY and PG_CATCH. This closes all the opened
vfds. But the issue is : if the ereport() occurs inside
ReorderBufferIterTXNInit(), then iterstate is still NULL. So in
PG_CATCH, ReorderBufferIterTXNFinish() is not called, so the vfds in
state->entries[] remain open.

We can have iterstate passed to ReorderBufferIterTXNInit() as another
argument, and initialize it first thing inside the function. This way,
it will never be NULL. But we need to be careful about the possibility
of having an iterstate in a half-cooked state, where cleanup might use
some uninitialized handles. Will work on it. At least, we can make sure
the iterstate->entries handles don't have junk values.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-25 Thread Amit Khandekar
On Tue, 26 Nov 2019 at 10:49, Amit Kapila  wrote:
>
> On Fri, Nov 22, 2019 at 7:38 PM Amit Khandekar  wrote:
> >
> > On Fri, 22 Nov 2019 at 4:26 PM, Amit Kapila  wrote:
> >>
> >> I think this is exactly the reason for the problem.  In my test [1],
> >> the error "permission denied" occurred when I second time executed
> >> pg_logical_slot_get_changes() which means on first execution the
> >> unlink would have been successful but the files are still not removed
> >> as they were not closed. Then on second execution, it gets an error
> >> "Permission denied" when it again tries to unlink files via
> >> ReorderBufferCleanupSerializedTXNs().
> >>
> >>
> >> .
> >> > But what you are seeing is "Permission denied" errors. Not sure why
> >> > unlink() is failing.
> >> >
> >>
> >> In your test program, if you try to unlink the file second time, you
> >> should see the error "Permission denied".
> >
> >  I tested using the sample program and indeed I got the error 5 (access 
> > denied) when I called unlink the second time.
> >
>
> So, what is the next step here?  How about if we somehow check whether
> the file exists before doing unlink, say by using stat?
But the thing is, the behaviour is so much of a grey area that we
cannot reliably say, for instance, that when stat() says there is no
such file, there is indeed no such file, or that re-creating the same
file while it is still open will always open a new file, etc.

> If that doesn't work, I think we might need to go in the direction of tracking
> file handles in some way, so that they can be closed during an abort.
Yeah, that is one way. I am still working on different approaches.
Will get back with proposals.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-22 Thread Amit Khandekar
On Fri, 22 Nov 2019 at 4:26 PM, Amit Kapila  wrote:

> On Fri, Nov 22, 2019 at 11:00 AM Amit Khandekar 
> wrote:
> >
> > On Fri, 22 Nov 2019 at 09:08, Amit Kapila 
> wrote:
> > > Have you tried before that fix , if not, can you once try by
> > > temporarily reverting that fix in your environment and share the
> > > output of each step?  After you get the error due to EOF, check that
> > > you have .spill files in pg_replslot// and then again try
> > > to get changes by pg_logical_slot_get_changes().   If you want, you
> > > can use the test provided in Amit Khandekar's patch.
> >
> > On my Linux machine, I added elog() in ReorderBufferRestoreChanges(),
> > just after FileRead() returns 0. This results in error. But the thing
> is, in
> > ReorderBufferCommit(), the error is already handled using PG_CATCH :
> >
> > PG_CATCH();
> > {
> > .
> >AbortCurrentTransaction();
> > ...
> >if (using_subtxn)
> >   RollbackAndReleaseCurrentSubTransaction();
> > 
> > 
> >/* remove potential on-disk data, and deallocate */
> >ReorderBufferCleanupTXN(rb, txn);
> > }
> >
> > So ReorderBufferCleanupTXN() removes all the .spill files using unlink().
> >
> > And on Windows, what should happen is : unlink() should succeed
> > because the file is opened using FILE_SHARE_DELETE. But the files
> > should still remain there because these are still open. It is just
> > marked for deletion until there is no one having opened the file. That
> > is what is my conclusion from running a sample attached program test.c
> >
>
> I think this is exactly the reason for the problem.  In my test [1],
> the error "permission denied" occurred when I second time executed
> pg_logical_slot_get_changes() which means on first execution the
> unlink would have been successful but the files are still not removed
> as they were not closed. Then on second execution, it gets an error
> "Permission denied" when it again tries to unlink files via
> ReorderBufferCleanupSerializedTXNs().
>
>
> .
> > But what you are seeing is "Permission denied" errors. Not sure why
> > unlink() is failing.
> >
>
> In your test program, if you try to unlink the file second time, you
> should see the error "Permission denied".

 I tested using the sample program and indeed I got the error 5 (access
denied) when I called unlink the second time.

>
>
> [1] -
> https://www.postgresql.org/message-id/CAA4eK1%2Bcey6i6a0zD9kk_eaDXb4RPNZqu4UwXO9LbHAgMpMBkg%40mail.gmail.com
>
>
>
> --
> With Regards,
> Amit Kapila.
> EnterpriseDB: http://www.enterprisedb.com
>
-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-21 Thread Amit Khandekar
On Fri, 22 Nov 2019 at 09:08, Amit Kapila  wrote:
> Have you tried before that fix , if not, can you once try by
> temporarily reverting that fix in your environment and share the
> output of each step?  After you get the error due to EOF, check that
> you have .spill files in pg_replslot// and then again try
> to get changes by pg_logical_slot_get_changes().   If you want, you
> can use the test provided in Amit Khandekar's patch.

On my Linux machine, I added an elog() in ReorderBufferRestoreChanges(),
just after FileRead() returns 0. This results in an error. But the thing is,
in ReorderBufferCommit(), the error is already handled using PG_CATCH:

PG_CATCH();
{
.
   AbortCurrentTransaction();
...
   if (using_subtxn)
  RollbackAndReleaseCurrentSubTransaction();


   /* remove potential on-disk data, and deallocate */
   ReorderBufferCleanupTXN(rb, txn);
}

So ReorderBufferCleanupTXN() removes all the .spill files using unlink().

And on Windows, what should happen is: unlink() should succeed
because the file is opened using FILE_SHARE_DELETE, but the files
should still remain there because they are still open. The file is
just marked for deletion until no one has it open anymore. That is my
conclusion from running the attached sample program, test.c.
.
But what you are seeing is "Permission denied" errors. Not sure why
unlink() is failing.

The thing that is still a problem: on Windows, if the file remains
open, then even when a later unlink() succeeds, the file will be left
there until it is closed, so subsequent operations will open the same
old file. Not sure what happens if we open a file that is already
marked for deletion.

--

Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <io.h>

void pgwin32_open(const char *fileName, int fileFlags,...)
{

	HANDLE		h;
	SECURITY_ATTRIBUTES sa;

	sa.nLength = sizeof(sa);
	sa.bInheritHandle = TRUE;
	sa.lpSecurityDescriptor = NULL;
	
	h = CreateFile(fileName,
		   GENERIC_READ,
	/* These flags allow concurrent rename/unlink */
		   (FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE),
		   &sa,
		   OPEN_EXISTING,
		   FILE_ATTRIBUTE_NORMAL ,
		   NULL);

	if (h == INVALID_HANDLE_VALUE)
	{
		printf("File could not be opened. Error %d\n", GetLastError());
		exit(1);
	}

	/* CloseHandle(h); */


	if (_unlink(fileName) != 0)
	{
		printf("File could not be deleted. Error %d\n", GetLastError());
		exit(1);
	}
}

int main(void)
{
	pgwin32_open("c:/temp/amit/file", 0 /* O_RDONLY | PG_BINARY */);
}


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-20 Thread Amit Khandekar
On Wed, 20 Nov 2019 at 19:24, Alvaro Herrera  wrote:
>
> On 2019-Nov-20, Juan José Santamaría Flecha wrote:
>
> > I was not able to reproduce the Permission denied error with current HEAD,
> > until I opened another CMD inside the "pg_replslot/regression_slot" folder.
> > This will be problematic, is the deletion of the folder actually needed?
>
> Yes :-(  The code assumes that if the directory is there, then it's
> valid.  Trying to remove that assumption is probably a more invasive
> fix.
>
> I think ReplicationSlotDropAcquired is too pessimistic (no recourse if
> the rename fails) and too optimistic (this will almost never happen).
> We could change it so that the rename is retried a few times, and avoid
> the failure.  (Naturally, the rmtree should also be retried.)  The code
> seems written with the POSIX semantics in mind, but it seems easy to
> improve.

Just to be clear, there are two issues being discussed here :

1. Issue with the patch, where pg_replslot/slotname/xid-*.spill files
can't be removed because the same backend process has left these files
opened because of an abort. This is happening despite the file being
opened with the FILE_SHARE_DELETE flag. I am going to investigate
(possibly the flag is not applicable when a single process is
involved).

2. This existing issue where pg_replslot/slotname directory removal
will fail if someone else is accessing this directory. This has
nothing to do with the patch.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-20 Thread Amit Khandekar
On Wed, 20 Nov 2019 at 13:10, Amit Kapila  wrote:
> See comment in pgunlink() "We need to loop because even though
> PostgreSQL uses flags that allow unlink while the file is open, other
> applications might have the file
> open without those flags.".  Can you once see if there is any flag
> that you have missed to pass to allow this?

> If there is nothing we
> can do about it, then we might need to use some different API or maybe
> define a new API that can handle this.

There were objections against modifying the vfd API only for this
replication-related use-case. And having a new API would require
re-implementing all the machinery of the virtual-FDs feature that we
need from vfd. If nothing works out with the FILE_SHARE_DELETE
approach, I am thinking we can use VFDs, plus keep track of
per-subtransaction vfd handles and do something similar to
AtEOSubXact_Files().
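A toy sketch of that bookkeeping (all names are hypothetical; this is not the vfd API): each opened handle is remembered together with the subtransaction nesting level that opened it, and an AtEOSubXact_Files()-style hook closes everything opened at the aborting level or deeper.

```c
#include <assert.h>

#define MAX_TRACKED 16

/* Hypothetical tracking table: vfd handle plus the subtransaction
 * nesting level that opened it. */
static int	tracked_vfd[MAX_TRACKED];
static int	tracked_level[MAX_TRACKED];
static int	ntracked = 0;

void
remember_vfd(int vfd, int nest_level)
{
	tracked_vfd[ntracked] = vfd;
	tracked_level[ntracked] = nest_level;
	ntracked++;
}

/* On (sub)transaction abort at the given nesting level, close every
 * handle opened at that level or deeper, mirroring what
 * AtEOSubXact_Files() does for transient files. */
void
at_eosubxact_vfds(int aborting_level)
{
	int		keep = 0;

	for (int i = 0; i < ntracked; i++)
	{
		if (tracked_level[i] >= aborting_level)
			continue;			/* FileClose(tracked_vfd[i]) in real code */
		tracked_vfd[keep] = tracked_vfd[i];
		tracked_level[keep] = tracked_level[i];
		keep++;
	}
	ntracked = keep;
}

int
tracked_count(void)
{
	return ntracked;
}
```

Whether this is done inside fd.c or locally in the reorderbuffer module is exactly the open design question in this thread.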


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-20 Thread Amit Khandekar
On Wed, 20 Nov 2019 at 13:10, Amit Kapila  wrote:
>
> On Tue, Nov 19, 2019 at 4:58 PM Amit Khandekar  wrote:
> >
> > On Tue, 19 Nov 2019 at 14:07, Amit Kapila  wrote:
> > >
> > >
> > > Have you tried by injecting some error?  After getting the error
> > > mentioned above in email, when I retried the same query, I got the
> > > below message.
> > >
> > > postgres=# SELECT 1 from
> > > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
> > > ERROR:  could not remove file
> > > "pg_replslot/regression_slot/xid-1693-lsn-0-1800.spill" during
> > > removal of pg_replslot/regression_slot/xid*: Permission denied
> > >
> > > And, then I tried to drop the replication slot and I got below error.
> > > postgres=# SELECT * FROM pg_drop_replication_slot('regression_slot');
> > > ERROR:  could not rename file "pg_replslot/regression_slot" to
> > > "pg_replslot/regression_slot.tmp": Permission denied
> > >
> > > It might be something related to Windows
> >
> > Oh ok, I missed the fact that on Windows we can't delete the files
> > that are already open, unlike Linux/Unix.
> >
>
> See comment in pgunlink() "We need to loop because even though
> PostgreSQL uses flags that allow unlink while the file is open, other
> applications might have the file
> open without those flags.".  Can you once see if there is any flag
> that you have missed to pass to allow this?  If there is nothing we
> can do about it, then we might need to use some different API or maybe
> define a new API that can handle this.

Hmm, looks like there is one such flag: FILE_SHARE_DELETE. When file
is opened with this flag, other processes can delete as well as rename
the file.

But it turns out that in pgwin32_open(), we already use
FILE_SHARE_DELETE. So, this is again confusing.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-19 Thread Amit Khandekar
On Wed, 20 Nov 2019 at 01:05, Thomas Munro  wrote:
>
> On Wed, Nov 20, 2019 at 7:58 AM Thomas Munro  wrote:
> > On Wed, Nov 20, 2019 at 1:14 AM Juan José Santamaría Flecha
> > > https://devblogs.microsoft.com/oldnewthing/20150121-00/?p=44863
> >
> > !?!

Thanks Juan and Thomas for pointing to these links where already this
was discussed.

>
> One thing I don't understand (besides, apparently, the documentation):
> how did this problem escape detection by check-world for such a long
> time?  Surely we expect to hit the end of various temporary files in
> various tests.  Is it intermittent, or dependent on Windows version,
> or something like that?

Possibly there aren't any callers that try to pread() at end-of-file
using FileRead/pg_pread:

- mdread() seems to read from an offset that it knows is inside the
file, reading the whole BLCKSZ.
- BufFileLoadBuffer() seems to deliberately ignore FileRead()'s return
value if it is -1:
if (file->nbytes < 0) file->nbytes = 0;
- XLogPageRead() also seems to know that the offset is a valid offset.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-19 Thread Amit Khandekar
On Tue, 19 Nov 2019 at 14:07, Amit Kapila  wrote:
>
> On Mon, Nov 18, 2019 at 5:50 PM Amit Khandekar  wrote:
> >
> > On Mon, 18 Nov 2019 at 17:20, Amit Kapila  wrote:
> > > I see that you have made changes in ReorderBufferRestoreChanges to use
> > > PathNameOpenFile, but not in ReorderBufferSerializeTXN.  Is there a
> > > reason for the same?  In my test environment, with the test provided
> > > by you, I got the error (reported in this thread) via
> > > ReorderBufferSerializeTXN.
> >
> > You didn't get this error with the patch applied, did you ?
> >
>
> No, I got this before applying the patch.  However, after applying the
> patch, I got below error in the same test:
>
> postgres=# SELECT 1 from
> pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
> ERROR:  could not read from reorderbuffer spill file: Invalid argument
>
> It seems to me that FileRead API used in the patch can return value <
> 0 on  EOF.  See the API usage in BufFileLoadBuffer.  I got this error
> on a windows machine and in the server log the message was "LOG:
> unrecognized win32 error code: 38" which indicates "Reached the end of
> the file."

On Windows, it is documented that ReadFile() (which is called by
pg_pread) will return false on EOF, but only when the file is open for
asynchronous reads/writes. Here we are just dealing with the usual
synchronous reads, so pg_pread() should indeed return 0 on EOF on
Windows. Not yet able to figure out how FileRead() managed to return
this error on Windows. But from your symptoms, it does look like
pg_pread() => ReadFile() returned false (despite doing synchronous
reads), so _dosmaperr() got called, did not find the EOF error in
doserrors[], and printed the "unrecognized win32 error code" message.
May have to dig more into this.
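The "unrecognized win32 error code: 38" log line is consistent with how a _dosmaperr()-style translation works. The miniature table below is made up purely to illustrate the mechanism (the real doserrors[] array is much larger), with code 38 standing in for ERROR_HANDLE_EOF:

```c
#include <assert.h>
#include <errno.h>

struct errmap
{
	int		oserr;				/* Windows error code */
	int		posixerr;			/* errno it maps to */
};

/* Made-up miniature of a doserrors[]-style table; code 38
 * (ERROR_HANDLE_EOF) is deliberately absent, as described above. */
static const struct errmap doserrors_mini[] = {
	{2, ENOENT},				/* ERROR_FILE_NOT_FOUND */
	{5, EACCES},				/* ERROR_ACCESS_DENIED */
};

/* Translate an OS error; unknown codes fall through to a default, which
 * is when the "unrecognized win32 error code" log line would appear. */
int
map_oserr(int oserr)
{
	int		n = (int) (sizeof(doserrors_mini) / sizeof(doserrors_mini[0]));

	for (int i = 0; i < n; i++)
		if (doserrors_mini[i].oserr == oserr)
			return doserrors_mini[i].posixerr;
	return EINVAL;				/* "unrecognized win32 error code: %d" */
}
```

So an EOF surfacing through ReadFile() as an error code that the table doesn't know about would produce exactly the log message seen here.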


>
> > If you were debugging this without the patch applied, I suspect that
> > the reason why ReorderBufferSerializeTXN() => OpenTransientFile() is
> > generating this error is because the max limit must be already crossed
> > because of earlier calls to ReorderBufferRestoreChanges().
> >
> > Note that in ReorderBufferSerializeTXN(), OpenTransientFile() is
> > sufficient because the code in that function has made sure the fd gets
> > closed there itself.
> >
>
> Okay, then we might not need it there, but we should at least add a
> comment in ReorderBufferRestoreChanges to explain why we have used a
> different function to operate on the file at that place.

Yeah, that might make sense.

>
> >
> > For the API's that use VFDs (like PathNameOpenFile), the files opened
> > are always recorded in the VfdCache array. So it is not required to do
> > the cleanup at (sub)transaction end, because the kernel fds get closed
> > dynamically in ReleaseLruFiles() whenever they reach max_safe_fds
> > limit. So if a transaction aborts, the fds might remain open, but
> > those will get cleaned up whenever we require more fds, through
> > ReleaseLruFiles(). Whereas, for files opened through
> > OpenTransientFile(), VfdCache is not involved, so this needs
> > transaction end cleanup.
> >
>
> Have you tried by injecting some error?  After getting the error
> mentioned above in email, when I retried the same query, I got the
> below message.
>
> postgres=# SELECT 1 from
> pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
> ERROR:  could not remove file
> "pg_replslot/regression_slot/xid-1693-lsn-0-1800.spill" during
> removal of pg_replslot/regression_slot/xid*: Permission denied
>
> And, then I tried to drop the replication slot and I got below error.
> postgres=# SELECT * FROM pg_drop_replication_slot('regression_slot');
> ERROR:  could not rename file "pg_replslot/regression_slot" to
> "pg_replslot/regression_slot.tmp": Permission denied
>
> It might be something related to Windows

Oh ok, I missed the fact that on Windows we can't delete the files
that are already open, unlike Linux/Unix.
I guess I may have to use the FD_CLOSE_AT_EOXACT flag, or simply use
OpenTemporaryFile(). I wonder though whether this same issue might
come up for the other use-case of PathNameOpenFile():
logical_rewrite_log_mapping().

> but you can once try by
> injecting some error after reading a few files in the code path and
> see the behavior.
Yeah, will check the behaviour, although on Linux, I think I won't get
this error. But yes, like I mentioned above, I think we might have to
arrange for something.



--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-18 Thread Amit Khandekar
On Mon, 18 Nov 2019 at 17:52, Amit Kapila  wrote:
> I have one more question regarding this patch.   It seems to me that
> the files opened via OpenTransientFile or OpenTemporaryFile are
> automatically closed at transaction end(abort), but that doesn't seem
> to be the case for files opened with PathNameOpenFile.  See
> AtEOXact_Files and AtEOSubXact_Files.  So, now with the change
> proposed by this patch, don't we need to deal it in some other way?

For the API's that use VFDs (like PathNameOpenFile), the files opened
are always recorded in the VfdCache array. So it is not required to do
the cleanup at (sub)transaction end, because the kernel fds get closed
dynamically in ReleaseLruFiles() whenever they reach max_safe_fds
limit. So if a transaction aborts, the fds might remain open, but
those will get cleaned up whenever we require more fds, through
ReleaseLruFiles(). Whereas, for files opened through
OpenTransientFile(), VfdCache is not involved, so this needs
transaction end cleanup.
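The LRU behaviour described above can be sketched as follows. This is a deliberately tiny model, not the real fd.c (real VFDs also track file names, flags, and seek positions so a closed kernel fd can be transparently reopened):

```c
#include <assert.h>

#define MAX_SAFE_FDS 2			/* pretend kernel fd limit */
#define NVFDS 5					/* number of virtual descriptors */

/* vfd_lru[i] == 0 means the kernel fd is closed; otherwise it holds the
 * "tick" at which the vfd was last used (larger = more recent). */
static long vfd_lru[NVFDS];
static long tick = 0;
static int	nopen = 0;

/* Close least-recently-used kernel fds until there is room for one more;
 * the virtual descriptors themselves stay valid throughout. */
static void
release_lru_files(void)
{
	while (nopen >= MAX_SAFE_FDS)
	{
		int		victim = -1;

		for (int i = 0; i < NVFDS; i++)
			if (vfd_lru[i] != 0 &&
				(victim < 0 || vfd_lru[i] < vfd_lru[victim]))
				victim = i;
		vfd_lru[victim] = 0;	/* close(kernel fd) in real code */
		nopen--;
	}
}

/* Access a virtual descriptor, transparently (re)opening its kernel fd. */
void
vfd_access(int vfd)
{
	if (vfd_lru[vfd] == 0)
	{
		release_lru_files();	/* make room before (re)opening */
		nopen++;				/* open(path) in real code */
	}
	vfd_lru[vfd] = ++tick;
}

int
open_kernel_fds(void)
{
	return nopen;
}
```

This is why an aborted transaction leaving vfds "open" is not a leak in the kernel-fd sense: the next time the cache needs room, the stale entries are the least recently used and get closed first.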


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-11-18 Thread Amit Khandekar
On Mon, 18 Nov 2019 at 17:20, Amit Kapila  wrote:
> I see that you have made changes in ReorderBufferRestoreChanges to use
> PathNameOpenFile, but not in ReorderBufferSerializeTXN.  Is there a
> reason for the same?  In my test environment, with the test provided
> by you, I got the error (reported in this thread) via
> ReorderBufferSerializeTXN.

You didn't get this error with the patch applied, did you ?

If you were debugging this without the patch applied, I suspect that
the reason why ReorderBufferSerializeTXN() => OpenTransientFile() is
generating this error is because the max limit must be already crossed
because of earlier calls to ReorderBufferRestoreChanges().

Note that in ReorderBufferSerializeTXN(), OpenTransientFile() is
sufficient because the code in that function has made sure the fd gets
closed there itself.

If you are getting this error even with the patch applied, then this
needs investigation.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Minimal logical decoding on standbys

2019-11-07 Thread Amit Khandekar
On Thu, 7 Nov 2019 at 14:02, Rahila Syed  wrote:
>
> Hi Amit,
>
> I am reading about this feature and reviewing it.
> To start with, I reviewed the patch: 
> 0005-Doc-changes-describing-details-about-logical-decodin.patch.

Thanks for picking up the patch review.

Your reply somehow spawned a new mail thread, so I have gone back to
this thread to reply.

>
> >prevent VACUUM from removing required rows from the system catalogs,
> >hot_standby_feedback should be set on the standby. In spite of that,
> >if any required rows get removed on standby, the slot gets dropped.
> IIUC, you mean `if any required rows get removed on *the master* the slot gets
> dropped`, right?

Yes, you are right. In fact, I think it is not necessary to explicitly
mention where the rows get removed. So I have just omitted "on
standby". Will include this change in the next patch versions.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Minimal logical decoding on standbys

2019-11-04 Thread Amit Khandekar
On Thu, 10 Oct 2019 at 05:49, Craig Ringer  wrote:
>
> On Tue, 1 Oct 2019 at 02:08, Robert Haas  wrote:
>>
>>
>> Why does create_logical_slot_on_standby include sleep(1)?
>
>
> Yeah, we really need to avoid sleeps in regression tests.

Yeah, have already got rid of the sleeps from the patch-series version
4 onwards.

By the way, the couple of patches out of the patch series had
bitrotten. Attached is the rebased version.

Thanks
-Amit Khandekar


logicaldecodng_standby_v4_rebased.tar.gz
Description: application/gzip


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-10-03 Thread Amit Khandekar
On Wed, 18 Sep 2019 at 12:24, Amit Khandekar  wrote:
> Probably, for now at least, what everyone seems to agree is to take my
> earlier attached patch forward.
>
> I am going to see if I can add a TAP test for the patch, and will add
> the patch into the commitfest soon.

Attached is an updated patch v2. Has a new test scenario added in
contrib/test_decoding/sql/spill test, and some minor code cleanup.

Going to add this into Nov commitfest.





--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
diff --git a/contrib/test_decoding/expected/spill.out b/contrib/test_decoding/expected/spill.out
index 10734bd..e1d150b6 100644
--- a/contrib/test_decoding/expected/spill.out
+++ b/contrib/test_decoding/expected/spill.out
@@ -247,6 +247,25 @@ GROUP BY 1 ORDER BY 1;
  'serialize-nested-subbig-subbigabort-subbig-3 |  5000 | table public.spill_test: INSERT: data[text]:'serialize-nested-subbig-subbigabort-subbig-3:5001' | table public.spill_test: INSERT: data[text]:'serialize-nested-subbig-subbigabort-subbig-3:1'
 (2 rows)
 
+-- large number of spilling subxacts. Should not exceed maxAllocatedDescs
+BEGIN;
+DO $$
+BEGIN
+FOR i IN 1..600 LOOP
+BEGIN
+INSERT INTO spill_test select generate_series(1, 5000) ;
+EXCEPTION
+when division_by_zero then perform 'dummy';
+END;
+END LOOP;
+END $$;
+COMMIT;
+SELECT 1 from pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
+ ?column? 
+--
+1
+(1 row)
+
 DROP TABLE spill_test;
 SELECT pg_drop_replication_slot('regression_slot');
  pg_drop_replication_slot 
diff --git a/contrib/test_decoding/sql/spill.sql b/contrib/test_decoding/sql/spill.sql
index e638cac..715d1f9 100644
--- a/contrib/test_decoding/sql/spill.sql
+++ b/contrib/test_decoding/sql/spill.sql
@@ -174,6 +174,21 @@ SELECT (regexp_split_to_array(data, ':'))[4] COLLATE "C", COUNT(*), (array_agg(d
 FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL) WHERE data ~ 'INSERT'
 GROUP BY 1 ORDER BY 1;
 
+-- large number of spilling subxacts. Should not exceed maxAllocatedDescs
+BEGIN;
+DO $$
+BEGIN
+FOR i IN 1..600 LOOP
+BEGIN
+INSERT INTO spill_test select generate_series(1, 5000) ;
+EXCEPTION
+when division_by_zero then perform 'dummy';
+END;
+END LOOP;
+END $$;
+COMMIT;
+SELECT 1 from pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1;
+
 DROP TABLE spill_test;
 
 SELECT pg_drop_replication_slot('regression_slot');
diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c
index 8ce28ad..1e5a725 100644
--- a/src/backend/replication/logical/reorderbuffer.c
+++ b/src/backend/replication/logical/reorderbuffer.c
@@ -103,13 +103,21 @@ typedef struct ReorderBufferTupleCidEnt
 	CommandId	combocid;		/* just for debugging */
 } ReorderBufferTupleCidEnt;
 
+/* Virtual file descriptor with file offset tracking */
+typedef struct TXNEntryFile
+{
+	File		vfd;			/* -1 when the file is closed */
+	off_t		curOffset;		/* offset for next write or read. Reset to 0
+ * when vfd is opened. */
+} TXNEntryFile;
+
 /* k-way in-order change iteration support structures */
 typedef struct ReorderBufferIterTXNEntry
 {
 	XLogRecPtr	lsn;
 	ReorderBufferChange *change;
 	ReorderBufferTXN *txn;
-	int			fd;
+	TXNEntryFile file;
 	XLogSegNo	segno;
 } ReorderBufferIterTXNEntry;
 
@@ -194,7 +202,7 @@ static void ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn);
 static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 		 int fd, ReorderBufferChange *change);
 static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn,
-		int *fd, XLogSegNo *segno);
+		TXNEntryFile *file, XLogSegNo *segno);
 static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn,
 	   char *change);
 static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn);
@@ -988,7 +996,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 
 	for (off = 0; off < state->nr_txns; off++)
 	{
-		state->entries[off].fd = -1;
+		state->entries[off].file.vfd = -1;
 		state->entries[off].segno = 0;
 	}
 
@@ -1013,7 +1021,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 		{
 			/* serialize remaining changes */
 			ReorderBufferSerializeTXN(rb, txn);
-			ReorderBufferRestoreChanges(rb, txn, &state->entries[off].fd,
+			ReorderBufferRestoreChanges(rb, txn, &state->entries[off].file,
 		&state->entries[off].segno);
 		}
 
@@ -1043,7 +1051,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn)
 /* serialize remaining changes */
 ReorderBufferSerializeTXN(rb, cur_txn);
 ReorderBufferRestoreChanges(rb, cur_txn,
-			&state->entries[off].fd,
+			&state->entries[off].file,
			&state->entr

Re: Minimal logical decoding on standbys

2019-09-30 Thread Amit Khandekar
On Fri, 27 Sep 2019 at 23:21, Robert Haas  wrote:
>
> On Fri, Sep 27, 2019 at 12:41 PM Amit Khandekar  
> wrote:
> > Preferably I want wait_for_xmins() to get rid of the $node parameter,
> > because we can deduce it using slot name. But that requires having
> > get_node_from_slotname(). Your suggestion was to remove
> > get_node_from_slotname() and add back the $node param so as to reduce
> > duplicate code. Instead, how about keeping  wait_for_xmins() in the
> > PostgresNode.pm() ? This way, we won't have duplication, and also we
> > can get rid of param $node. This is just my preference; if you are
> > quite inclined to not have get_node_from_slotname(), I will go with
> > your suggestion.
>
> I'd be inclined not to have it.  I think having a lookup function to
> go from slot name -> node is strange; it doesn't really simplify
> things that much for the caller, and it makes the logic harder to
> follow. It would break outright if you had the same slot name on
> multiple nodes, which is a perfectly reasonable scenario.

Alright. Attached is the updated patch that splits the file into two
files, one that does only xmin related testing, and the other test
file that tests conflict recovery scenarios, and also one scenario
where drop-database drops the slots on the database on standby.
Removed get_slot_xmins() and get_node_from_slotname().
Renamed 'replica' to 'standby'.
Used node->backup() function instead of pg_basebackup command.
Renamed $master_slot to $master_slotname, similarly for $standby_slot.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


logicaldecodng_standby_v3.tar.gz
Description: GNU Zip compressed data


Re: Minimal logical decoding on standbys

2019-09-27 Thread Amit Khandekar
On Fri, 27 Sep 2019 at 01:57, Robert Haas  wrote:
>
> On Thu, Sep 26, 2019 at 5:14 AM Amit Khandekar  wrote:
> > Actually in some of the conflict-recovery testcases, I am still using
> > wait_for_xmins() so that we could test the xmin values back after we
> > drop the slots. So xmin-related testing is embedded in these recovery
> > tests as well. We can move the wait_for_xmins() function to some
> > common file and then do the split of this file, but then effectively
> > some of the xmin-testing would go into the recovery-related test file,
> > which did not sound sensible to me. What do you say ?
>
> I agree we don't want code duplication, but I think we could reduce
> the code duplication to a pretty small amount with a few cleanups.
>
> I don't think wait_for_xmins() looks very well-designed. It goes to
> trouble of returning a value, but only 2 of the 6 call sites pay
> attention to the returned value.  I think we should change the
> function so that it doesn't return anything and have the callers that
> want a return value call get_slot_xmins() after wait_for_xmins().
Yeah, that can be done.

>
> And then I think we should turn around and get rid of get_slot_xmins()
> altogether. Instead of:
>
> my ($xmin, $catalog_xmin) = get_slot_xmins($master_slot);
> is($xmin, '', "xmin null");
> is($catalog_xmin, '', "catalog_xmin null");
>
> We can write:
>
> my $slot = $node_master->slot($master_slot);
> is($slot->{'xmin'}, '', "xmin null");
> is($slot->{'catalog_xmin'}, '', "catalog xmin null");
>
> ...which is not really any longer or harder to read, but does
> eliminate the need for one function definition.
Agreed.

>
> Then I think we should change wait_for_xmins so that it takes three
> arguments rather than two: $node, $slotname, $check_expr.  With that
> and the previous change, we can get rid of get_node_from_slotname().
>
> At that point, the body of wait_for_xmins() would consist of a single
> call to $node->poll_query_until() or die(), which doesn't seem like
> too much code to duplicate into a new file.

Earlier it used to have 3 params, the same ones you mentioned. I
removed $node for caller convenience.

>
> Looking at it at a bit more, though, I wonder why the recovery
> conflict scenario is even using wait_for_xmins().  It's hard-coded to
> check the state of the master_physical slot, which isn't otherwise
> manipulated by the recovery conflict tests. What's the point of
> testing that a slot which had xmin and catalog_xmin NULL before the
> test started (line 414) and which we haven't changed since still has
> those values at two different points during the test (lines 432, 452)?
> Perhaps I'm missing something here, but it seems like this is just an
> inadvertent entangling of these scenarios with the previous scenarios,
> rather than anything that necessarily needs to be connected together.

In the "Drop slot" test scenario, we are testing that after we
manually drop the slot on standby, the master catalog_xmin should be
back to NULL. Hence, the call to wait_for_xmins().
And in the "Scenario 1 : hot_standby_feedback off", wait_for_xmins()
is called the first time only as a mechanism to ensure that
"hot_standby_feedback = off" has taken effect. At the end of this
test, wait_for_xmins() again is called only to ensure that
hot_standby_feedback = on has taken effect.

Preferably I want wait_for_xmins() to get rid of the $node parameter,
because we can deduce it using slot name. But that requires having
get_node_from_slotname(). Your suggestion was to remove
get_node_from_slotname() and add back the $node param so as to reduce
duplicate code. Instead, how about keeping wait_for_xmins() in
PostgresNode.pm? This way, we won't have duplication, and also we
can get rid of param $node. This is just my preference; if you are
quite inclined to not have get_node_from_slotname(), I will go with
your suggestion.
--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Minimal logical decoding on standbys

2019-09-26 Thread Amit Khandekar
do the split of this file, but then effectively
some of the xmin-testing would go into the recovery-related test file,
which did not sound sensible to me. What do you say ?

Attached patch series has the test changes addressed.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


logicaldecodng_standby_v2.tar.gz
Description: GNU Zip compressed data


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-09-18 Thread Amit Khandekar
On Tue, 17 Sep 2019 at 21:19, Andres Freund  wrote:
> On 2019-09-14 14:34:21 -0400, Tom Lane wrote:
> > Amit Khandekar  writes:
> > > Yeah, something like the attached patch. I think this tracking of
> > > offsets would have been cleaner if we added in-built support in VFD. But
> > > yeah, for back branches at least, we need to handle it outside of VFD.
> > > Or maybe we would add it if we find one more use case.
> >
> > Again, we had that and removed it, for what seem to me to be solid
> > reasons.  It adds cycles when we're forced to close/reopen a file,
> > and it also adds failure modes that we could do without (ie, failure
> > of either the ftell or the lseek, which are particularly nasty because
> > they shouldn't happen according to the VFD abstraction).

OK. So you mean: when the caller calls FileRead() for sequential
reading, underneath the VFD layer would do a pread(), but if pread()
returns an error, the errno could correspond either to the read() or
to the lseek() part of the operation. A failure due to lseek() is not
something the caller expects, because from the caller's point of view
it's just a sequential read. Yeah, makes sense.

>>  I do not
> > think there is going to be any argument strong enough to make us
> > put it back, especially not for non-mainstream callers like logical
> > decoding.

OK. Also, more about putting it back is in the comments below ...

>
> Yea, I think that's the right call. Avoiding kernel seeks is quite
> worthwhile, and we shouldn't undo it just because of this usecase. And
> that'll become more and more important performance-wise (and has already
> done so, with all the intel fixes making syscalls much slower).

By the way, I was not thinking about adding back the read() and
lseek() calls. I was saying we continue to use the pread() call, so
it's just a single system call. FileReadAt(..., offset) would do
pread() with user-supplied offset, and FileRead() would do pread()
using internally tracked offset. So for the user, FileReadAt() is like
pread(), and FileRead() would be like read().
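To make the proposed call semantics concrete, here is a minimal stand-alone sketch. The names TrackedFile, file_read() and file_read_at() are hypothetical stand-ins, not the actual Postgres FileRead()/FileReadAt() implementations; the point is only that both variants collapse to a single pread() system call (no separate lseek()), and only the sequential variant advances an internally tracked offset:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical stand-in for a VFD-style file handle with an offset. */
typedef struct TrackedFile
{
    int     fd;       /* kernel file descriptor */
    off_t   offset;   /* position used by sequential reads */
} TrackedFile;

/* Like pread(): caller supplies the offset; tracked position untouched. */
static ssize_t
file_read_at(TrackedFile *f, void *buf, size_t len, off_t offset)
{
    return pread(f->fd, buf, len, offset);
}

/* Like read(): uses and advances the tracked offset, still via pread(). */
static ssize_t
file_read(TrackedFile *f, void *buf, size_t len)
{
    ssize_t n = pread(f->fd, buf, len, f->offset);

    if (n > 0)
        f->offset += n;
    return n;
}

/* Demo: sequential reads advance the offset, positioned reads don't. */
static int
tracked_read_demo(void)
{
    char tmpl[] = "/tmp/vfd_demo_XXXXXX";
    int fd = mkstemp(tmpl);

    if (fd < 0 || write(fd, "hello world", 11) != 11)
        return -1;

    TrackedFile f = {fd, 0};
    char buf[6] = {0};
    int ok;

    file_read(&f, buf, 5);          /* "hello", offset becomes 5 */
    ok = (memcmp(buf, "hello", 5) == 0 && f.offset == 5);

    file_read_at(&f, buf, 5, 6);    /* "world", offset still 5 */
    ok = ok && memcmp(buf, "world", 5) == 0 && f.offset == 5;

    close(fd);
    unlink(tmpl);
    return ok ? 0 : -1;
}
```

Whether this offset tracking belongs inside the VFD layer or above it is exactly the question being debated; the sketch only illustrates the call semantics.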

But I agree with Tom's objection about having to unnecessarily handle
lseek error codes.

>
> I could see an argument for adding a separate generic layer providing
> position tracking ontop of the VFD abstraction however. Seems quite
> possible that there's some other parts of the system that could benefit
> from using VFDs rather than plain fds. And they'd probably also need the
> positional tracking.

Yeah, that also could be done.

For now at least, what everyone seems to agree on is taking my
earlier attached patch forward.

I am going to see if I can add a TAP test for the patch, and will add
the patch into the commitfest soon.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-09-14 Thread Amit Khandekar
On Fri, 13 Sep 2019 at 22:01, Robert Haas  wrote:
> On Fri, Sep 13, 2019 at 12:14 PM Tom Lane  wrote:
> > Again, though, the advice that's been given here is that we should
> > fix logical decoding to use the VFD API as it stands, not change
> > that API.  I concur with that.
>
> A reasonable position.  So I guess logical decoding has to track the
> file position itself, but perhaps use the VFD layer for managing FD
> pooling.

Yeah, something like the attached patch. I think this tracking of
offsets would have been cleaner if we added in-built support in VFD. But
yeah, for back branches at least, we need to handle it outside of VFD.
Or maybe we would add it if we find one more use case.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


use_vfd_for_logrep.patch
Description: Binary data


Re: logical decoding : exceeded maxAllocatedDescs for .spill files

2019-09-13 Thread Amit Khandekar
On Thu, 12 Sep 2019 at 19:11, Robert Haas  wrote:
>
> On Thu, Sep 12, 2019 at 5:31 AM Tomas Vondra
>  wrote:
> > I don't see how the current API could do that transparently - it does
> > track the files, but the user only gets a file descriptor. With just a
> > file descriptor, how could the code know to do reopen/seek when it's going
> > just through the regular fopen/fclose?
> >
> > Anyway, I agree we need to do something, to fix this corner case (many
> > serialized in-progress transactions). ISTM we have two options - either do
> > something in the context of reorderbuffer.c, or extend the transient file
> > API somehow. I'd say the second option is the right thing going forward,
> > because it does allow doing it transparently and without leaking details
> > about maxAllocatedDescs etc. There are two issues, though - it does
> > require changes / extensions to the API, and it's not backpatchable.
> >
> > So maybe we should start with the localized fix in reorderbuffer, and I
> > agree tracking offset seems reasonable.
>

> We've already got code that knows how to track this sort of thing.

You mean tracking excess kernel fds, right? Yeah, we can use VFDs so
that excess fds are automatically closed. But Alvaro seems to be
talking in the context of tracking the file seek position. VFD does
not have a mechanism to track file offsets when one of the VFD-cached
files is closed and reopened. Robert, are you suggesting adding this
capability to VFD? I agree that we could do it, but for back-patching,
offhand I couldn't think of a simpler way.



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Minimal logical decoding on standbys

2019-09-13 Thread Amit Khandekar
On Mon, 9 Sep 2019 at 16:06, Amit Khandekar  wrote:
>
> On Tue, 3 Sep 2019 at 23:10, Alvaro Herrera  wrote:
> >
> > On 2019-Jul-19, Amit Khandekar wrote:
> >
> > > Attached are the split patches. Included is an additional patch that
> > > has doc changes. Here is what I have added in the docs. Pasting it
> > > here so that all can easily spot how it is supposed to behave, and to
> > > confirm that we are all on the same page :
> >
> > ... Apparently, this patch was not added to the commitfest for some
> > reason; and another patch that *is* in the commitfest has been said to
> > depend on this one (Petr's https://commitfest.postgresql.org/24/1961/
> > which hasn't been updated in quite a while and has received no feedback
> > at all, not even from the listed reviewer Shaun Thomas).  To make
> > matters worse, Amit's patchset no longer applies.
> >
> > What I would like to do is add a link to this thread to CF's 1961 entry
> > and then update all these patches, in order to get things moving.
>
> Hi Alvaro,
>
> Thanks for notifying about this. Will work this week on rebasing this
> patchset and putting it into the 2019-11 commit fest.

Rebased patch set attached.

Added in the Nov commitfest : https://commitfest.postgresql.org/25/2283/



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


logicaldecodng_standby_v1_rebased.tar.gz
Description: GNU Zip compressed data


logical decoding : exceeded maxAllocatedDescs for .spill files

2019-09-11 Thread Amit Khandekar
Hi,

I reproduced the error "exceeded maxAllocatedDescs (492) while trying
to open file ...", which was also discussed in thread [1].
This issue is similar to, but not exactly the same as, [1]. In [1],
the file for which this error showed up was
"pg_logical/mappings/map" , while here it's the .spill file. And
here the issue, in short, seems to be that the .spill file is not
closed promptly, unlike in [1] where there was a file descriptor
leak.

I could reproduce it using a transaction containing a long series of
sub-transactions (it could possibly be reproduced without
sub-transactions, but looking at the code, a script using
sub-transactions was the easiest to come up with):

create table tab(id int);

-- Function that does huge changes in a single transaction
create or replace function f(id int) returns int as
$$
begin
-- Iterate this more than 492 times (the max transient file
-- descriptors PG would typically allow).
-- This will create that many sub-transactions due to the
-- presence of the exception block.
FOR i IN 1..600 LOOP

BEGIN
-- Iterate more than 4096 times (so that changes spill to
-- disk: max_changes_in_memory)
FOR j IN 1..5000 LOOP
   insert into tab values (1);
END LOOP;
EXCEPTION
when division_by_zero then perform 'dummy';
END;

END LOOP;

return id;
end $$ language plpgsql;

SELECT * FROM pg_create_logical_replication_slot('logical', 'test_decoding');

begin;
select f(1); -- Do huge changes in a single transaction
commit;

\! pg_recvlogical -d postgres  --slot=logical --verbose --start -f -

pg_recvlogical: starting log streaming at 0/0 (slot logical)
pg_recvlogical: streaming initiated
pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot logical)
BEGIN 1869
pg_recvlogical: confirming write up to 0/1B6D6E38, flush to 0/1B6D6E38
(slot logical)
pg_recvlogical: error: unexpected termination of replication stream:
ERROR:  exceeded maxAllocatedDescs (492) while trying to open file
"pg_replslot/logical/xid-2362-lsn-0-2400.spill"
pg_recvlogical: disconnected; waiting 5 seconds to try again

Looking at the code, what might be happening is:
ReorderBufferIterTXNInit() => ReorderBufferRestoreChanges() opens the
files, but leaves them open if end of file is not reached; a file only
gets closed once its end is reached. The function returns without
closing the file descriptor if the reorder buffer changes being
restored exceed max_changes_in_memory. Presumably the rest of the
changes get restored later in another ReorderBufferRestoreChanges()
call. But meanwhile, if many such files are left open, we can run out
of the maximum number of files PG allows itself to keep open (it has
some logic that takes into account how many files the system allows to
be open, and how many are already open).

Offhand, what I am thinking is, we need to close the file descriptor
before returning from ReorderBufferRestoreChanges(), and keep track of
the file offset and file path, so that next time we can resume reading
from there.
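A minimal stand-alone sketch of that idea follows. SpillReadState and read_some_and_close() are illustrative names, not the actual reorderbuffer structures: the spill file's descriptor is closed before returning, and the saved path plus offset let the next call resume exactly where the previous one stopped:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <unistd.h>

/* Hypothetical per-spill-file state kept across restore calls. */
typedef struct SpillReadState
{
    char    path[64];   /* which spill file to reopen */
    off_t   offset;     /* where the next call should resume reading */
} SpillReadState;

/* Read up to 'len' bytes, then close the fd before returning, as the
 * proposed ReorderBufferRestoreChanges() change would do. */
static ssize_t
read_some_and_close(SpillReadState *st, char *buf, size_t len)
{
    int fd = open(st->path, O_RDONLY);
    ssize_t n;

    if (fd < 0)
        return -1;
    n = pread(fd, buf, len, st->offset);
    if (n > 0)
        st->offset += n;    /* remember where to resume */
    close(fd);              /* no fd stays cached between calls */
    return n;
}

/* Demo: two calls read consecutive halves of the file. */
static int
spill_resume_demo(void)
{
    SpillReadState st = {"", 0};
    char tmpl[] = "/tmp/spill_demo_XXXXXX";
    int fd = mkstemp(tmpl);
    char buf[5] = {0};
    int ok;

    if (fd < 0 || write(fd, "abcdefgh", 8) != 8)
        return -1;
    close(fd);
    snprintf(st.path, sizeof(st.path), "%s", tmpl);

    read_some_and_close(&st, buf, 4);   /* first batch: "abcd" */
    ok = (memcmp(buf, "abcd", 4) == 0 && st.offset == 4);
    read_some_and_close(&st, buf, 4);   /* resumes at 4: "efgh" */
    ok = ok && memcmp(buf, "efgh", 4) == 0;

    unlink(tmpl);
    return ok ? 0 : -1;
}
```

Opening and closing the file on every batch costs an extra syscall pair per call, but it bounds the number of simultaneously open spill descriptors, which is the point of the fix.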

Comments ?

[1] 
https://www.postgresql.org/message-id/flat/738a590a-2ce5-9394-2bef-7b1caad89b37%402ndquadrant.com

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Minimal logical decoding on standbys

2019-09-09 Thread Amit Khandekar
On Tue, 3 Sep 2019 at 23:10, Alvaro Herrera  wrote:
>
> On 2019-Jul-19, Amit Khandekar wrote:
>
> > Attached are the split patches. Included is an additional patch that
> > has doc changes. Here is what I have added in the docs. Pasting it
> > here so that all can easily spot how it is supposed to behave, and to
> > confirm that we are all on the same page :
>
> ... Apparently, this patch was not added to the commitfest for some
> reason; and another patch that *is* in the commitfest has been said to
> depend on this one (Petr's https://commitfest.postgresql.org/24/1961/
> which hasn't been updated in quite a while and has received no feedback
> at all, not even from the listed reviewer Shaun Thomas).  To make
> matters worse, Amit's patchset no longer applies.
>
> What I would like to do is add a link to this thread to CF's 1961 entry
> and then update all these patches, in order to get things moving.

Hi Alvaro,

Thanks for notifying about this. Will work this week on rebasing this
patchset and putting it into the 2019-11 commit fest.




Re: POC: Cleaning up orphaned files using undo logs

2019-07-26 Thread Amit Khandekar
ssCompleted(uur->uur_txn->urec_progress))
{
if (dbid_exists(uur->uur_txn->urec_dbid))
{
(void) RegisterRollbackReq(InvalidUndoRecPtr,
   undo_recptr,
   uur->uur_txn->urec_dbid,
   uur->uur_fxid);

   pending_abort = true;
}
}
-

+UndoRecordRelease(uur);
+uur = NULL;
+}
.
.
+Assert(uur == NULL);
+
+/* If we reach here, this means there is something to discard. */
+need_discard = true;
+} while (true);

Looks like it is neither necessary to set uur to NULL, nor is it
necessary to have the Assert(uur == NULL). At the start of each
iteration uur is anyway assigned a fresh value, which may or may not
be NULL.
-

+ * over undo logs is complete, new undo can is allowed to be written in the
new undo can is allowed => new undo is allowed

+ * hash table size.  So before start allowing any new transaction to write the
before start allowing => before allowing any new transactions to start
writing the
-

+/* Get the smallest of 'xid having pending undo' and 'oldestXmin' */
+oldestXidHavingUndo = RollbackHTGetOldestFullXid(oldestXidHavingUndo);
+   
+   
+if (FullTransactionIdIsValid(oldestXidHavingUndo))
+pg_atomic_write_u64(>oldestFullXidHavingUnappliedUndo,
+U64FromFullTransactionId(oldestXidHavingUndo));

Is it possible that the FullTransactionId returned by
RollbackHTGetOldestFullXid() could be invalid ? If not, then the if
condition above can be changed to an Assert().
-


+ * If the log is already discarded, then we are done.  It is important
+ * to first check this to ensure that tablespace containing this log
+ * doesn't get dropped concurrently.
+ */
+LWLockAcquire(>mutex, LW_SHARED);
+/*
+ * We don't have to worry about slot recycling and check the logno
+ * here, since we don't care about the identity of this slot, we're
+ * visiting all of them.
I guess it's accidental that the LWLockAcquire() call ended up
*between* the two comments?
---

+if (UndoRecPtrGetCategory(undo_recptr) == UNDO_SHARED)
+{
+/*
+ * For the "shared" category, we only discard when the
+ * rm_undo_status callback tells us we can.
+ */
+status = RmgrTable[uur->uur_rmid].rm_undo_status(uur,
_xid);
status variable could be declared in this block itself.
-


Some variable declaration alignments and comments spacing need changes
as per pgindent.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: POC: Cleaning up orphaned files using undo logs

2019-07-23 Thread Amit Khandekar
On Tue, 23 Jul 2019 at 08:48, Amit Kapila  wrote:
>
> On Mon, Jul 22, 2019 at 8:39 PM Amit Khandekar  wrote:
> >
> > On Mon, 22 Jul 2019 at 14:21, Amit Kapila  wrote:
> >
> > -
> >
> > +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo)
> > +{
> > + /* Block concurrent access. */
> > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
> > +
> > + MyUndoWorker = >workers[slot];
> > Not sure why MyUndoWorker is used here. Can't we use a local variable
> > ? Or do we intentionally attach to the slot as a side-operation ?
> >
> > -
> >
>
> I think here, we can use a local variable as well.  Do you see any
> problem with the current code or do you think it is better to use a
> local variable here?

I think that even though there might not be a correctness issue with
the current code as it stands, we should still use a local variable.
Updating MyUndoWorker is a significant side effect that the caller is
not supposed to be aware of, because all that function should do is
just fetch the slot info.

>
> > --
> >
> > + if (!InsertRequestIntoErrorUndoQueue(urinfo))
> > I was thinking what happens if for some reason
> > InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the
> > entry will not be marked invalid, and so there will be no undo action
> > carried out because I think the undo worker will exit. What happens
> > next with this entry ?
>
> The same entry is present in two queues xid and size, so next time it
> will be executed from the second queue based on it's priority in that
> queue.  However, if it fails again a second time in the same way, then
> we will be in trouble because now the hash table has entry, but none
> of the queues has entry, so none of the workers will attempt to
> execute again.  Also, when discard worker again tries to register it,
> we won't allow adding the entry to queue thinking either some backend
> is executing the same or it must be part of some queue.
>
> The one possibility to deal with this could be that we somehow allow
> discard worker to register it again in the queue or we can do this in
> critical section so that it allows system restart on error.  However,
> the main thing is it possible that InsertRequestIntoErrorUndoQueue
> will fail unless there is some bug in the code?  If so, we might want
> to have an Assert for this rather than handling this condition.

Yes, I also think that the function would error out only because of
can't-happen cases, like "too many locks taken", "out of binary heap
slots", or "out of memory" (the last one is not really a can't-happen
case). Such cases probably arise from bugs. But I was wondering:
generally, when the code errors out with such can't-happen elog()
calls, the worst that happens is that the transaction gets aborted.
Whereas here, the worst that could happen is that the undo action
never gets executed, which means selects for this tuple will keep
accessing the undo log. That does not sound like a data consistency
issue, so we should be fine after all?



Some further review comments for undoworker.c :


+/* Sets the worker's lingering status. */
+static void
+UndoWorkerIsLingering(bool sleep)
The function name sounds like "is the worker lingering ?". Can we
rename it to something like "UndoWorkerSetLingering" ?

-

+ errmsg("undo worker slot %d is empty, cannot attach",
+ slot)));

+ }
+
+ if (MyUndoWorker->proc)
+ {
+ LWLockRelease(UndoWorkerLock);
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("undo worker slot %d is already used by "
+ "another worker, cannot attach", slot)));

These two error messages can have a common error message "could not
attach to worker slot", with errdetail separate for each of them :
slot %d is empty.
slot %d is already used by another worker.

--

+static int
+IsUndoWorkerAvailable(void)
+{
+ int i;
+ int alive_workers = 0;
+
+ LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);

Should have bool return value.

Also, why is it keeping track of number of alive workers ? Sounds like
earlier it used to return number of alive workers ? If it indeed needs
to just return true/false, we can do away with alive_workers.

Also, *exclusive* lock is unnecessary.

--

+if (UndoGetWork(false, false, , NULL) &&
+IsUndoWorkerAvailable())
+UndoWorkerLaunch(urinfo);

There is no lock acquired between IsUndoWorkerAvailable() and
UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable()
returns true, there is a small window where UndoWorkerLaunch() does
not find any worker slot with in_use false.
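The race described here, where the availability check and the launch run under separate lock acquisitions, is the classic check-then-act problem. A minimal sketch of the usual fix is to claim the slot inside the same critical section as the check; WorkerSlot and claim_worker_slot() are illustrative names, not the patch's actual structures, and a pthread mutex stands in for the LWLock:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define N_SLOTS 4

/* Illustrative worker slot: only the in_use flag matters here. */
typedef struct WorkerSlot
{
    bool    in_use;
} WorkerSlot;

static WorkerSlot slots[N_SLOTS];
static pthread_mutex_t slot_lock = PTHREAD_MUTEX_INITIALIZER;

/* Returns the index of a claimed slot, or -1 if none is free. The
 * claim happens while the lock is still held, so no other launcher
 * can grab the slot between the check and the launch. */
static int
claim_worker_slot(void)
{
    int result = -1;
    int i;

    pthread_mutex_lock(&slot_lock);
    for (i = 0; i < N_SLOTS; i++)
    {
        if (!slots[i].in_use)
        {
            slots[i].in_use = true;   /* claim before releasing lock */
            result = i;
            break;
        }
    }
    pthread_mutex_unlock(&slot_lock);
    return result;
}
```

The same shape would apply with LWLocks: the availability check would claim the slot (or return its index) before releasing UndoWorkerLock, rather than having UndoWorkerLaunch() search for a free slot a second time.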

Re: POC: Cleaning up orphaned files using undo logs

2019-07-22 Thread Amit Khandekar
On Fri, 19 Jul 2019 at 17:24, Amit Khandekar  wrote:
>
> On Thu, 9 May 2019 at 12:04, Dilip Kumar  wrote:
>
> > Patches can be applied on top of undo branch [1] commit:
> > (cb777466d008e656f03771cf16ec7ef9d6f2778b)
> >
> > [1] https://github.com/EnterpriseDB/zheap/tree/undo
>
> Below are some review points for 0009-undo-page-consistency-checker.patch :

Another point that I missed :

+* Process the undo record of the page and mask their cid filed.
+*/
+   while (next_record < page_end)
+   {
+   UndoRecordHeader *header = (UndoRecordHeader *) next_record;
+
+   /* If this undo record has cid present, then mask it */
+   if ((header->urec_info & UREC_INFO_CID) != 0)

Here, even though the next record starts in the current page, the
urec_info field itself may or may not lie entirely on this page.

I hope this possibility is also considered when populating the
partial-record-specific details in the page header.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: POC: Cleaning up orphaned files using undo logs

2019-07-22 Thread Amit Khandekar
On Mon, 22 Jul 2019 at 14:21, Amit Kapila  wrote:

I have started review of
0014-Allow-execution-and-discard-of-undo-by-background-wo.patch. Below
are some quick comments to start with:

+++ b/src/backend/access/undo/undoworker.c

+#include "access/xact.h"
+#include "access/undorequest.h"
Order is not alphabetical

+ * Each undo worker then start reading from one of the queue the requests for
start=>starts
queue=>queues

-

+ rc = WaitLatch(MyLatch,
+WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+10L, WAIT_EVENT_BGWORKER_STARTUP);
+
+ /* emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);
I think now, thanks to commit cfdf4dc4fc9635a, you don't have to
explicitly handle postmaster death; instead you can use
WL_EXIT_ON_PM_DEATH. Please check at all such places where this is
done in this patch.

-

+UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo)
+{
+ /* Block concurrent access. */
+ LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
+
+ MyUndoWorker = >workers[slot];
Not sure why MyUndoWorker is used here. Can't we use a local variable
? Or do we intentionally attach to the slot as a side-operation ?

-

+ * Get the dbid where the wroker should connect to and get the worker
wroker=>worker

-

+ BackgroundWorkerInitializeConnectionByOid(urinfo.dbid, 0, 0);
0, 0 => InvalidOid, 0

+ * Set the undo worker request queue from which the undo worker start
+ * looking for a work.
start => should start
a work => work

--

+ if (!InsertRequestIntoErrorUndoQueue(urinfo))
I was thinking what happens if for some reason
InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the
entry will not be marked invalid, and so there will be no undo action
carried out because I think the undo worker will exit. What happens
next with this entry ?




Re: POC: Cleaning up orphaned files using undo logs

2019-07-19 Thread Amit Khandekar
On Thu, 9 May 2019 at 12:04, Dilip Kumar  wrote:

> Patches can be applied on top of undo branch [1] commit:
> (cb777466d008e656f03771cf16ec7ef9d6f2778b)
>
> [1] https://github.com/EnterpriseDB/zheap/tree/undo

Below are some review points for 0009-undo-page-consistency-checker.patch :


+ /* Calculate the size of the partial record. */
+ partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) +
+ phdr->tuple_len + phdr->payload_len -
+ phdr->record_offset;

There is already an UndoPagePartialRecSize() function which calculates
the size of a partial record and seems to do the same as the above. If
they are indeed the same, you can omit the above code, and instead,
down below where you increment next_record, do "next_record +=
UndoPagePartialRecSize()".

Also, I see an extra sizeof(uint16) added in
UndoPagePartialRecSize(). Not sure which one is correct and which one
is wrong, unless I am mistaken in assuming that the above calculation
and the function definition do the same thing.

--

+ * We just want to mask the cid in the undo record header.  So
+ * only if the partial record in the current page include the undo
+ * record header then we need to mask the cid bytes in this page.
+ * Otherwise, directly jump to the next record.
Here, I think you mean "so only if the partial record in the current
page includes the *cid* bytes", rather than "includes the undo record
header".
Maybe we can say:
We just want to mask the cid, so do the partial-record masking only if
the current page includes the cid bytes from the partial record
header.



+ if (phdr->record_offset < (cid_offset + sizeof(CommandId)))
+ {
+char*cid_data;
+Size mask_size;
+
+mask_size = Min(cid_offset - phdr->record_offset,
+sizeof(CommandId));
+
+cid_data = next_record + cid_offset - phdr->record_offset;
+memset(_data, MASK_MARKER, mask_size);
+
Here, if record_offset lies *between* cid start and cid end, then
cid_offset - phdr->record_offset will be negative, and so will
mask_size. Probably abs() would do the job.

Also, an Assert(cid_data + mask_size <= page_end) would be nice. I
know the cid position of a partial record cannot go beyond the page
boundary, but it's better to have this Assert as a sanity check.
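For illustration, here is one way to make that arithmetic robust when the page break falls inside the field: clamp the mask start to the beginning of the on-page portion, rather than relying on abs(). mask_span_on_page() is a hypothetical helper, not code from the patch; record_offset is the number of record bytes already emitted on the previous page, and field_offset/field_size locate the field (e.g. the cid) within the whole record:

```c
#include <assert.h>

/* Returns 1 and sets the byte span to mask on the current page, or 0
 * if the field lies entirely on the previous page. Clamping the start
 * avoids the negative size that plain subtraction produces when the
 * page break falls inside the field. */
static int
mask_span_on_page(int record_offset, int field_offset, int field_size,
                  int *page_off, int *mask_size)
{
    int field_end = field_offset + field_size;
    int start = (field_offset > record_offset) ? field_offset : record_offset;

    if (field_end <= record_offset)
        return 0;               /* nothing of the field on this page */

    *page_off = start - record_offset;  /* offset within current page data */
    *mask_size = field_end - start;     /* never negative after clamping */
    return 1;
}
```

For a 4-byte field at record offset 10: with record_offset = 0 the whole field is masked at page offset 10; with record_offset = 12 (page break inside the field) only the remaining 2 bytes are masked at page offset 0; with record_offset = 20 nothing is masked.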

+ * Process the undo record of the page and mask their cid filed.
filed => field

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Minimal logical decoding on standbys

2019-07-18 Thread Amit Khandekar
On Wed, 10 Jul 2019 at 17:12, Amit Khandekar  wrote:
>
> On Wed, 10 Jul 2019 at 08:44, Andres Freund  wrote:
> >
> > Hi,
> >
> > Thanks for the new version! Looks like we're making progress towards
> > something committable here.
> >
> > I think it'd be good to split the patch into a few pieces. I'd maybe do
> > that like:
> > 1) WAL format changes (plus required other changes)
> > 2) Recovery conflicts with slots
> > 3) logical decoding on standby
> > 4) tests
>
> All right. Will do that in the next patch set. For now, I have quickly
> done the below changes in a single patch again (attached), in order to
> get early comments if any.

Attached are the split patches. Included is an additional patch that
has doc changes. Here is what I have added in the docs. Pasting it
here so that all can easily spot how it is supposed to behave, and to
confirm that we are all on the same page :

"A logical replication slot can also be created on a hot standby. To
prevent VACUUM from removing required rows from the system catalogs,
hot_standby_feedback should be set on the standby. In spite of that,
if any required rows get removed on the standby, the slot gets
dropped. Existing logical slots on the standby also get dropped if
wal_level on the primary is reduced to less than 'logical'.

Creating a logical slot requires building a historic snapshot, for
which information about all currently running transactions is
essential. On the primary, this information is readily available, but
on a standby it has to be obtained from the primary. So slot creation
may need to wait for some activity to happen on the primary. If the
primary is idle, creating a logical slot on the standby may take a
noticeable amount of time."


logicaldecodng_standby.tar.gz
Description: GNU Zip compressed data


Re: Minimal logical decoding on standbys

2019-07-16 Thread Amit Khandekar
On Tue, 16 Jul 2019 at 22:56, Andres Freund  wrote:
>
> Hi,
>
> On 2019-07-12 14:53:21 +0530, tushar wrote:
> > On 07/10/2019 05:12 PM, Amit Khandekar wrote:
> > > All right. Will do that in the next patch set. For now, I have quickly
> > > done the below changes in a single patch again (attached), in order to
> > > get early comments if any.
> > Thanks Amit for your patch. i am able to see 1 issues  on Standby server -
> > (where  logical replication slot created ) ,
> > a)size of  pg_wal folder  is NOT decreasing even after firing get_changes
> > function
>
> Even after calling pg_logical_slot_get_changes() multiple times? What
> does
> SELECT * FROM pg_replication_slots; before and after multiple calls return?
>
> Does manually forcing a checkpoint with CHECKPOINT; first on the primary
> and then the standby "fix" the issue?

I independently tried to reproduce this issue on my machine yesterday.
I observed that sometimes the files get cleaned up after two or more
pg_logical_slot_get_changes() calls, and sometimes I have to restart
the server to see the pg_wal files cleaned up. The behavior is more or
less the same even for a logical slot on the *primary*.

Will investigate further with Tushar.


>
>
> > b)pg_wal files are not recycling  and every time it is creating new files
> > after firing get_changes function
>
> I'm not sure what you mean by this. Are you saying that
> pg_logical_slot_get_changes() causes WAL to be written?
>
> Greetings,
>
> Andres Freund



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Minimal logical decoding on standbys

2019-07-10 Thread Amit Khandekar
;will".

>
>
>
> > @@ -2879,6 +2882,25 @@ RecoveryConflictInterrupt(ProcSignalReason reason)
> >   case PROCSIG_RECOVERY_CONFLICT_LOCK:
> >   case PROCSIG_RECOVERY_CONFLICT_TABLESPACE:
> >   case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT:
> > + case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT:
> > + /*
> > +  * For conflicts that require a logical slot 
> > to be dropped, the
> > +  * requirement is for the signal receiver to 
> > release the slot,
> > +  * so that it could be dropped by the signal 
> > sender. So for
> > +  * normal backends, the transaction should be 
> > aborted, just
> > +  * like for other recovery conflicts. But if 
> > it's walsender on
> > +  * standby, then it has to be killed so as to 
> > release an
> > +  * acquired logical slot.
> > +  */
> > + if (am_cascading_walsender &&
> > + reason == 
> > PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT &&
> > + MyReplicationSlot && 
> > SlotIsLogical(MyReplicationSlot))
> > + {
> > + RecoveryConflictPending = true;
> > + QueryCancelPending = true;
> > + InterruptPending = true;
> > + break;
> > + }
>
> Huh, I'm not following as to why that's needed for walsenders?

For normal backends, we ignore this signal if we aren't in a
transaction (block). But a walsender has no transaction, and we still
cannot ignore the signal, because a walsender spawned by
"pg_recvlogical --start" can keep a logical slot acquired
indefinitely. So the only way to make it release the acquired slot is
to kill it.

>
>
> > @@ -1499,6 +1499,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS)
> > 
> > dbentry->n_conflict_tablespace +
> > dbentry->n_conflict_lock +
> > 
> > dbentry->n_conflict_snapshot +
> > +   
> > dbentry->n_conflict_logicalslot +
> > 
> > dbentry->n_conflict_bufferpin +
> > 
> > dbentry->n_conflict_startup_deadlock);
>
> I think this probably needs adjustments in a few more places,
> e.g. monitoring.sgml...

Oops, yeah, to search for similar additions, I had looked for
"conflict_snapshot" using cscope. I should have done the same using
"git grep".
Done now.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


logical-decoding-on-standby_v12.patch
Description: Binary data


Re: Minimal logical decoding on standbys

2019-07-04 Thread Amit Khandekar
On Thu, 4 Jul 2019 at 17:21, Amit Khandekar  wrote:
>
> On Thu, 4 Jul 2019 at 15:52, tushar  wrote:
> >
> > On 07/01/2019 11:04 AM, Amit Khandekar wrote:
> >
> > Also, in the updated patch (v11), I have added some scenarios that
> > verify that slot is dropped when either master wal_level is
> > insufficient, or when slot is conflicting. Also organized the test
> > file a bit.
> >
> > One scenario where replication slot removed even after fixing the problem 
> > (which Error message suggested to do)
>
> Which specific problem are you referring to? Removing the conflicting
> slot is itself part of the fix for the conflicting-slot problem.
>
> >
> > Please refer this below scenario
> >
> > Master cluster-
> > postgresql,conf file
> > wal_level=logical
> > hot_standby_feedback = on
> > port=5432
> >
> > Standby cluster-
> > postgresql,conf file
> > wal_level=logical
> > hot_standby_feedback = on
> > port=5433
> >
> > both Master/Slave cluster are up and running and are in SYNC with each other
> > Create a logical replication slot on SLAVE ( SELECT * from   
> > pg_create_logical_replication_slot('m', 'test_decoding'); )
> >
> > change wal_level='hot_standby' on Master postgresql.conf file / restart the 
> > server
> > Run get_changes function on Standby -
> > postgres=# select * from pg_logical_slot_get_changes('m',null,null);
> > ERROR:  logical decoding on standby requires wal_level >= logical on master
> >
> > Correct it on Master postgresql.conf file ,i.e set  wal_level='logical'  
> > again / restart the server
> > and again fire  get_changes function on Standby -
> > postgres=# select * from pg_logical_slot_get_changes('m',null,null);
> > ERROR:  replication slot "m" does not exist
> >
> > This looks a little weird, as the slot got dropped/removed internally. I guess
> > it should become invalid rather than being removed automatically.
> > Let users delete the slot themselves rather than having it removed
> > automatically as a surprise.
>
> It was discussed earlier what action should be taken when we find
> conflicting slots. One of the options was to drop the slot, and we chose
> that because it was simple. See this:
> https://www.postgresql.org/message-id/flat/20181212204154.nsxf3gzqv3gesl32%40alap3.anarazel.de

Sorry, the above link is not the one I wanted to refer to. Correct one is this :

https://www.postgresql.org/message-id/20181214005521.jaty2d24lz4nroil%40alap3.anarazel.de

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company




Re: Minimal logical decoding on standbys

2019-07-04 Thread Amit Khandekar
On Thu, 4 Jul 2019 at 15:52, tushar  wrote:
>
> On 07/01/2019 11:04 AM, Amit Khandekar wrote:
>
> Also, in the updated patch (v11), I have added some scenarios that
> verify that slot is dropped when either master wal_level is
> insufficient, or when slot is conflicting. Also organized the test
> file a bit.
>
> One scenario where the replication slot is removed even after fixing the
> problem (which the error message suggested to do):

Which specific problem are you referring to? Removing a conflicting
slot is itself part of the fix for the conflicting-slot problem.

>
> Please refer to the scenario below:
>
> Master cluster-
> postgresql.conf file
> wal_level=logical
> hot_standby_feedback = on
> port=5432
>
> Standby cluster-
> postgresql.conf file
> wal_level=logical
> hot_standby_feedback = on
> port=5433
>
> Both master and standby clusters are up and running and in sync with each other.
> Create a logical replication slot on the standby ( SELECT * FROM
> pg_create_logical_replication_slot('m', 'test_decoding'); )
>
> Change wal_level='hot_standby' in the master's postgresql.conf file and
> restart the server.
> Run the get_changes function on the standby:
> postgres=# select * from pg_logical_slot_get_changes('m',null,null);
> ERROR:  logical decoding on standby requires wal_level >= logical on master
>
> Correct it in the master's postgresql.conf file, i.e. set wal_level='logical'
> again, restart the server, and fire the get_changes function on the standby
> again:
> postgres=# select * from pg_logical_slot_get_changes('m',null,null);
> ERROR:  replication slot "m" does not exist
>
> This looks a little weird, as the slot got dropped/removed internally. I guess
> it should become invalid rather than be removed automatically. Let users delete
> the slot themselves rather than having it removed as a surprise.

It was discussed earlier what action should be taken when we find
conflicting slots. One of the options was to drop the slot, and we chose
that because it was simple. See this:
https://www.postgresql.org/message-id/flat/20181212204154.nsxf3gzqv3gesl32%40alap3.anarazel.de

By the way, you are getting the "logical decoding on standby requires
wal_level >= logical on master" error while using the slot because we
reject the command even before checking the existence of the slot.
Actually, the slot has already been dropped due to the master's
wal_level. Then, when you correct the master's wal_level, the command is
not rejected; it proceeds to use the slot, and only then is it found
that the slot does not exist.
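
To see that the slot really is gone (as opposed to merely being marked
invalid), you can check pg_replication_slots on the standby before retrying:

```sql
-- On the standby: the conflicting slot no longer appears.
SELECT slot_name, slot_type, active
FROM pg_replication_slots
WHERE slot_name = 'm';
-- (0 rows)
```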


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company



