Re: Speeding up GIST index creation for tsvectors
On Sat, 20 Mar 2021 at 02:19, John Naylor wrote:
> On Fri, Mar 19, 2021 at 8:57 AM Amit Khandekar wrote:
> > Regarding the alignment changes... I have removed the code that
> > handled the leading identically unaligned bytes, for lack of evidence
> > that the percentage of such cases is significant. Like I noted earlier,
> > for the tsearch data I used, identically unaligned cases were only 6%.
> > If I find scenarios where these cases can be significant after all, and
> > if we cannot do anything in the gist index code, then we might have to
> > bring back the unaligned byte handling. I didn't get a chance to dig
> > deeper into the gist index implementation to see why they are not
> > always 8-byte aligned.
>
> I find it stranger that something equivalent to char* is not randomly
> misaligned, but rather only seems to land on 4-byte boundaries.
>
> [thinks] I'm guessing it's because of VARHDRSZ, but I'm not positive.
>
> FWIW, I anticipate some push back from the community because of the fact
> that the optimization relies on statistical phenomena.

I dug into this issue for the tsvector type, and found that it's the way
the sign array elements are arranged that causes the pointers to be
misaligned:

Datum
gtsvector_picksplit(PG_FUNCTION_ARGS)
{
	...
	cache = (CACHESIGN *) palloc(sizeof(CACHESIGN) * (maxoff + 2));
	cache_sign = palloc(siglen * (maxoff + 2));

	for (j = 0; j < maxoff + 2; j++)
		cache[j].sign = &cache_sign[siglen * j];
	...
}

If siglen is not a multiple of 8 (say 700), cache[j].sign will in some
cases point to non-8-byte-aligned addresses, as you can see in the above
code snippet. Replacing siglen with MAXALIGN64(siglen) in the above
snippet gets rid of the misalignment. This change applied over the
0001-v3 patch gives an additional ~15% benefit. MAXALIGN64(siglen) will
use a bit more space, but for not-so-small siglens this looks worth
doing. Haven't yet checked types other than tsvector.

Will get back on your other review comments.
Meanwhile, I thought I could post the above update first.
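To illustrate the arithmetic behind the fix above: with siglen = 700, successive sign pointers alternate between 8-byte and 4-byte boundaries, and rounding the stride up with MAXALIGN64 restores alignment. Here is a small self-contained sketch; MAXALIGN64 is a local stand-in for PostgreSQL's macro, and sign_offset() is a hypothetical helper, not code from the patch:

```c
#include <stdint.h>
#include <stddef.h>

/* Stand-in for PostgreSQL's MAXALIGN64: round up to an 8-byte boundary */
#define MAXALIGN64(LEN) (((uint64_t) (LEN) + 7) & ~(uint64_t) 7)

/*
 * With siglen = 700, element j starts at byte offset siglen * j, so every
 * odd j lands on a 4-byte (not 8-byte) boundary.  Rounding the stride up
 * to a multiple of 8 keeps every element 8-byte aligned.
 */
static size_t
sign_offset(size_t siglen, size_t j, int use_maxalign)
{
	size_t		stride = use_maxalign ? MAXALIGN64(siglen) : siglen;

	return stride * j;
}
```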
Re: speed up verifying UTF-8
On Sat, 17 Jul 2021 at 04:48, John Naylor wrote:
> v17-0001 is the same as v14. 0002 is a stripped-down implementation of
> Amit's chunk idea for multibyte, and it's pretty good on x86. On Power8,
> not so much. 0003 and 0004 are shot-in-the-dark guesses to improve it on
> Power8, with some success, but they end up making x86 weirdly slow, so
> I'm afraid that could happen on other platforms as well.

Thanks for trying the chunk approach. I tested your v17 versions on
Arm64. For the Chinese characters, v17-0002 gave some improvement over
v14. But for all the other character sets, there was around 10%
degradation w.r.t. v14.

I thought maybe the hton64 call and memcpy() for each multibyte
character might be the culprit, so I tried iterating over all the
characters in the chunk within the same pg_utf8_verify_one() function by
left-shifting the bits. But that worsened the figures, so I gave up on
that idea.

Here are the numbers on Arm64:

HEAD:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
    1781 |  1095 |   628 |     944 |   1151

v14:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     852 |   484 |   144 |     584 |    971

v17-0001+2:
 chinese | mixed | ascii | mixed16 | mixed8
---------+-------+-------+---------+--------
     731 |   520 |   152 |     645 |   1118

Haven't looked at your v18 patch set yet.
Re: speed up verifying UTF-8
On Tue, 13 Jul 2021 at 01:15, John Naylor wrote:
> > It seems like it would be easy to have pg_utf8_verify_one in my
> > proposed pg_utf8.h header and replace the body of pg_utf8_verifychar
> > with it.
>
> 0001: I went ahead and tried this for v15, and also attempted some
> clean-up:
>
> - Rename pg_utf8_verify_one to pg_utf8_verifychar_internal.
> - Have pg_utf8_verifychar_internal return -1 for invalid input to match
>   other functions in the file. We could also do this for check_ascii,
>   but it's not quite the same thing, because the string could still have
>   valid bytes in it, just not enough to advance the pointer by the
>   stride length.
> - Remove hard-coded numbers (not wedded to this).
> - Use a call to pg_utf8_verifychar in the slow path.
> - Reduce pg_utf8_verifychar to a thin wrapper around
>   pg_utf8_verifychar_internal.

- check_ascii() seems to be used only for 64-bit chunks. So why not
remove the len argument and the len <= sizeof(int64) checks inside the
function? We could rename it to check_ascii64() for clarity.

- I was thinking, why not have a pg_utf8_verify64() that processes
64-bit chunks (or a 32-bit version)? In check_ascii(), we anyway extract
a 64-bit chunk from the string. We can use the same chunk to extract the
required bits from a two-byte char or a four-byte char. This way we can
avoid extracting separate bytes like b1 = *s; b2 = s[1]; etc. More
importantly, we can avoid the separate continuation-char checks for each
individual byte. Additionally, we can try to simplify the subsequent
overlong and surrogate char checks. Something like this:

int
pg_utf8_verifychar_32(uint32 chunk)
{
	int			len, l;

	for (len = sizeof(chunk); len > 0; (len -= l), (chunk = chunk << (l * 8)))
	{
		/* Is 2-byte lead */
		if ((chunk & 0xE0000000) == 0xC0000000)
		{
			l = 2;
			/* ... */
		}
		/* Is 3-byte lead */
		else if ((chunk & 0xF0000000) == 0xE0000000)
		{
			l = 3;
			if (len < l)
				break;

			/* b2 and b3 should be continuation bytes */
			if ((chunk & 0x00C0C000) != 0x00808000)
				return sizeof(chunk) - len;

			switch (chunk & 0xFF200000)
			{
				/*
				 * Check 3-byte overlong: 1110.0000 100x.xxxx 10xx.xxxx,
				 * i.e. (b1 == 0xE0 && b2 < 0xA0). We already know b2 is
				 * of the form 10xx.xxxx since it's a continuation char.
				 * The additional condition b2 <= 0x9F means it is of the
				 * form 100x.xxxx, i.e. either 1000.xxxx or 1001.xxxx. So
				 * just verify that it is xx0x.xxxx.
				 */
				case 0xE0000000:
					return sizeof(chunk) - len;

				/*
				 * Check surrogate: 1110.1101 101x.xxxx 10xx.xxxx,
				 * i.e. (b1 == 0xED && b2 > 0x9F). Here, > 0x9F means
				 * either 1010.xxxx, 1011.xxxx, 1100.xxxx or 1110.xxxx.
				 * The last two are not possible because b2 is a
				 * continuation char, so it has to be one of the first
				 * two. So just verify that it is xx1x.xxxx.
				 */
				case 0xED200000:
					return sizeof(chunk) - len;

				default:
					;
			}
		}
		/* Is 4-byte lead */
		else if ((chunk & 0xF8000000) == 0xF0000000)
		{
			/* ... */
			l = 4;
		}
		else
			return sizeof(chunk) - len;
	}
	return sizeof(chunk) - len;
}
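The first suggestion above, a check_ascii64() that works on whole 64-bit chunks with no len argument, could look something like this minimal sketch. The name and the memcpy-based load are illustrative; the real check_ascii in the patch also tracks zero bytes and stride details that this sketch ignores:

```c
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical check_ascii64(): returns 1 iff all 8 bytes of the chunk
 * are ASCII, i.e. the high bit is clear in every byte.  memcpy() gives an
 * alignment-safe load, as the string pointer need not be 8-byte aligned.
 */
static int
check_ascii64(const unsigned char *s)
{
	uint64_t	chunk;

	memcpy(&chunk, s, sizeof(chunk));
	return (chunk & UINT64_C(0x8080808080808080)) == 0;
}
```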
Relaxing the sub-transaction restriction in parallel query
Hi,

Currently we don't allow a sub-transaction to be spawned from inside a
parallel worker (or from a leader that is in parallel mode). This
imposes a restriction that pl/pgsql functions that use an exception
block can't be marked parallel safe, even when the exception block is
there only to catch trivial errors such as divide-by-zero. I tried to
look at the implications of removing this sub-transaction restriction,
and came up with the attached WIP patch after considering the points
below. I may have missed other points, or may have assumed something
wrong, so comments are welcome.

- TBLOCK_PARALLEL_INPROGRESS

Now that there can be an in-progress sub-transaction in a parallel
worker, the sub-transaction states need to be accounted for. Rather than
adding new transaction states such as TBLOCK_PARALLEL_SUBINPROGRESS, I
removed the existing TBLOCK_PARALLEL_INPROGRESS from the code. At a
couple of places is_parallel_worker is set if the state is
TBLOCK_PARALLEL_INPROGRESS. Instead, for now, I have used
(ParallelCurrentXids != NULL) to identify whether it's a worker in a
valid transaction state. Maybe we can improve on this.

In EndTransactionBlock(), a fatal error is thrown if it's a parallel
worker with a transaction block in progress. This seems to be a
cannot-happen case, so I have removed the check. I need to think further
about how we can retain it.

- IsInParallelMode()

On HEAD, the parallel worker cannot have any sub-transactions, so
CurrentTransactionState always points to the TopTransactionStateData.
And when ParallelWorkerMain() calls EnterParallelMode(), from that point
onwards IsInParallelMode() always returns true. But with the patch,
CurrentTransactionState can point to some nesting level below, and in
that TransactionState, parallelModeLevel would be 0, so
IsInParallelMode() will return false in a sub-transaction, unless some
other function happens to explicitly call EnterParallelMode().
One option for making IsInParallelMode() always return true for a worker
is to just check whether the worker is in a transaction
(ParallelCurrentXids != NULL), or else check only
TopTransactionData->parallelModeLevel. Still another option is for the
new TransactionState to inherit parallelModeLevel from its parent. I
chose this option. It avoids additional conditions in IsInParallelMode()
specifically for workers.

Does this inherit-parent-parallelModeLevel option affect the leader
code? The functions calling EnterParallelMode() are:
begin_parallel_vacuum, _bt_begin_parallel, ParallelWorkerMain,
CommitTransaction, and ExecutePlan(). After entering parallel mode, it
does not look like a subtransaction will be spawned at these places. If
at all it does, on HEAD, IsInParallelMode() will return false, which
does not sound right: for all the child transactions, this function
should return true. So w.r.t. this, inheriting the parent's
parallelModeLevel in fact looks better.

Operations that are not allowed to run in a worker would continue to be
disallowed in a worker sub-transaction as well, e.g. assigning a new
xid, heap_insert, etc. These places already use IsInParallelMode(),
which takes care of guarding against such operations.

Just for archival ... ExecutePlan() is called with
use_parallel_mode=true when a Gather plan was created, in which case it
enters parallel mode. From here, if the backend happens to start a new
subtransaction for some reason, it does sound right for parallel mode to
be true for this sub-transaction, although I am not sure whether there
can be such a case. In a worker, as far as I understand, ExecutePlan()
always gets called with use_parallel_mode=false, because there is no
Gather plan in the worker, so it does not enter parallel mode. But
because the worker is already in parallel mode, that does not matter.

- List of ParallelContexts (pcxt_list):

A ParallelContext is created only in backends.
So there are no implications of the pcxt_list w.r.t. parallel workers
spawning a subtransaction, because pcxt_list will always be empty in
workers.

- ParallelCurrentXids:

A parallel worker always maintains a global flat sorted list of xids
representing all the xids that are considered current (i.e. the ones
that are returned by TransactionIdIsCurrentTransactionId() in the
leader). So this global list should continue to work no matter what the
sub-transaction nest level is, since no new xids will be created in the
worker.

- Savepoints:

Haven't considered savepoints; the restriction is retained for them.

Thanks
-Amit Khandekar
Huawei Technologies

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 441445927e..c46e32ba37 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -160,7 +160,6 @@ typedef enum TBlockState
 	TBLOCK_BEGIN,				/* starting transaction block */
 	TBLOCK_INPROGRESS,			/* live transaction */
 	TBLOCK_IMPLICIT_INPROGRESS, /* live transaction after implicit BEGIN */
-	TBLOCK_PARALLEL_INPROGR
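The inherit-parent-parallelModeLevel option described above can be modelled in miniature like this. The struct and functions are simplified stand-ins for TransactionStateData and PushTransaction() in src/backend/access/transam/xact.c, not the real code:

```c
#include <stdlib.h>

/* Simplified stand-in for xact.c's TransactionStateData */
typedef struct TransactionStateData
{
	int			nestingLevel;
	int			parallelModeLevel;
	struct TransactionStateData *parent;
} TransactionStateData;

/*
 * On subtransaction start, copy the parent's parallelModeLevel so that
 * IsInParallelMode() stays true inside a worker's subtransaction.
 */
static TransactionStateData *
push_transaction(TransactionStateData *parent)
{
	TransactionStateData *s = calloc(1, sizeof(*s));

	s->parent = parent;
	s->nestingLevel = parent->nestingLevel + 1;
	s->parallelModeLevel = parent->parallelModeLevel;	/* the proposed change */
	return s;
}

static int
is_in_parallel_mode(const TransactionStateData *cur)
{
	return cur->parallelModeLevel != 0;
}
```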
Re: Result Cache node shows per-worker info even for workers not launched
On Wed, 28 Apr 2021 at 13:54, David Rowley wrote:
> So, given that I removed the parallel test in partition_prune.sql, and
> don't have any EXPLAIN ANALYZE output for parallel tests in
> resultcache.sql, it should be safe enough to put that cache_misses ==
> 0 test back into explain.c
>
> I've attached a patch to do this. The explain.c part is pretty similar
> to your patch; I just took my original code and comment.

Sounds good. And thanks for the cleanup patch, and the brief history.
The patch looks OK to me.
Re: Result Cache node shows per-worker info even for workers not launched
On Wed, 28 Apr 2021 at 16:14, David Rowley wrote:
> However, I did add 1 test that sets work_mem down to 64kB to ensure
> the eviction code does get some exercise. You'll notice that I pass
> "true" to explain_resultcache() to hide the hits and misses there. We
> can't test the exact number of hits/misses/evictions there, but we can
> at least tell apart zero and non-zero by how I coded
> explain_resultcache() to replace the value with Zero or N depending on
> whether the number was zero or above zero.

Thanks for the explanation. I did realize, after replying to Bharath
upthread, that I was wrong in assuming that the cache misses and cache
hits are always stable for non-parallel scans.
Re: Result Cache node shows per-worker info even for workers not launched
On Wed, 28 Apr 2021 at 15:08, Bharath Rupireddy wrote:
> On Wed, Apr 28, 2021 at 1:54 PM David Rowley wrote:
> > I plan to push this in the next 24 hours or so.
>
> I happen to see explain_resultcache in resultcache.sql; it seems like
> two of the tests still have numbers for cache hits and misses - Hits:
> 980 Misses: 20. Won't these make the tests unstable? Will these numbers
> be the same across machines? Or is it that no buildfarm member has
> caught these? The comment below says that the hits and misses are not
> the same across machines:
>
> -- Ensure we get some evictions. We're unable to validate the hits and
> -- misses here as the number of entries that fit in the cache at once
> -- will vary between different machines.
>
> Should we remove the hide_hitmiss parameter in explain_resultcache and
> always print N for non-zero and Zero for 0 hits and misses? This
> clearly shows whether we have zero or non-zero hits or misses.
>
> Am I missing something?

I believe the assumption here is that with no workers involved, the
query is guaranteed to produce exactly the same cache misses and hits
everywhere.
Result Cache node shows per-worker info even for workers not launched
Hi,

If planned parallel workers do not get launched, the Result Cache plan
node shows all-zero stats for each of those workers:

tpch=# set max_parallel_workers TO 0;
SET
tpch=# explain analyze select avg(l_discount) from orders, lineitem
       where l_orderkey = o_orderkey and o_orderdate < date '1995-03-09'
       and l_shipdate > date '1995-03-09';
                                    QUERY PLAN
-----------------------------------------------------------------------------------
 Finalize Aggregate  (cost=315012.87..315012.88 rows=1 width=32) (actual time=27533.482..27533.598 rows=1 loops=1)
   ->  Gather  (cost=315012.44..315012.85 rows=4 width=32) (actual time=27533.471..27533.587 rows=1 loops=1)
         Workers Planned: 4
         Workers Launched: 0
         ->  Partial Aggregate  (cost=314012.44..314012.45 rows=1 width=32) (actual time=27533.177..27533.178 rows=1 loops=1)
               ->  Nested Loop  (cost=0.44..309046.68 rows=1986303 width=4) (actual time=0.400..27390.835 rows=748912 loops=1)
                     ->  Parallel Seq Scan on lineitem  (cost=0.00..154513.66 rows=4120499 width=12) (actual time=0.044..7910.399 rows=16243662 loops=1)
                           Filter: (l_shipdate > '1995-03-09'::date)
                           Rows Removed by Filter: 13756133
                     ->  Result Cache  (cost=0.44..0.53 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=16243662)
                           Cache Key: lineitem.l_orderkey
                           Hits: 12085749  Misses: 4157913  Evictions: 3256424  Overflows: 0  Memory Usage: 65537kB
                           Worker 0:  Hits: 0  Misses: 0  Evictions: 0  Overflows: 0  Memory Usage: 0kB
                           Worker 1:  Hits: 0  Misses: 0  Evictions: 0  Overflows: 0  Memory Usage: 0kB
                           Worker 2:  Hits: 0  Misses: 0  Evictions: 0  Overflows: 0  Memory Usage: 0kB
                           Worker 3:  Hits: 0  Misses: 0  Evictions: 0  Overflows: 0  Memory Usage: 0kB
                           ->  Index Scan using orders_pkey on orders  (cost=0.43..0.52 rows=1 width=4) (actual time=0.002..0.002 rows=0 loops=4157913)
                                 Index Cond: (o_orderkey = lineitem.l_orderkey)
                                 Filter: (o_orderdate < '1995-03-09'::date)
                                 Rows Removed by Filter: 1
 Planning Time: 0.211 ms
 Execution Time: 27553.477 ms
(22 rows)

By looking at the other cases like show_sort_info() or the printing of
per-worker JIT info, I could see that the general policy is that we skip
printing info for workers that are not launched. Attached is a patch to
do the same for Result Cache.

I was earlier thinking about using (instrument[n].nloops == 0) to check
for not-launched workers. But we are already using
"if (rcstate->stats.cache_misses == 0)" for the leader process, so for
consistency I used the same method for workers.

--
Thanks,
-Amit Khandekar
Huawei Technologies

skip_notlaunched_workers.patch
Description: Binary data
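The worker-skipping check described above can be sketched in miniature like this. The struct and function names are illustrative stand-ins for the Result Cache instrumentation handled in explain.c, not the real PostgreSQL definitions:

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative stand-in for the per-worker Result Cache instrumentation */
typedef struct ResultCacheInstrumentation
{
	uint64_t	cache_hits;
	uint64_t	cache_misses;
	uint64_t	cache_evictions;
} ResultCacheInstrumentation;

/*
 * Print one stats line per worker, but skip workers whose cache_misses
 * count is zero: as in the patch description, a worker that never ran the
 * node cannot have populated the cache, so its stats are all-zero noise.
 * Returns the number of worker lines actually printed.
 */
static int
show_worker_stats(const ResultCacheInstrumentation *si, int num_workers)
{
	int			shown = 0;

	for (int n = 0; n < num_workers; n++)
	{
		if (si[n].cache_misses == 0)
			continue;			/* worker not launched: skip its line */
		printf("  Worker %d:  Hits: %llu  Misses: %llu  Evictions: %llu\n",
			   n,
			   (unsigned long long) si[n].cache_hits,
			   (unsigned long long) si[n].cache_misses,
			   (unsigned long long) si[n].cache_evictions);
		shown++;
	}
	return shown;
}
```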
Re: Speeding up GIST index creation for tsvectors
On Thu, 11 Mar 2021 at 04:25, John Naylor wrote:
> Okay, so it's hard-coded in various functions in contrib modules. In
> that case, let's just keep the existing coding for those. In fact, the
> comments that got removed by your patch specifically say: /* Using the
> popcount functions here isn't likely to win */ ...which your testing
> confirmed. The new function should be used only for Gist and possibly
> Intarray, whose default is 124. That means we can't get rid of
> hemdistsign(), but that's fine.

The comment is there for all types. Since I get better performance on
all the types, I have kept the pg_xorcount() call for all these contrib
modules. I understand that since the default siglen is small for some
types, we won't get a benefit there; but I think we should call
pg_xorcount() for the benefit of the non-default siglen case.

Have replaced hemdistsign() with pg_xorcount() everywhere; but with
that, the call looks a bit clumsy because of having to cast the
parameters to const unsigned char *. I noticed that the cast to
"unsigned char *" is required only when we use the value in the
pg_number_of_ones array; elsewhere we can safely use 'a' and 'b' as
"char *". So I changed the pg_xorcount() parameters from unsigned
char * to char *.

> I think the way to go is a simplified version of your 0001 (not 0002),
> with only a single function, for gist and intarray only, and a style
> that better matches the surrounding code. If you look at my xor
> functions in the attached text file, you'll get an idea of what it
> should look like. Note that it got the above performance without ever
> trying to massage the pointer alignment. I'm a bit uncomfortable with
> the fact that we can't rely on alignment, but maybe there's a simple
> fix somewhere in the gist code.

In the attached 0001-v3 patch, I have updated the code to match the
surrounding code; specifically the usage of *a++ rather than a[i].

Regarding the alignment changes...
I have removed the code that handled the leading identically unaligned
bytes, for lack of evidence that the percentage of such cases is
significant. Like I noted earlier, for the tsearch data I used,
identically unaligned cases were only 6%. If I find scenarios where
these cases can be significant after all, and if we cannot do anything
in the gist index code, then we might have to bring back the unaligned
byte handling. I didn't get a chance to dig deeper into the gist index
implementation to see why they are not always 8-byte aligned.

I have kept the 0002 patch as-is. Due to the significant *additional*
speedup over and above the 0001 improvement, I think the code
re-arrangement done there is worth it for non-x86 platforms.

-Amit Khandekar

From c6ae575dd483b85cec6748c2e014d0f32565b4eb Mon Sep 17 00:00:00 2001
From: Amit Khandekar
Date: Fri, 19 Mar 2021 20:22:44 +0800
Subject: [PATCH 1/2] Speed up xor'ing of two gist index signatures for
 tsvectors

In hemdistsign(), rather than using the xor operator on char values, use
it in 64-bit chunks. And since the chunks are 64-bit, use popcount64()
on each of the chunks. I have checked that the two bitvector pointer
arguments of hemdistsign() are not always 64-bit aligned, so do the
64-bit chunks only if both pointers are 8-byte aligned.

This results in a speed-up in Gist index creation for tsvectors. With
the default siglen (124), the speed-up is 12-20%. With siglen=700, it is
30-50%. So with longer signature lengths, we get a higher percentage
speed-up. Similar results are seen for other types using a gist index,
such as intarray, hstore, and ltree, that are available in contrib.

With smaller siglens such as 10 or 20, there is a reduction in speed of
1-7% if we use this optimization. It's probably because of the extra
function call for pg_xorcount(), and might also be due to the extra
logic in pg_xorcount(). So for siglen less than 32, keep the existing
method using byte-by-byte traversal.
---
 contrib/hstore/hstore_gist.c      | 17 ++--
 contrib/intarray/_intbig_gist.c   | 18 +---
 contrib/ltree/_ltree_gist.c       | 19 ++---
 src/backend/utils/adt/tsgistidx.c | 26 +--
 src/include/port/pg_bitutils.h    | 17 ----
 src/port/pg_bitutils.c            | 34 +++
 6 files changed, 61 insertions(+), 70 deletions(-)

diff --git a/contrib/hstore/hstore_gist.c b/contrib/hstore/hstore_gist.c
index 102c9cea72..4970a0a2f0 100644
--- a/contrib/hstore/hstore_gist.c
+++ b/contrib/hstore/hstore_gist.c
@@ -8,6 +8,7 @@
 #include "access/stratnum.h"
 #include "catalog/pg_type.h"
 #include "hstore.h"
+#include "port/pg_bitutils.h"
 #include "utils/pg_crc.h"
 
 /* gist_hstore_ops opclass options */
@@ -256,20 +257,6 @@ sizebitvec(BITVECP sign, int siglen)
 	return size;
 }
 
-static int
-hemdistsign(BITVECP a, BITVECP b, int siglen)
-{
-	int			i,
-				dist = 0;
-
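The chunked XOR-plus-popcount loop that the commit message above describes can be sketched as follows. This is an approximation, not the patch's code: it uses memcpy-based loads and __builtin_popcountll in place of the patch's alignment check and pg_popcount64, and a plain byte loop for the tail:

```c
#include <stdint.h>
#include <string.h>

/*
 * Count the 1-bits in (a XOR b) over 'bytes' bytes, processing 64-bit
 * chunks where possible.  memcpy() keeps the loads alignment-safe, since
 * the signature pointers are not guaranteed to be 8-byte aligned.
 */
static uint64_t
xorcount_sketch(const char *a, const char *b, int bytes)
{
	uint64_t	count = 0;

	while (bytes >= (int) sizeof(uint64_t))
	{
		uint64_t	ca,
					cb;

		memcpy(&ca, a, sizeof(ca));
		memcpy(&cb, b, sizeof(cb));
		count += __builtin_popcountll(ca ^ cb);
		a += sizeof(uint64_t);
		b += sizeof(uint64_t);
		bytes -= sizeof(uint64_t);
	}

	/* Remaining tail, byte by byte */
	while (bytes-- > 0)
		count += __builtin_popcount((unsigned char) (*a++ ^ *b++));

	return count;
}
```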
Re: [POC] verifying UTF-8 using SIMD instructions
On Tue, 9 Mar 2021 at 17:14, John Naylor wrote:
> On Tue, Mar 9, 2021 at 5:00 AM Amit Khandekar wrote:
> > Just a quick question before I move on to review the patch ... The
> > improvement looks like it is only meant for x86 platforms.
>
> Actually it's meant to be faster for all platforms, since the C
> fallback is quite a bit different from HEAD. I've found it to be faster
> on ppc64le. An earlier version of the patch was a loser on 32-bit Arm
> because of alignment issues, but if you could run the test script
> attached to [1] on 64-bit Arm, I'd be curious to see how it does on
> 0002, and whether 0003 and 0004 make things better or worse. If there
> is trouble building on non-x86 platforms, I'd want to fix that also.

On my Arm64 VM:

HEAD:
 mixed | ascii
-------+-------
  1091 |   628
(1 row)

PATCHED:
 mixed | ascii
-------+-------
   681 |   119

So the fallback function does show improvements on Arm64. I guess that
if we use the equivalent Arm NEON intrinsics, the "mixed" figures will
be close to the "ascii" figures, going by your figures on x86.

> > Can this be done in a portable way by arranging for
> > auto-vectorization? Something like commit 88709176236caf. This way it
> > would benefit other platforms as well.
>
> I'm fairly certain that the author of a compiler capable of doing that
> in this case would be eligible for some kind of AI prize. :-)

:)

I was not thinking about auto-vectorizing the code in
pg_validate_utf8_sse42(). Rather, I was considering auto-vectorization
inside the individual helper functions that you wrote, such as
_mm_setr_epi8(), shift_right(), bitwise_and(), prev1(), splat(),
saturating_sub(), etc. I myself am not sure whether it is feasible to
write code that auto-vectorizes all these function definitions.
saturating_sub() seems hard, but I could see the gcc docs mentioning
support for generating such instructions for a particular code loop. But
for the index lookup function it seems impossible to generate the needed
index lookup intrinsics.
We could have platform-specific function definitions for such
exceptional cases. I am considering this only because it would make the
same code work on other platforms like arm64 and ppc, and we won't have
to have platform-specific files. But I understand that this is easier
said than done. We would have to process the loop in
pg_validate_utf8_sse42() in 128-bit chunks and pass each chunk to the
individual functions, which could mean extra work and extra copies in
extracting the chunk data and passing it around, which may make things
drastically slow. You are passing around the chunks using the __m128i
type, so perhaps that means passing around just a reference to the SIMD
registers. I am not sure.

--
Thanks,
-Amit Khandekar
Huawei Technologies
Re: [POC] verifying UTF-8 using SIMD instructions
Hi,

Just a quick question before I move on to review the patch ... The
improvement looks like it is only meant for x86 platforms. Can this be
done in a portable way by arranging for auto-vectorization? Something
like commit 88709176236caf. This way it would benefit other platforms as
well.

I tried to compile the following code using -O3, and the assembly does
have vectorized instructions:

#include <stdio.h>

int main()
{
	int			i;
	char		s1[200] = "abcdewhruerhetr";
	char		s2[200] = "oweurietiureuhtrethre";
	char		s3[200] = {0};

	for (i = 0; i < sizeof(s1); i++)
		s3[i] = s1[i] ^ s2[i];

	printf("%s\n", s3);
}
Re: Speeding up GIST index creation for tsvectors
On Wed, 3 Mar 2021 at 23:32, John Naylor wrote:
> Your performance numbers look like this is a fruitful area to improve.
> I have not yet tested performance, but I will do so at a later date.

Thanks for reviewing the patch!

> I did some microbenchmarking of our popcount implementation, since I
> wasn't quite sure it's optimal, and indeed, there is room for
> improvement there [1]. I'd be curious to hear your thoughts on those
> findings. I think it'd be worth it to test a version of this patch
> using those idioms here as well, so at some point I plan to post
> something.

I am not yet clear about the implications of that work on this patch set
here, but I am still going over it, and will reply to that.

> Now for the patches:
>
> 0001:
>
> + /*
> +  * We can process 64-bit chunks only if both are mis-aligned by the
> +  * same number of bytes.
> +  */
> + if (b_aligned - b == a_aligned - a)
>
> The obvious question here is: how often are they identically
> misaligned? You don't indicate that your measurements differ in a
> bimodal fashion, so does that mean they happen to be always
> (mis)aligned?

I ran CREATE INDEX on tsvector columns using the tsearch.data that I had
attached upthread, with some instrumentation. Here are the proportions:

1. In 15% of the cases, only one among a and b was aligned. The other
   was offset from the 8-byte boundary by 4 bytes.
2. In 6% of the cases, both were offset by 4 bytes, i.e. identically
   misaligned.
3. In the rest of the cases, both were aligned.

With other types, and with different sets of data, I believe I could get
totally different proportions.

> Is that an accident of the gist coding and could change unexpectedly?
> And how hard would it be to allocate those values upstream so that the
> pointers are always aligned on 8-byte boundaries? (I imagine pretty
> hard, since there are multiple callers, and such tight coupling is not
> good style)

That I am not sure about; I haven't clearly understood the structure of
gist indexes yet.
I believe it would depend on the individual gist implementation, and we
can't assume anything about that?

> + /* For smaller lengths, do simple byte-by-byte traversal */
> + if (bytes <= 32)
>
> You noted upthread:
>
> > Since for the gist index creation of some of these types the default
> > value for siglen is small (8-20), I tested with small siglens. For
> > siglens <= 20, particularly for values that are not multiples of 8
> > (e.g. 10, 13, etc), I see a 1-7 % reduction in speed of index
> > creation. It's probably because of an extra function call for
> > pg_xorcount(); and also might be due to the extra logic in
> > pg_xorcount() which becomes prominent for shorter traversals. So for
> > siglen less than 32, I kept the existing method using byte-by-byte
> > traversal.
>
> I wonder if that can be addressed directly, while cleaning up the loops
> and alignment checks in pg_xorcount_long() a little. For example, have
> a look at pg_crc32c_armv8.c -- it might be worth testing a similar
> approach.

Yeah, we could put the bytes <= 32 condition inside pg_xorcount_long().
I avoided that so as not to hamper the <= 32 scenarios; details are
explained below, under why the inline pg_xorcount() calls a global
function.

> Also, pardon my ignorance, but where can I find the default siglen for
> various types?

Check SIGLEN_DEFAULT.

> +/* Count the number of 1-bits in the result of xor operation */
> +extern uint64 pg_xorcount_long(const unsigned char *a,
> +                               const unsigned char *b, int bytes);
> +static inline uint64 pg_xorcount(const unsigned char *a,
> +                                 const unsigned char *b, int bytes)
>
> I don't think it makes sense to have a static inline function call a
> global function.
The idea was: use the function call only for cases where we know that the function call overhead will be masked by the popcount() optimization. > > -static int > +static inline int > hemdistsign(BITVECP a, BITVECP b, int siglen) > > Not sure what the reason is for inlining all these callers. > Come to think of it, I don't see a reason to have hemdistsign() > at all anymore. All it does is call pg_xorcount(). I suspect that's > what Andrey Borodin meant when he said upthread: > > > > > Meanwhile there are at least 4 incarnation of hemdistsign() > > > > functions that are quite similar. I'd propose to refactor them > > > > somehow... I had something in mind when I finally decided to not remove hemdistsign(). Now I think you are right, we can remove hemdistsign() altogether. Let me check again. > 0002: > > I'm not really happy with this patch. I like the idea of keeping indirect > calls out of non-x86 platforms, but I believe it could be done more simply. I am open for
Re: Speeding up GIST index creation for tsvectors
I have added this one in the March commitfest. https://commitfest.postgresql.org/32/3023/
Re: Speeding up GIST index creation for tsvectors
On Tue, 15 Dec 2020 at 20:34, Amit Khandekar wrote:
> On Sun, 13 Dec 2020 at 9:28 PM, Andrey Borodin wrote:
> > +1
> > This will make all INSERTs and UPDATEs for tsvector's GiSTs faster.
>
> Oh, I didn't realize that this code is getting used in GIST index
> insertion and creation too. Will check there.

I ran some insert and update tests; they show only marginal improvement.
So it looks like the patch mainly improves index builds.

> > Meanwhile there are at least 4 incarnations of hemdistsign()
> > functions that are quite similar. I'd propose to refactor them
> > somehow...
>
> Yes, I hope we get the benefit there also. Before that, I thought I
> should post the first use-case to get some early comments. Thanks for
> your encouraging comments :)

The attached v2 version of the 0001 patch extends the hemdistsign()
changes to the other use cases like intarray, ltree and hstore. I see
the same index build improvement for all these types.

Since for the gist index creation of some of these types the default
value for siglen is small (8-20), I tested with small siglens. For
siglens <= 20, particularly for values that are not multiples of 8
(e.g. 10, 13, etc.), I see a 1-7% reduction in speed of index creation.
It's probably because of the extra function call for pg_xorcount(), and
might also be due to the extra logic in pg_xorcount(), which becomes
prominent for shorter traversals. So for siglen less than 32, I kept the
existing method using byte-by-byte traversal.

--
Thanks,
-Amit Khandekar
Huawei Technologies

From b8905b47f7e0af8d4558fdac73cc2283c263cbcc Mon Sep 17 00:00:00 2001
From: Amit Khandekar
Date: Thu, 10 Dec 2020 21:45:49 +0800
Subject: [PATCH 2/2] Avoid function pointer dereferencing for
 pg_popcount32/64() call.

With USE_POPCNT_ASM set (which is true only on x86), we decide with the
help of __get_cpuid() at runtime whether the CPU supports the popcnt
instruction, and hence we dynamically choose the corresponding function;
thus pg_popcount32/64 is a function pointer.
But for platforms with USE_POPCNT_ASM not set, we don't have to use
dynamic assignment of functions, yet we were still declaring
pg_popcount64 as a function pointer. So arrange for a direct function
call, so as to get rid of the function pointer dereference each time
pg_popcount32/64 is called.

To do this, define pg_popcount64 to another function name
(pg_popcount64_nonasm) rather than a function pointer, whenever
USE_POPCNT_ASM is not defined. And let pg_popcount64_nonasm() be a
static inline function, so that whenever pg_popcount64() is called,
__builtin_popcount() gets called directly. For platforms not supporting
__builtin_popcount(), continue using the slow version, as is the current
behaviour. pg_popcount64_slow() was actually a misnomer, because it was
calling __builtin_popcount(), which is not a slow function. Now with
this commit, pg_popcount64_slow() indeed has only slow code.

Tested this on ARM64: the gist index creation for tsvectors speeds up by
~6% for the default siglen (=124), and by 12% with siglen=700.
---
 src/include/port/pg_bitutils.h | 59 ++++++++++++++++++
 src/port/pg_bitutils.c         | 47 +++++++++++---
 2 files changed, 67 insertions(+), 39 deletions(-)

diff --git a/src/include/port/pg_bitutils.h b/src/include/port/pg_bitutils.h
index 174df28e66..b039f94ee5 100644
--- a/src/include/port/pg_bitutils.h
+++ b/src/include/port/pg_bitutils.h
@@ -23,6 +23,19 @@ extern const uint8 pg_rightmost_one_pos[256];
 extern const uint8 pg_number_of_ones[256];
 #endif
 
+/*
+ * On x86_64, we can use the hardware popcount instruction, but only if
+ * we can verify that the CPU supports it via the cpuid instruction.
+ *
+ * Otherwise, we fall back to __builtin_popcount if the compiler has
+ * that, or a hand-rolled implementation if not.
+ */ +#ifdef HAVE_X86_64_POPCNTQ +#if defined(HAVE__GET_CPUID) || defined(HAVE__CPUID) +#define USE_POPCNT_ASM 1 +#endif +#endif + /* * pg_leftmost_one_pos32 * Returns the position of the most significant set bit in "word", @@ -208,8 +221,54 @@ pg_ceil_log2_64(uint64 num) } /* Count the number of one-bits in a uint32 or uint64 */ + +/* + * With USE_POPCNT_ASM set (which is true only on x86), we decide at runtime + * whether the CPU supports popcnt instruction, and hence we dynamically choose + * the corresponding function; thus pg_popcount32/64 is a function pointer. But + * for platforms with USE_POPCNT_ASM not set, we don't have to use dynamic + * assignment of functions, so we arrange for direct function call so as to get + * rid of function pointer dereferencing each time pg_popcount32/64 is called. + */ +#ifdef USE_POPCNT_ASM extern int (*pg_popcount32) (uint32 word); extern int (*pg_popcount64) (uint64 word); +#else +#define pg_popcount32 pg_popcount32_nonasm +#define pg_popcount64 pg_popcount64_nonasm +#endif + +/* Slow versions of pg_popcount */ +#ifndef HAVE__BUILTIN_P
Re: Speeding up GIST index creation for tsvectors
On Sun, 13 Dec 2020 at 9:28 PM, Andrey Borodin wrote: > +1 > This will make all INSERTs and UPDATES for tsvector's GiSTs. Oh, I didn't realize that this code is getting used in GIST index insertion and creation too. Will check there. > Also I really like idea of taking advantage of hardware capabilities like > __builtin_* etc wherever possible. Yes. Also, __builtin_popcount() uses SIMD vectorization (on arm64 : "cnt v0.8b, v0.8b"), hence there's all the more reason to use it. Over and above that, I had thought that if we could auto-vectorize the byte-by-byte xor operation and the popcount() call using compiler optimizations, we would benefit from this, but I didn't see any more improvement. I hoped for the benefit because that would have allowed us to process in 128-bit or 256-bit chunks, since the vector registers are at least that long. Maybe gcc is not smart enough to translate __builtin_popcount() into 128/256-bit vectorized instructions. But for the XOR operator, it does translate to 128-bit vectorized instructions (on arm64 : "eor v2.16b, v2.16b, v18.16b") > Meanwhile there are at least 4 incarnation of hemdistsign() functions that > are quite similar. I'd propose to refactor them somehow... Yes, I hope we get the benefit there also. Before that, I thought I should post the first use-case to get some early comments. Thanks for your encouraging comments :)
Re: Speeding up GIST index creation for tsvectors
On Thu, 10 Dec 2020 at 20:43, Pavel Borisov wrote: > > Hi, Amit! > It's really cool to hear about another GiST improvement proposal. I'd like to > connect recently committed GiST ordered build discussion here [1] and further > improvement proposed [2] > > I've tested feature [1] and got 2.5-3 times speed improvement which is much > better I believe. Yeah, I am completely new to the GIST stuff, but I had taken a quick look at the sortsupport feature for GIST, and found it very interesting. I believe it's an additional option for making the gist index builds much faster. But then I thought that my small patch would still be worthwhile because for tsvector types the non-sort method for index build would continue to be used by users, and in general we can extend this small optimization for other gist types also. > There is an ongoing activity [2] to build support for different data types > for GiST. Maybe you will consider it interesting to join. > > BTW you may have heard about Gin and Rum [3] indexes which suit text search > much, much better (and faster) than GiST. The idea to process data in bigger > chunks is good. Still optimize index structure, minimizing disc pages access, > etc. seems better in many cases. Sure. Thanks for the pointers. -- Thanks, -Amit Khandekar Huawei Technologies
Re: Improving spin-lock implementation on ARM.
On Sat, 5 Dec 2020 at 02:55, Alexander Korotkov wrote: > > On Wed, Dec 2, 2020 at 6:58 AM Krunal Bauskar wrote: > > Let me know what do you think about this analysis and any specific > > direction that we should consider to help move forward. > > BTW, it would be also nice to benchmark my lwlock patch on the > Kunpeng. I'm very optimistic about this patch, but it wouldn't be > fair to completely throw it away. It still might be useful for > LSE-disabled builds. Coincidentally, I was also looking at some hotspot locations around LWLockAcquire() and LWLockAttempt() for read-only workloads on both arm64 and x86, and had narrowed them down to the place where pg_atomic_compare_exchange_u32() is called. So it's likely we are working on the same hotspot area. When I get a chance, I do plan to look at your patch while I try some optimizations myself. This is unrelated to the optimization of this mail thread, though, so it will need a different mail thread. -- Thanks, -Amit Khandekar Huawei Technologies
Re: Improving spin-lock implementation on ARM.
On Tue, 1 Dec 2020 at 15:33, Krunal Bauskar wrote: > What I meant was outline-atomics support was added in GCC-9.4 and was made > default in gcc-10. > LSE support is present for quite some time. FWIW, here is an earlier discussion on the same (also added the proposal author here) : https://www.postgresql.org/message-id/flat/099F69EE-51D3-4214-934A-1F28C0A1A7A7%40amazon.com
Re: Improving spin-lock implementation on ARM.
On Thu, 26 Nov 2020 at 10:55, Krunal Bauskar wrote:
> Hardware: ARM Kunpeng 920 BareMetal Server 2.6 GHz. 64 cores (56 cores for
> server and 8 for client) [2 numa nodes]
> Storage: 3.2 TB NVMe SSD
> OS: CentOS Linux release 7.6
> PGSQL: baseline = Release Tag 13.1
> Invocation suite:
> https://github.com/mysqlonarm/benchmark-suites/tree/master/pgsql-pbench (Uses
> pgbench)

Using the same hardware, attached are my improvement figures, which are pretty much in line with your figures. Except that I did not run for more than 400 clients. And I am getting some improvement even for select-only workloads, in the case of 200-400 clients. For the read-write load, I had seen that the s_lock() contention was caused when XLogFlush() uses the spinlock. But for the read-only case, I have not analyzed where the improvement occurred. The .png files in the attached tar have the graphs for head versus patch.

The GUCs that I changed :
work_mem = 64MB
shared_buffers = 128GB
maintenance_work_mem = 1GB
min_wal_size = 20GB
max_wal_size = 100GB
checkpoint_timeout = 60min
checkpoint_completion_target = 0.9
full_page_writes = on
synchronous_commit = on
effective_io_concurrency = 200
log_checkpoints = on

For backends, 64 CPUs were allotted (covering 2 NUMA nodes), and for pgbench clients a separate set of 28 CPUs was allotted on a different socket. The server was pre-warmed.
Re: Redundant tuple copy in tqueueReceiveSlot()
On Thu, 17 Sep 2020 at 08:55, Andres Freund wrote:
>
> Hi,
>
> On 2020-09-17 14:20:50 +1200, Thomas Munro wrote:
> > I wonder if there is a way we could make "Minimal Tuples but with the
> > length travelling separately (and perhaps chopped off?)" into a
> > first-class concept... It's also a shame to be schlepping a bunch of
> > padding bytes around.

Yeah, I think we can pass the "length" data separately, but since the receiver end already assumes that the received data is a minimal tuple, I thought why not skip passing this redundant component. But anyway, if you and Andres are suggesting that being able to skip the copy is important for virtual tuples as well, then I think the approach you suggested (supplying an allocated memory to the tuple API for conversion) would be one of the better options with us, if not the only good option. Maybe I will try looking into how shm_mq works to see if we can come up with a good solution.

> > There really is no justification for having MinimalTuples, as we have
> > them today at least, anymore. We used to rely on being able to construct
> > pointers to MinimalTuples that are mostly compatible with HeapTuple. But
> > I think there's none of those left since PG 12.

Ah ok.

> > I think it'd make a bit more sense to do some steps towards having a
> > more suitable "minimal" tuple representation, rather than doing this
> > local, pretty ugly, hacks. A good way would be to just starting to
> > remove the padding, unnecessary fields etc from MinimalTuple.

So there are two things we wish to do :

1. Prevent an extra tuple-forming step before sending minimal tuple data. Possibly devise a shm_mq API to get memory to write a tuple of a given length, and devise something like FormMinimalTupleDataInHere(memory_allocated_by_shm_mq) which will write minimal tuple data.

2.
Shrink the MinimalTupleData structure, because it no longer needs the current padding etc., and substitute this new MinimalTuple structure for the current one all over the code wherever it is currently used.

If we remove the unnecessary fields from the tuple data being sent to the Gather node, then we need to again form a MinimalTuple at the receiving end, which again adds an extra tuple-forming step. So I understand that's the reason why you are saying we should shrink the MinimalTupleData structure itself, in which case we will continue to use the received new MinimalTupleData as an already-formed tuple, as we do now.

Now, the above two things (1. and 2.) look independent to me. Suppose we first do 1., i.e. we come up with a good way to form an in-place MinimalTuple at the sender's end, without any change to the MinimalTupleData. Then when we do 2., i.e. shrink the MinimalTupleData, we won't require any change in the in-place-tuple-forming API we wrote in 1. Just the existing underlying function heap_form_minimal_tuple() or something similar might need to be changed. At least that's what I feel right now.

> I also think that it'd be good to look at a few of the other places that
> are heavily bottlenecked by MinimalTuple overhead before designing new
> API around this. IIRC it's e.g. very easy to see hash joins spending a
> lot of time doing MinimalTuple copies & conversions.

Yeah, makes sense. The above FormMinimalTupleDataInHere() should be able to be used for these other places as well. Will keep that in mind.

> > Thomas, I guess you had a different approach in mind when you said
> > about "returning either success or
> > hey-that-buffer's-too-small-I-need-N-bytes". But what looks clear to
>
> Yeah I tried some things like that, but I wasn't satisfied with any of
> them; basically the extra work involved in negotiating the size was a
> bit too high.

Hmm, ok. Let me see if there is any way around this.
>> On the other hand, because my interface was "please > > write a MinimalTuple here!", it had the option to *form* a > > MinimalTuple directly in place, whereas your approach can only avoid > > creating and destroying a temporary tuple when the source is a heap > > tuple. True. > > There's a lot of cases where the source is a virtual slot (since we'll > often project stuff below Gather). So it'd be quite advantageous to > avoid building an unnecessary HeapTuple in those cases. Yeah right.
Re: calling procedures is slow and consumes extra much memory against calling function
On Thu, 17 Sep 2020 at 11:07, Michael Paquier wrote: > > On Thu, Jul 16, 2020 at 09:08:09PM +0200, Pavel Stehule wrote: > > I am sending another patch that tries to allow CachedPlans for CALL > > statements. I think this patch is very accurate, but it is not nice, > > because it is smudging very precious reference counting for CachedPlans. > > Amit, you are registered as a reviewer of this patch for two months > now. Are you planning to look at it more? If you are not planning to > do so, that's fine, but it may be better to remove your name as > reviewer then. Thanks, Michael, for the reminder. I *had* actually planned to do some more review, but I think I might not get bandwidth for this one during this month, so I have removed my name. But I have kept my name as a reviewer for bitmaps and correlation: https://commitfest.postgresql.org/29/2310/ since I do plan to do some review on that one. Thanks, -Amit Khandekar Huawei Technologies
Redundant tuple copy in tqueueReceiveSlot()
Hi, I am starting a new thread that continues with the following point that was discussed in [1] On Fri, 17 Jul 2020 at 09:05, Thomas Munro wrote: > > On Sun, Jul 12, 2020 at 7:25 AM Soumyadeep Chakraborty > wrote: > > Do you mean that we should have an implementation for > > get_minimal_tuple() for the heap AM and have it return a pointer to the > > minimal tuple from the MINIMAL_TUPLE_OFFSET? And then a caller such as > > tqueueReceiveSlot() will ensure that the heap tuple from which it wants > > to extract the minimal tuple was allocated in the tuple queue in the > > first place? If we consider that the node directly below a gather is a > > SeqScan, then we could possibly, in ExecInitSeqScan() set-up the > > ss_ScanTupleSlot to point to memory in the shared tuple queue? > > Similarly, other ExecInit*() methods can do the same for other executor > > nodes that involve parallelism? Of course, things would be slightly > > different for > > the other use cases you mentioned (such as hash table population) > > What I mean is that where ExecHashTableInsert() and > tqueueReceiveSlot() do ExecFetchSlotMinimalTuple(), you usually get a > freshly allocated copy, and then you copy that again, and free it. > There may be something similar going on in tuplestore and sort code. > Perhaps we could have something like > ExecFetchSlotMinimalTupleInPlace(slot, output_buffer, > output_buffer_size) that returns a value that indicates either success > or hey-that-buffer's-too-small-I-need-N-bytes, or something like that. > That silly extra copy is something Andres pointed out to me in some > perf results involving TPCH hash joins, a couple of years ago. I went ahead and tried doing this. I chose an approach where we can return the pointer to the in-place minimal tuple data if it's a heap/buffer/minimal tuple slot. A new function ExecFetchSlotMinimalTupleData() returns in-place minimal tuple data. If it's neither heap, buffer or minimal tuple, it returns a copy as usual. 
The receiver should not assume the data is directly taken from MinimalTupleData, so it should set its t_len to the number of bytes returned. Patch attached (0001-Avoid-redundant-tuple-copy-while-sending-tuples-to-G.patch)

Thomas, I guess you had a different approach in mind when you talked about "returning either success or hey-that-buffer's-too-small-I-need-N-bytes". But what looks clear to me is that avoiding the copy shows a consistent improvement of 4 to 10% for simple parallel table scans. I tried my patch on both x86_64 and arm64, and observed this speedup on both.

create table tab as select generate_series(1, 2000) id, 'abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyz' v;
select pg_prewarm('tab'::regclass);
explain analyze select * from tab where id % 2 = 0;

Times in milli-secs :
HEAD    : 1833.119 1816.522 1875.648 1872.153 1834.383
Patched : 1763.786 1721.160 1714.665 1719.738 1704.478

This was with the default 2 parallel workers. With 3 or 4 workers, for the above testcase I didn't see a noticeable difference. I think, if I double the number of rows, the difference will be noticeable. In any case, the gain would keep reducing with the number of workers, because the tuple copy also gets parallelized. In some scenarios, parallel_leader_participation=off causes the difference to amplify. Haven't had a chance to see if this helps any of the TPC-H queries.

Also attached is a patch guc_for_testing.patch that I used for testing the gain. This patch is only for testing. Without it, comparing the performance figures requires a server restart, and the figures anyway shift back and forth by 5-15 percent after each restart, which creates a lot of noise when comparing figures with and without the fix. Use the GUC enable_fix to enable/disable the fix.
[1] https://www.postgresql.org/message-id/CA%2BhUKGLrN2M18-hACEJbNoj2sn_WoUj9rkkBeoPK7SY427pAnA%40mail.gmail.com -- Thanks, -Amit Khandekar Huawei Technologies From 5d19626e35e50a5630e5f1a042f7ecee6acb7c70 Mon Sep 17 00:00:00 2001 From: Amit Khandekar Date: Wed, 9 Sep 2020 12:03:01 +0800 Subject: [PATCH 2/2] Add guc for ease of testing speedup. This is only for testing the performance gain. Otherwise, to compare the performance figures, it requires server restart, and the figures anyway shift back and forth by 5-15 percent after each restart, which creates lot of noise when comparing figures with and without fix. With this, we can easily see at least 4-10% difference in execution times by setting/unsetting the GUC enable_fix. --- src/backend/executor/tqueue.c | 12 src/backend/utils/misc/guc.c | 12 2 files changed, 24 insertions(+) diff --git a/src/backend/executor/tqueue.c b/src/backend/executor/tqueue.c index c70ee0f39a..7dd60b5c7e 100644 --- a/src/backend/executor/tqueue.c +++ b/src/backend/executor/tqueue.c @@
Re: Auto-vectorization speeds up multiplication of large-precision numerics
On Tue, 8 Sep 2020 at 02:19, Tom Lane wrote: > > I wrote: > > I experimented with a few different ideas such as adding restrict > > decoration to the pointers, and eventually found that what works > > is to write the loop termination condition as "i2 < limit" > > rather than "i2 <= limit". It took me a long time to think of > > trying that, because it seemed ridiculously stupid. But it works. Ah ok. I checked the "Auto-Vectorization in LLVM" link that you shared. All the examples use "< n" or "> n". None of them use "<= n". Looks like a hidden restriction. > > I've done more testing and confirmed that both gcc and clang can > vectorize the improved loop on aarch64 as well as x86_64. (clang's > results can be confusing because -ftree-vectorize doesn't seem to > have any effect: its vectorizer is on by default. But if you use > -fno-vectorize it'll go back to the old, slower code.) > > The only buildfarm effect I've noticed is that locust and > prairiedog, which are using nearly the same ancient gcc version, > complain > > c1: warning: -ftree-vectorize enables strict aliasing. -fno-strict-aliasing > is ignored when Auto Vectorization is used. > > which is expected (they say the same for checksum.c), but then > there are a bunch of > > warning: dereferencing type-punned pointer will break strict-aliasing rules > > which seems worrisome. (This sort of thing is the reason I'm > hesitant to apply higher optimization levels across the board.) > Both animals pass the regression tests anyway, but if any other > compilers treat -ftree-vectorize as an excuse to apply stricter > optimization assumptions, we could be in for trouble. > > I looked closer and saw that all of those warnings are about > init_var(), and this change makes them go away: > > -#define init_var(v)MemSetAligned(v, 0, sizeof(NumericVar)) > +#define init_var(v)memset(v, 0, sizeof(NumericVar)) > > I'm a little inclined to commit that as future-proofing. 
> It's essentially reversing out a micro-optimization I made in d72f6c750.
> I doubt I had hard evidence that it made any noticeable difference;
> and even if it did back then, modern compilers probably prefer the
> memset approach.

Thanks. I must admit it did not occur to me that I could very well have installed clang on my Linux machine and tried compiling this file, or tested with some older gcc versions. I think I was using gcc 8. Do you know which gcc compiler version gave these warnings?

-- Thanks, -Amit Khandekar Huawei Technologies
Re: Auto-vectorization speeds up multiplication of large-precision numerics
On Mon, 7 Sep 2020 at 11:23, Tom Lane wrote: > > I wrote: > > I made some cosmetic changes to this and committed it. Thanks! > > BTW, poking at this further, it seems that the patch only really > works for gcc. clang accepts the -ftree-vectorize switch, but > looking at the generated asm shows that it does nothing useful. > Which is odd, because clang does do loop vectorization. > > I tried adding -Rpass-analysis=loop-vectorize and got > > numeric.c:8341:3: remark: loop not vectorized: could not determine number of > loop iterations [-Rpass-analysis=loop-vectorize] > for (i2 = 0; i2 <= i; i2++) Hmm, yeah that's unfortunate. My guess is that the compiler would do vectorization only if 'i' is a constant, which is not true for our case. -- Thanks, -Amit Khandekar Huawei Technologies
Re: Re: [HACKERS] Custom compression methods
On Thu, 13 Aug 2020 at 17:18, Dilip Kumar wrote:
> I have rebased the patch on the latest head and currently, broken into 3
> parts.
>
> v1-0001: As suggested by Robert, it provides the syntax support for
> setting the compression method for a column while creating a table and
> adding columns. However, we don't support changing the compression
> method for the existing column. As part of this patch, there is only
> one built-in compression method that can be set (pglz). In this, we
> have one in-build am (pglz) and the compressed attributes will directly
> store the oid of the AM. In this patch, I have removed the
> pg_attr_compresion as we don't support changing the compression
> for the existing column so we don't need to preserve the old
> compressions.
> v1-0002: Add another built-in compression method (zlib)
> v1:0003: Remaining patch set (nothing is changed except rebase on the
> current head, stabilizing check-world and 0001 and 0002 are pulled
> out of this)
>
> Next, I will be working on separating out the remaining patches as per
> the suggestion by Robert.

Thanks for this new feature. It looks promising and very useful, with so many good compression libraries already available.

I see that with the patch set, I would be able to create an extension that defines a PostgreSQL C handler function which assigns all the required hook function implementations for compressing, decompressing and validating, etc. In short, I would be able to use a completely different compression algorithm to compress toast data if I write such an extension. Correct me if I am wrong with my interpretation.

Just a quick superficial set of review comments:

- A minor rebase is required due to a conflict in a regression test.

- In heap_toast_insert_or_update() and in other places, the comments for the new parameter preserved_am_info are missing.
-
+toast_compress_datum(Datum value, Oid acoid)
 {
 	struct varlena *tmp = NULL;
 	int32		valsize;
-	CompressionAmOptions cmoptions;
+	CompressionAmOptions *cmoptions = NULL;

I think tmp and cmoptions need not be initialized to NULL

-
-	TOAST_COMPRESS_SET_RAWSIZE(tmp, valsize);
-	SET_VARSIZE_COMPRESSED(tmp, len + TOAST_COMPRESS_HDRSZ);
 	/* successful compression */
+	toast_set_compressed_datum_info(tmp, amoid, valsize);
 	return PointerGetDatum(tmp);

Any particular reason why this code is put in a new extern function? Is there a plan to re-use it? Otherwise, it's not necessary to do this.

Also, not sure why "HTAB *amoptions_cache" and "MemoryContext amoptions_cache_mcxt" aren't static declarations. They are being used only in toast_internals.c

---

The tab-completion doesn't show COMPRESSION :

postgres=# create access method my_method TYPE
INDEX   TABLE
postgres=# create access method my_method TYPE

Also, the below syntax would better be tab-completed so as to display all the installed compression methods, in line with how we show all the storage methods like plain, extended, etc.:

postgres=# ALTER TABLE lztab ALTER COLUMN t SET COMPRESSION

I could see the differences in compression ratio, and the compression and decompression speed, when I use pglz versus zlib :

CREATE TABLE zlibtab(t TEXT COMPRESSION zlib WITH (level '4'));
create table lztab(t text);
ALTER TABLE lztab ALTER COLUMN t SET COMPRESSION pglz;

pgg:s2:pg$ time psql -c "\copy zlibtab from text.data"
COPY 13050
real    0m1.344s
user    0m0.031s
sys     0m0.026s

pgg:s2:pg$ time psql -c "\copy lztab from text.data"
COPY 13050
real    0m2.088s
user    0m0.008s
sys     0m0.050s

pgg:s2:pg$ time psql -c "select pg_table_size('zlibtab'::regclass), pg_table_size('lztab'::regclass)"
 pg_table_size | pg_table_size
---------------+---------------
       1261568 |       1687552

pgg:s2:pg$ time psql -c "select NULL from zlibtab where t like ''" > /dev/null
real    0m0.127s
user    0m0.000s
sys     0m0.002s

pgg:s2:pg$ time psql -c "select NULL from lztab where t like ''" > /dev/null
real    0m0.050s
user    0m0.002s
sys     0m0.000s

-- Thanks, -Amit Khandekar Huawei Technologies
Re: Auto-vectorization speeds up multiplication of large-precision numerics
On Mon, 13 Jul 2020 at 14:27, Amit Khandekar wrote:
> I tried this in utils/adt/Makefile :
> +
> +numeric.o: CFLAGS += ${CFLAGS_VECTOR}
> +
> and it works.
>
> CFLAGS_VECTOR also includes the -funroll-loops option, which I
> believe had shown improvements in the checksum.c runs ( [1] ). This
> option makes the object file a bit bigger. For numeric.o, its size
> increased by 15K; from 116672 to 131360 bytes. I ran the
> multiplication test, and didn't see any additional speed-up with this
> option. Also, it does not seem to be related to vectorization. So I
> was thinking of splitting the CFLAGS_VECTOR into CFLAGS_VECTOR and
> CFLAGS_UNROLL_LOOPS. Checksum.c can use both these flags, and
> numeric.c can use only CFLAGS_VECTOR.

I did as above. Attached is the v2 patch.

In the case of an existing CFLAGS_VECTOR, an env variable could also be set by that name when running configure. I did the same for CFLAGS_UNROLL_LOOPS. Now, developers who are already using the CFLAGS_VECTOR env while configuring might be using it because their compilers don't have these compiler options, so they must be using some equivalent compiler options. numeric.c will now be compiled with CFLAGS_VECTOR, so for them it will now be compiled with their equivalent of the vectorize and unroll-loops options, which is ok, I think. Just that the numeric.o size will be increased, that's it.

> [1] https://www.postgresql.org/message-id/flat/CA%2BU5nML8JYeGqM-k4eEwNJi5H%3DU57oPLBsBDoZUv4cfcmdnpUA%40mail.gmail.com

-- Thanks, -Amit Khandekar Huawei Technologies

From 029373ab2e9d2e6802326871f248ba2f21ffb204 Mon Sep 17 00:00:00 2001
From: Amit Khandekar
Date: Tue, 21 Jul 2020 16:42:51 +0800
Subject: [PATCH] Auto-vectorize loop to speedup large-precision numeric product

A 'for' loop in mul_var() runs backwards by decrementing two variables. This prevents the gcc compiler from auto-vectorizing the for loop. So make it a forward loop with a single variable.
This gives performance benefits for products of numeric types with large precision, with speedups becoming noticeable for values with precisions starting from 20-40. The typical pattern of benefit is: precision 50: 5%; precision 60: 11%; 120: 50%; 240: 2.2x; and so on. On some CPU architectures, the speedup starts from precision 20 onwards. With the precisions used in the numeric_big regression test, the multiplication speeds up by 2.5 to 2.7 times.

Auto-vectorization happens with the gcc -O3 flag or -ftree-loop-vectorize. So arrange for the -ftree-loop-vectorize flag specifically when compiling numeric.c. CFLAGS_VECTOR was already present for similar functionality in checksum.c, but CFLAGS_VECTOR also includes -funroll-loops, which unnecessarily makes numeric.o larger. So split CFLAGS_VECTOR into CFLAGS_UNROLL_LOOPS and CFLAGS_VECTOR.
---
 configure                         | 17 +++--
 configure.in                      |  9 +++--
 src/Makefile.global.in            |  1 +
 src/backend/storage/page/Makefile |  2 +-
 src/backend/utils/adt/Makefile    |  3 +++
 src/backend/utils/adt/numeric.c   | 11 ---
 6 files changed, 31 insertions(+), 12 deletions(-)

diff --git a/configure b/configure
index cb8fbe1051..36ca734ad5 100755
--- a/configure
+++ b/configure
@@ -734,6 +734,7 @@ CPP
 CFLAGS_SL
 BITCODE_CXXFLAGS
 BITCODE_CFLAGS
+CFLAGS_UNROLL_LOOPS
 CFLAGS_VECTOR
 PERMIT_DECLARATION_AFTER_STATEMENT
 LLVM_BINPATH
@@ -5266,10 +5267,13 @@ BITCODE_CFLAGS=""
 user_BITCODE_CXXFLAGS=$BITCODE_CXXFLAGS
 BITCODE_CXXFLAGS=""

-# set CFLAGS_VECTOR from the environment, if available
+# set CFLAGS_VECTOR and CFLAGS_UNROLL_LOOPS from the environment, if available
 if test "$ac_env_CFLAGS_VECTOR_set" = set; then
   CFLAGS_VECTOR=$ac_env_CFLAGS_VECTOR_value
 fi
+if test "$ac_env_CFLAGS_UNROLL_LOOPS_set" = set; then
+  CFLAGS_UNROLL_LOOPS=$ac_env_CFLAGS_UNROLL_LOOPS_value
+fi

 # Some versions of GCC support some additional useful warning flags.
 # Check whether they are supported, and add them to CFLAGS if so.
@@ -6102,16 +6106,16 @@ if test x"$pgac_cv_prog_CXX_cxxflags__fexcess_precision_standard" = x"yes"; then fi - # Optimization flags for specific files that benefit from vectorization - { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -funroll-loops, for CFLAGS_VECTOR" >&5 -$as_echo_n "checking whether ${CC} supports -funroll-loops, for CFLAGS_VECTOR... " >&6; } + # Optimization flags for specific files that benefit from loop unrolling + { $as_echo "$as_me:${as_lineno-$LINENO}: checking whether ${CC} supports -funroll-loops, for CFLAGS_UNROLL_LOOPS" >&5 +$as_echo_n "checking whether ${CC} supports -funroll-loops, for CFLAGS_UNROLL_LOOPS... " >&6; } if ${pgac_cv_prog_CC_cflags__funroll_loops+:} false; then : $as_echo_n "(cached) " >
Re: Auto-vectorization speeds up multiplication of large-precision numerics
On Fri, 10 Jul 2020 at 19:02, Tom Lane wrote:
>
> Peter Eisentraut writes:
> > We normally don't compile with -O3, so very few users would get the
> > benefit of this.
>
> Yeah. I don't think changing that baseline globally would be a wise move.
>
> > We have CFLAGS_VECTOR for the checksum code. I
> > suppose if we are making the numeric code vectorizable as well, we
> > should apply this there also.
>
> > I think we need a bit of a policy decision from the group here.
>
> I'd vote in favor of applying CFLAGS_VECTOR to specific source files
> that can benefit. We already have experience with that and we've not
> detected any destabilization potential.

I tried this in utils/adt/Makefile :

+
+numeric.o: CFLAGS += ${CFLAGS_VECTOR}
+

and it works.

CFLAGS_VECTOR also includes the -funroll-loops option, which I believe had shown improvements in the checksum.c runs ( [1] ). This option makes the object file a bit bigger. For numeric.o, its size increased by 15K; from 116672 to 131360 bytes. I ran the multiplication test, and didn't see any additional speed-up with this option. Also, it does not seem to be related to vectorization. So I was thinking of splitting CFLAGS_VECTOR into CFLAGS_VECTOR and CFLAGS_UNROLL_LOOPS. checksum.c can use both these flags, and numeric.c can use only CFLAGS_VECTOR.

I was also wondering if it's worth extracting only the code that we think can be optimized and keeping it in a separate file (say numeric_vectorize.c or adt_vectorize.c, which can have mul_var() to start with), and using this file as a collection of all such code in the adt module; then we can add such files into other modules as and when needed. For numeric.c, there can already be some scope for auto-vectorization in other similar regions in that file, so we don't require a separate numeric_vectorize.c and can just pass the CFLAGS_VECTOR flag for this file itself.
[1] https://www.postgresql.org/message-id/flat/CA%2BU5nML8JYeGqM-k4eEwNJi5H%3DU57oPLBsBDoZUv4cfcmdnpUA%40mail.gmail.com#2ec419817ff429588dd1229fb663080e -- Thanks, -Amit Khandekar Huawei Technologies
Re: calling procedures is slow and consumes extra much memory against calling function
On Wed, 17 Jun 2020 at 13:54, Pavel Stehule wrote:
>
> On Wed, 17 Jun 2020 at 7:52, Amit Khandekar wrote:
>>
>> On Wed, 10 Jun 2020 at 17:12, Pavel Stehule wrote:
>> > On Wed, 10 Jun 2020 at 12:26, Amit Khandekar wrote:
>> >> Could you show an example testcase that tests this recursive scenario,
>> >> with which your earlier patch fails the test, and this v2 patch passes
>> >> it ? I am trying to understand the recursive scenario and the re-use
>> >> of expr->plan.
>> >
>> > it hangs on plpgsql tests. So you can apply first version of patch
>> > and "make check"
>>
>> I could not reproduce the make check hang with the v1 patch. But I
>> could see a crash with the below testcase. So I understand the purpose
>> of the plan_owner variable that you introduced in v2.
>>
>> Consider this recursive test :
>>
>> create or replace procedure p1(in r int) as $$
>> begin
>>    RAISE INFO 'r : % ', r;
>>    if r < 3 then
>>      call p1(r+1);
>>    end if;
>> end
>> $$ language plpgsql;
>>
>> do $$
>> declare r int default 1;
>> begin
>> call p1(r);
>> end;
>> $$;
>>
>> In p1() with r=2, when the stmt "call p1(r+1)" is being executed,
>> consider this code of exec_stmt_call() with your v2 patch applied:
>>
>> if (expr->plan && !expr->plan->saved)
>> {
>>    if (plan_owner)
>>      SPI_freeplan(expr->plan);
>>    expr->plan = NULL;
>> }
>>
>> Here, plan_owner is false. So SPI_freeplan() will not be called, and
>> expr->plan is set to NULL. Now I have observed that the stmt pointer
>> and expr pointer are shared between the p1() execution at this r=2
>> level and the p1() execution at the r=1 level. So after the above code
>> is executed at r=2, when the upper level (r=1) exec_stmt_call() lands
>> at the same code snippet, it gets the same expr pointer, but its
>> expr->plan is already set to NULL without being freed.
From this >> logic, it looks like the plan won't get freed whenever the expr/stmt >> pointers are shared across recursive levels, since expr->plan is set >> to NULL at the lowermost level ? Basically, the handle to the plan is >> lost so no one at the upper recursion level can explicitly free it >> using SPI_freeplan(), right ? This looks the same as the main issue >> where the plan does not get freed for non-recursive calls. I haven't >> got a chance to check if we can develop a testcase for this, similar >> to your testcase where the memory keeps on increasing. > > > This is a good consideration. > > I am sending updated patch Checked the latest patch. Looks like using a local plan rather than the expr->plan pointer for doing the checks does seem to resolve the issue I raised. That made me think of another scenario : now we are checking for the plan value and then null'ifying the expr->plan value. What if expr->plan is different from plan ? Is it possible ? I was thinking of such scenarios, but couldn't find one. As long as a plan is always created with saved=true for all levels, or with saved=false for all levels, we are ok. If we could have a mix of saved and unsaved plans at different recursion levels, then expr->plan could be different from the outer local plan, because then the expr->plan would not be set to NULL in the inner level, while the outer level may have created its own plan. But I think a mix of saved and unsaved plans is not possible. If you agree, then I think we should at least have an assert that looks like :

if (plan && !plan->saved)
{
    if (plan_owner)
        SPI_freeplan(plan);
    /* If expr->plan is present, it must be the same plan that we allocated */
    Assert(!expr->plan || plan == expr->plan);
    expr->plan = NULL;
}

Other than this, I have no other issues.
I understand that we have to do this special handling only for this exec_stmt_call() because it is only here that we call exec_prepare_plan() with keep_plan = false, so doing special handling for freeing the plan seems to make sense.
Re: Auto-vectorization speeds up multiplication of large-precision numerics
FYI : this one is present in the July commitfest.
Re: Inlining of couple of functions in pl_exec.c improves performance
On Sat, 4 Jul 2020 at 01:19, Tom Lane wrote: > > I did some performance testing on 0001+0002 here, and found that > for me, there's basically no change on x86_64 but a win of 2 to 3 > percent on aarch64, using Pavel's pi_est_1() as a representative > case for simple plpgsql statements. That squares with your original > results I believe. It's not clear to me whether any of the later > tests in this thread measured these changes in isolation, or only > with 0003 added. Yeah I had the same observation. 0001+0002 seems to benefit mostly on aarch64. And 0003 (exec_case_value) benefited both on amd64 and aarch64. > > Anyway, that's good enough for me, so I pushed 0001+0002 after a > little bit of additional cosmetic tweaking. > > I attach your original 0003 here (it still applies, with some line > offsets). That's just so the cfbot doesn't get confused about what > it's supposed to test now. Thanks for pushing all the three ! Thanks, -Amit Khandekar Huawei Technologies
Re: Inlining of couple of functions in pl_exec.c improves performance
On Thu, 2 Jul 2020 at 03:47, Tom Lane wrote: > > Amit Khandekar writes: > > There are a couple of function call overheads I observed in pl/pgsql > > code : exec_stmt() and exec_cast_value(). Removing these overheads > > resulted in some performance gains. > > I took a look at the 0001/0002 patches (not 0003 as yet). I do not > like 0001 too much. The most concrete problem with it is that > you broke translatability of the error messages, cf the first > translatability guideline at [1]. Yeah, I thought we can safely use %s for proper nouns such as "trigger procedure" or "function" as those would not be translated. But looks like even if they won't be translated, the difference in word order among languages might create problems with this. > While that could be fixed by passing > the entire message not just part of it, I don't see anything that we're > gaining by moving that stuff into exec_toplevel_block in the first place. > Certainly, passing a string that describes what will happen *after* > exec_toplevel_block is just weird. I think what you've got here is > a very arbitrary chopping-up of the existing code based on chance > similarities of the existing callers. I think we're better off to make > exec_toplevel_block be as nearly as possible a match for exec_stmts' > semantics. I thought some of those things that I kept in exec_toplevel_block() do look like they belong to a top level function. But what you are saying also makes sense : better to keep it similar to exec_stmts. > > Hence, I propose the attached 0001 to replace 0001/0002. This should > be basically indistinguishable performance-wise, though I have not > tried to benchmark. Thanks for the patches. Yeah, performance-wise it does look similar; but anyways I tried running, and got similar performance numbers. > Note that for reviewability's sake, I did not > reindent the former body of exec_stmt, though we'd want to do that > before committing. Right. 
> > Also, 0002 is a small patch on top of that to avoid redundant saves > and restores of estate->err_stmt within the loop in exec_stmts. This > may well not be a measurable improvement, but it's a pretty obvious > inefficiency in exec_stmts now that it's refactored this way. 0002 also makes sense.
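For illustration, the shape of the 0002 change can be modeled roughly like this (a toy sketch with made-up types; not the actual pl_exec.c code): the save/restore of estate->err_stmt is hoisted out of the per-statement loop instead of being repeated for every statement.

```c
#include <assert.h>

/* Toy model of the refactoring: Stmt and ExecState are stand-ins for the
 * real PLpgSQL_stmt and PLpgSQL_execstate structures. */
typedef struct Stmt { int id; } Stmt;
typedef struct ExecState { Stmt *err_stmt; } ExecState;

static void exec_stmts(ExecState *estate, Stmt *stmts, int nstmts)
{
    /* Save once before the loop, not once per statement ... */
    Stmt *save_estmt = estate->err_stmt;

    for (int i = 0; i < nstmts; i++)
        estate->err_stmt = &stmts[i];   /* error reports would point here */

    /* ... and restore once after the loop. */
    estate->err_stmt = save_estmt;
}
```

The observable behavior is unchanged: after the loop, err_stmt again points at the enclosing statement.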
Re: calling procedures is slow and consumes extra much memory against calling function
On Wed, 10 Jun 2020 at 17:12, Pavel Stehule wrote: > st 10. 6. 2020 v 12:26 odesílatel Amit Khandekar > napsal: >> Could you show an example testcase that tests this recursive scenario, >> with which your earlier patch fails the test, and this v2 patch passes >> it ? I am trying to understand the recursive scenario and the re-use >> of expr->plan. > > > it hangs on plpgsql tests. So you can apply first version of patch > > and "make check" I could not reproduce the make check hang with the v1 patch. But I could see a crash with the below testcase. So I understand the purpose of the plan_owner variable that you introduced in v2. Consider this recursive test :

create or replace procedure p1(in r int) as $$
begin
   RAISE INFO 'r : % ', r;
   if r < 3 then
      call p1(r+1);
   end if;
end
$$ language plpgsql;

do $$
declare r int default 1;
begin
  call p1(r);
end;
$$;

In p1() with r=2, when the stmt "call p1(r+1)" is being executed, consider this code of exec_stmt_call() with your v2 patch applied:

if (expr->plan && !expr->plan->saved)
{
    if (plan_owner)
        SPI_freeplan(expr->plan);
    expr->plan = NULL;
}

Here, plan_owner is false. So SPI_freeplan() will not be called, and expr->plan is set to NULL. Now I have observed that the stmt pointer and expr pointer is shared between the p1() execution at this r=2 level and the p1() execution at r=1 level. So after the above code is executed at r=2, when the upper level (r=1) exec_stmt_call() lands to the same above code snippet, it gets the same expr pointer, but its expr->plan is already set to NULL without being freed.
I haven't got a chance to check if we can develop a testcase for this, similar to your testcase where the memory keeps on increasing. -Amit
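The lost-handle pattern described above can be modeled with a small standalone C sketch (a toy, not the real plpgsql code: Expr, stmt_call, and the level/plan_owner arguments are stand-ins). The inner recursion level does not own the plan, so it only NULLs the shared expr->plan, after which the owning outer level has nothing left to free:

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Expr { void *plan; } Expr;

static void *leaked;   /* records what the non-owning level orphaned */

static void stmt_call(Expr *expr, int level, int plan_owner)
{
    if (expr->plan == NULL)
        expr->plan = malloc(64);        /* stand-in for plan creation */

    if (level < 2)
        stmt_call(expr, level + 1, 0);  /* recursive CALL: not the owner */

    /* mimics: if (expr->plan && !expr->plan->saved) { ... } */
    if (expr->plan)
    {
        if (plan_owner)
            free(expr->plan);
        else
            leaked = expr->plan;        /* handle lost: nobody frees this */
        expr->plan = NULL;
    }
}
```

Running stmt_call(&expr, 1, 1) leaves expr.plan NULL at the outer level while the allocation was never freed, which is exactly the leak scenario.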
Re: Auto-vectorization speeds up multiplication of large-precision numerics
On Wed, 10 Jun 2020 at 04:20, Peter Eisentraut wrote: > > On 2020-06-09 13:50, Amit Khandekar wrote: > > Also, the regress/sql/numeric_big test itself speeds up by 80% > > That's nice. I can confirm the speedup: > > -O3 without the patch: > > numeric ... ok 737 ms > test numeric_big ... ok 1014 ms > > -O3 with the patch: > > numeric ... ok 680 ms > test numeric_big ... ok 580 ms > > Also: > > -O2 without the patch: > > numeric ... ok 693 ms > test numeric_big ... ok 1160 ms > > -O2 with the patch: > > numeric ... ok 677 ms > test numeric_big ... ok 917 ms > > So the patch helps either way. Oh, I hadn't observed that the patch helps numeric_big.sql to speed up to some extent even with -O2. For me, it takes 805 ms on head and 732 ms with the patch. One possible reason that I can think of is : because of the forward loop traversal, pre-fetching might be helping. But this is just a wild guess.

-O3 : HEAD
test numeric ... ok 102 ms
test numeric_big ... ok 803 ms

-O3 : patched
test numeric ... ok 100 ms
test numeric_big ... ok 450 ms

-O2 : HEAD
test numeric ... ok 100 ms
test numeric_big ... ok 805 ms

-O2 : patched
test numeric ... ok 103 ms
test numeric_big ... ok 732 ms

> But it also seems that without the patch, -O3 might > be a bit slower in some cases. This might need more testing. For me, there is no observed change in the times with -O2 versus -O3, on head. Are you consistently getting slower numeric*.sql tests with -O3 on the current code ? Not sure what the reason might be, but that would not be related to the patch. Or is it in the context of the patch that you are suggesting it might need more testing ? There might be existing cases that run a bit slower with -O3, but that's strange actually. Probably the optimizations in those cases don't work out the way the compiler expected, and in fact work negatively. Probably that's one of the reasons -O2 is the default choice.
> > > For the for loop to be auto-vectorized, I had to simplify it to > > something like this : > > Well, how do we make sure we keep it that way? How do we prevent some > random rearranging of the code or some random compiler change to break > this again? I believe the compiler rearranges code segments w.r.t. one another only when they are independent of each other, and the compiler is able to identify that. In our case, it's the for loop. There are some variables used inside it, and those would prevent it from moving the for loop. Even if the compiler finds it safe to move the loop relative to surrounding code, it would not split the for loop contents themselves, so the for loop will remain intact, and so would the vectorization, although it seems to keep an unrolled version of the loop in case it is called with smaller iteration counts. But yes, if someone in the future tries to change the for loop, it would possibly break the auto-vectorization. So, we should have appropriate comments (the patch has those). Let me know if you find any possible reasons due to which the compiler would in the future stop the vectorization even when there is no change in the for loop. It might be safer to take the for loop out into an inline function, just to protect it ?
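To make the discussion concrete, here is a standalone toy sketch (not the PostgreSQL source; function names, array sizes, and values are made up) of the two loop shapes being discussed: the backward loop with two induction variables, which gcc declines to vectorize, and the equivalent single-variable forward loop that it can vectorize at -O3:

```c
#include <assert.h>

/* Backward loop, two induction variables (i2 and i) -- the shape that
 * defeats gcc's loop vectorizer. */
static void mul_backward(int *dig, const int *src, int m, int start, int n)
{
    int i2, i;

    for (i2 = n - 1, i = start + i2; i2 >= 0; i2--)
        dig[i--] += m * src[i2];
}

/* Equivalent forward loop with a single induction variable -- the shape
 * the vectorizer can handle. */
static void mul_forward(int *dig, const int *src, int m, int start, int n)
{
    int *digptr = &dig[start];

    for (int i = 0; i < n; i++)
        digptr[i] += m * src[i];
}
```

Both produce identical results (dig[start + k] += m * src[k] for each k), so the rewrite is purely a form change for the optimizer's benefit.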
Re: Inlining of couple of functions in pl_exec.c improves performance
On Tue, 9 Jun 2020 at 21:49, Pavel Stehule wrote: > Is your patch in commitfest in commitfest application? Thanks for reminding me. Just added. https://commitfest.postgresql.org/28/2590/
Re: calling procedures is slow and consumes extra much memory against calling function
On Sat, 16 May 2020 at 00:07, Pavel Stehule wrote: > > Hi > >>> >>> The problem is in plpgsql implementation of CALL statement >>> >>> In non atomic case - case of using procedures from DO block, the >>> expression plan is not cached, and plan is generating any time. This is >>> reason why it is slow. >>> >>> Unfortunately, generated plans are not released until SPI_finish. Attached >>> patch fixed this issue. >> >> >> But now, recursive calling doesn't work :-(. So this patch is not enough > > > Attached patch is working - all tests passed Could you show an example testcase that tests this recursive scenario, with which your earlier patch fails the test, and this v2 patch passes it ? I am trying to understand the recursive scenario and the re-use of expr->plan. > > It doesn't solve performance, and doesn't solve all memory problems, but > significantly reduce memory requirements from 5007 bytes to 439 bytes per one > CALL So now this patch's intention is to reduce memory consumption, and it doesn't target slowness improvement, right ? -- Thanks, -Amit Khandekar Huawei Technologies
Auto-vectorization speeds up multiplication of large-precision numerics
Auto-vectorization is enabled by -ftree-loop-vectorize; with -O3 it is always enabled. What we can do in the future is to have a separate file that has such optimized code that is proven to work with such optimization flags, and enable the required compiler flags only for such files, if the build is done with -O2. [1] https://gcc.gnu.org/projects/tree-ssa/vectorization.html#using -- Thanks, -Amit Khandekar Huawei Technologies

create_schema.sql Description: application/sql

From 91aa5ef1034b763e279b4a8970101355e4a79600 Mon Sep 17 00:00:00 2001
From: Amit Khandekar
Date: Tue, 9 Jun 2020 16:35:23 +0530
Subject: [PATCH] Auto-vectorize loop to speedup large-precision numeric product

A 'for' loop in mul_var() runs backwards by decrementing two variables. This prevents the gcc compiler from auto-vectorizing the for loop. So make it a forward loop with a single variable. This gives performance benefits for products of numeric types with large precision, with speedups becoming noticeable at precisions starting from 20-40. The typical pattern of benefit is : precision 50: 5%; precision 60: 11%; 120: 50%; 240: 2.2x; and so on. On some CPU architectures, the speedup starts from 20 precision onwards. With the precisions used in the numeric_big regression test, the multiplication speeds up by 2.5 to 2.7 times. Auto-vectorization happens with the -O3 flag or -ftree-loop-vectorize. So this benefit will be seen when built with gcc -O3.
---
 src/backend/utils/adt/numeric.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/src/backend/utils/adt/numeric.c b/src/backend/utils/adt/numeric.c
index f3a725271e..4243242ad9 100644
--- a/src/backend/utils/adt/numeric.c
+++ b/src/backend/utils/adt/numeric.c
@@ -7226,6 +7226,7 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 	int			res_weight;
 	int			maxdigits;
 	int		   *dig;
+	int		   *digptr;
 	int			carry;
 	int			maxdig;
 	int			newdig;
@@ -7362,10 +7363,14 @@ mul_var(const NumericVar *var1, const NumericVar *var2, NumericVar *result,
 	 *
 	 * As above, digits of var2 can be ignored if they don't contribute,
 	 * so we only include digits for which i1+i2+2 <= res_ndigits - 1.
+	 *
+	 * For large precisions, this can become a bottleneck; so keep this for
+	 * loop simple so that it can be auto-vectorized.
 	 */
-	for (i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3), i = i1 + i2 + 2;
-		 i2 >= 0; i2--)
-		dig[i--] += var1digit * var2digits[i2];
+	i2 = Min(var2ndigits - 1, res_ndigits - i1 - 3);
+	digptr = &dig[i1 + 2];
+	for (i = 0; i <= i2; i++)
+		digptr[i] += var1digit * var2digits[i];
 	}

 	/*
-- 
2.17.1
Re: Inlining of couple of functions in pl_exec.c improves performance
On Mon, 1 Jun 2020 at 12:27, Pavel Stehule wrote: > po 1. 6. 2020 v 8:15 odesílatel Amit Khandekar > napsal: >> >> On Sat, 30 May 2020 at 11:11, Pavel Stehule wrote: >> > I think so the effect of these patches strongly depends on CPU and compile >> >> I quickly tried pi() with gcc 10 as well, and saw more or less the >> same benefit. I think, we are bound to see some differences in the >> benefits across architectures, kernels and compilers; but looks like >> some benefit is always there. >> >> > but it is micro optimization, and when I look to profiler, the bottle neck >> > is elsewhere. >> >> Please check the perf numbers in my reply to Michael. I suppose you >> meant CachedPlanIsSimplyValid() when you say the bottle neck is >> elsewhere ? Yeah, this function is always the hottest spot, which I >> recall is being discussed elsewhere. But I always see exec_stmt(), >> exec_assign_value as the next functions. > > > It is hard to read the profile result, because these functions are nested > together. For your example > > 18.22% postgres postgres [.] CachedPlanIsSimplyValid > > Is little bit strange, and probably this is real bottleneck in your simple > example, and maybe some work can be done there, because you assign just > constant. I had earlier had a quick look at this one. CachedPlanIsSimplyValid() was, I recall, hitting a hotspot when it tries to access plansource->search_path (possibly a cacheline miss). But I didn't get a chance to dig further into that. For now, I am focusing on these other functions for which the patches were submitted. > > On second hand, your example is pretty unrealistic - and against any > developer's best practices for writing cycles. > > I think so we can look on PostGIS, where is some computing heavy routines in > PLpgSQL, and we can look on real profiles. > > Probably the most people will have benefit from these optimization. I understand it's not a real world example.
For generating perf figures, I had to use an example which amplifies the benefits, so that the effect of the patches on the perf figures also becomes visible. Hence, I used that example. I had shown the benefits up-thread using a practical function avg_space(). But the perf figures for that example were varying a lot. So below, what I did was : run avg_space() ~150 times, and take a perf report. This stabilized the results a bit :

HEAD :
+ 16.10% 17.29% 16.82%  postgres  postgres    [.] ExecInterpExpr
+ 13.80% 13.56% 14.49%  postgres  plpgsql.so  [.] exec_assign_value
+ 12.64% 12.10% 12.09%  postgres  plpgsql.so  [.] plpgsql_param_eval_var
+ 12.15% 11.28% 11.05%  postgres  plpgsql.so  [.] exec_stmt
+ 10.81% 10.24% 10.55%  postgres  plpgsql.so  [.] exec_eval_expr
+  9.50%  9.35%  9.37%  postgres  plpgsql.so  [.] exec_cast_value
...
+  1.19%  1.06%  1.21%  postgres  plpgsql.so  [.] exec_stmts

0001+0002 patches applied (i.e. inline exec_stmt) :
+ 16.90% 17.20% 16.54%  postgres  postgres    [.] ExecInterpExpr
+ 16.42% 15.37% 15.28%  postgres  plpgsql.so  [.] exec_assign_value
+ 11.34% 11.92% 11.93%  postgres  plpgsql.so  [.] plpgsql_param_eval_var
+ 11.18% 11.86% 10.99%  postgres  plpgsql.so  [.] exec_stmts.part.0
+ 10.51%  9.52% 10.61%  postgres  plpgsql.so  [.] exec_eval_expr
+  9.39%  9.48%  9.30%  postgres  plpgsql.so  [.] exec_cast_value

HEAD : exec_stmts + exec_stmt = ~12.7 %
Patched (0001+0002) : exec_stmts = 11.3 %

Just the 0003 patch applied (i.e. inline exec_cast_value) :
+ 17.00% 16.77% 17.09%  postgres  postgres    [.] ExecInterpExpr
+ 15.21% 15.64% 15.09%  postgres  plpgsql.so  [.] exec_assign_value
+ 14.48% 14.06% 13.94%  postgres  plpgsql.so  [.] exec_stmt
+ 13.26% 13.30% 13.14%  postgres  plpgsql.so  [.] plpgsql_param_eval_var
+ 11.48% 11.64% 12.66%  postgres  plpgsql.so  [.] exec_eval_expr
+  1.03%  0.85%  0.87%  postgres  plpgsql.so  [.] exec_stmts

HEAD : exec_assign_value + exec_cast_value = ~23.4 %
Patched (0003) : exec_assign_value = 15.3 %

Time in milliseconds after calling avg_space() 150 times :
HEAD : 7210
Patch 0001+0002 : 6925
Patch 0003 : 6670
Patch 0001+0002+0003 : 6346
Re: Inlining of couple of functions in pl_exec.c improves performance
On Sat, 30 May 2020 at 11:11, Pavel Stehule wrote: > I think so the effect of these patches strongly depends on CPU and compile I quickly tried pi() with gcc 10 as well, and saw more or less the same benefit. I think, we are bound to see some differences in the benefits across architectures, kernels and compilers; but looks like some benefit is always there. > but it is micro optimization, and when I look to profiler, the bottle neck is > elsewhere. Please check the perf numbers in my reply to Michael. I suppose you meant CachedPlanIsSimplyValid() when you say the bottle neck is elsewhere ? Yeah, this function is always the hottest spot, which I recall is being discussed elsewhere. But I always see exec_stmt(), exec_assign_value as the next functions. >> > patch 0002 it is little bit against simplicity, but for PLpgSQL with >> > blocks structure it is correct. >> >> Here, I moved the exec_stmt code into exec_stmts() function because >> exec_stmts() was the only caller, and that function is not that big. I >> am assuming you were referring to this point when you said it is a bit >> against simplicity. But I didn't get what you implied by "but for >> PLpgSQL with blocks structure it is correct" > > > Nested statement in PLpgSQL is always a list of statements. It is not single > statement ever. So is not too strange don't have a function execute_stmt. Right.
Re: Inlining of couple of functions in pl_exec.c improves performance
On Sun, 31 May 2020 at 08:04, Michael Paquier wrote: > This stuff is interesting. Do you have some perf profiles to share? > I am wondering what's the effect of the inlining with your test > cases. Below are the perf numbers for assignmany.sql :

HEAD :
+ 16.88%  postgres  postgres           [.] CachedPlanIsSimplyValid
+ 16.64%  postgres  plpgsql.so         [.] exec_stmt
+ 15.56%  postgres  plpgsql.so         [.] exec_eval_expr
+ 13.58%  postgres  plpgsql.so         [.] exec_assign_value
+  7.49%  postgres  plpgsql.so         [.] exec_cast_value
+  7.17%  postgres  plpgsql.so         [.] exec_assign_expr
+  5.39%  postgres  postgres           [.] MemoryContextReset
+  3.91%  postgres  postgres           [.] ExecJustConst
+  3.33%  postgres  postgres           [.] recomputeNamespacePath
+  2.88%  postgres  postgres           [.] OverrideSearchPathMatchesCurrent
+  2.18%  postgres  plpgsql.so         [.] exec_eval_cleanup.isra.17
+  2.15%  postgres  plpgsql.so         [.] exec_stmts
+  1.32%  postgres  plpgsql.so         [.] MemoryContextReset@plt
+  0.57%  postgres  plpgsql.so         [.] CachedPlanIsSimplyValid@plt
+  0.57%  postgres  postgres           [.] GetUserId
   0.30%  postgres  plpgsql.so         [.] assign_simple_var.isra.13
   0.05%  postgres  [kernel.kallsyms]  [k] unmap_page_range

Patched :
+ 18.22%  postgres  postgres           [.] CachedPlanIsSimplyValid
+ 17.25%  postgres  plpgsql.so         [.] exec_eval_expr
+ 16.31%  postgres  plpgsql.so         [.] exec_stmts
+ 15.00%  postgres  plpgsql.so         [.] exec_assign_value
+  7.56%  postgres  plpgsql.so         [.] exec_assign_expr
+  5.64%  postgres  postgres           [.] MemoryContextReset
+  5.16%  postgres  postgres           [.] ExecJustConst
+  4.86%  postgres  postgres           [.] recomputeNamespacePath
+  4.54%  postgres  postgres           [.] OverrideSearchPathMatchesCurrent
+  2.33%  postgres  plpgsql.so         [.] exec_eval_cleanup.isra.17
+  1.26%  postgres  plpgsql.so         [.] MemoryContextReset@plt
+  0.81%  postgres  postgres           [.] GetUserId
+  0.71%  postgres  plpgsql.so         [.] CachedPlanIsSimplyValid@plt
   0.26%  postgres  plpgsql.so         [.] assign_simple_var.isra.13
   0.03%  postgres  [kernel.kallsyms]  [k] unmap_page_range
   0.02%  postgres  [kernel.kallsyms]  [k] mark_page_accessed

Notice the reduction in percentages :
HEAD    : exec_stmts + exec_stmt = 18.79
Patched : exec_stmts = 16.31
HEAD    : exec_assign_value + exec_cast_value = 21.07
Patched : exec_assign_value = 15.00

As expected, the reduction in these two functions caused other functions like CachedPlanIsSimplyValid() and exec_eval_expr() to show a rise in their percentages.
Re: Inlining of couple of functions in pl_exec.c improves performance
On Thu, 28 May 2020 at 14:39, Pavel Stehule wrote: > I don't see 17% anywhere, but 3-5% is not bad. Did you see 3-5% only for the pi function, or did you see the same improvement also for the functions that I wrote ? I was getting a consistent result of 14-18 % on both of the VMs. Also, is your test machine running on Windows ? All the machines I tested were on Linux kernel (Ubuntu) Below are my results for your pi_est_1() function. For this function, I am consistently getting 5-9 % improvement. I tested on 3 machines : gcc : 8.4.0. -O2 option OS : Ubuntu Bionic explain analyze select pi_est_1(1000) 1. x86_64 laptop VM (Intel Core i7-8665U) HEAD :2666 2617 2600 2630 Patched : 2502 2409 2460 2444 2. x86_64 VM (Xeon Gold 6151) HEAD :1664 1662 1721 1660 Patched : 1541 1548 1537 1526 3. ARM64 VM (Kunpeng) HEAD :2873 2864 2860 2861 Patched : 2568 2513 2501 2538 > > patch 0001 has sense and can help with code structure > patch 0002 it is little bit against simplicity, but for PLpgSQL with blocks > structure it is correct. Here, I moved the exec_stmt code into exec_stmts() function because exec_stmts() was the only caller, and that function is not that big. I am assuming you were referring to this point when you said it is a bit against simplicity. But I didn't get what you implied by "but for PLpgSQL with blocks structure it is correct" -- Thanks, -Amit Khandekar Huawei Technologies
Re: Inlining of couple of functions in pl_exec.c improves performance
On Tue, 26 May 2020 at 09:06, Amit Khandekar wrote: > > On Sat, 23 May 2020 at 23:24, Pavel Stehule wrote: > > > >FOR counter IN 1..180 LOOP > > id = 0; id = 0; id1 = 0; > > id2 = 0; id3 = 0; id1 = 0; id2 = 0; > > id3 = 0; id = 0; id = 0; id1 = 0; > > id2 = 0; id3 = 0; id1 = 0; id2 = 0; > > id3 = 0; > >END LOOP; > > > > This is not too much typical PLpgSQL code. All expressions are not > > parametrized - so this test is little bit obscure. > > > > Last strange performance plpgsql benchmark did calculation of pi value. It > > does something real > > Yeah, basically I wanted to have many statements, and that too with > many assignments where casts are not required. Let me check if I can > come up with a real-enough testcase. Thanks. create table tab (id int[]); insert into tab select array((select ((random() * 10)::bigint) id from generate_series(1, 3) order by 1)); insert into tab select array((select ((random() * 60)::bigint) id from generate_series(1, 3) order by 1)); insert into tab select array((select ((random() * 100)::bigint) id from generate_series(1, 3) order by 1)); insert into tab select array((select ((random() * 10)::bigint) id from generate_series(1, 3) order by 1)); insert into tab select array((select ((random() * 60)::bigint) id from generate_series(1, 3) order by 1)); insert into tab select array((select ((random() * 100)::bigint) id from generate_series(1, 3) order by 1)); insert into tab select array((select ((random() * 10)::bigint) id from generate_series(1, 3) order by 1)); insert into tab select array((select ((random() * 60)::bigint) id from generate_series(1, 3) order by 1)); insert into tab select array((select ((random() * 100)::bigint) id from generate_series(1, 3) order by 1)); -- Return how much two consecutive array elements are apart from each other, on average; i.e. how much the numbers are spaced out. -- Input is an ordered array of integers. 
CREATE OR REPLACE FUNCTION avg_space(int[]) RETURNS bigint AS $$
DECLARE
  diff int = 0;
  num int;
  prevnum int = 1;
BEGIN
  FOREACH num IN ARRAY $1 LOOP
    diff = diff + num - prevnum;
    prevnum = num;
  END LOOP;
  RETURN diff/array_length($1, 1);
END;
$$ LANGUAGE plpgsql;

explain analyze select avg_space(id) from tab;

Like earlier figures, these are execution times in milliseconds, taken from explain-analyze.

ARM VM:
HEAD                 : 49.8
patch 0001+0002      : 47.8 => 4.2%
patch 0001+0002+0003 : 42.9 => 16.1%

x86 VM:
HEAD                 : 32.8
patch 0001+0002      : 32.7 => 0%
patch 0001+0002+0003 : 28.0 => 17.1%
Re: Inlining of couple of functions in pl_exec.c improves performance
On Sat, 23 May 2020 at 23:24, Pavel Stehule wrote: > >FOR counter IN 1..180 LOOP > id = 0; id = 0; id1 = 0; > id2 = 0; id3 = 0; id1 = 0; id2 = 0; > id3 = 0; id = 0; id = 0; id1 = 0; > id2 = 0; id3 = 0; id1 = 0; id2 = 0; > id3 = 0; >END LOOP; > > This is not too much typical PLpgSQL code. All expressions are not > parametrized - so this test is little bit obscure. > > Last strange performance plpgsql benchmark did calculation of pi value. It > does something real Yeah, basically I wanted to have many statements, and that too with many assignments where casts are not required. Let me check if I can come up with a real-enough testcase. Thanks.
Inlining of couple of functions in pl_exec.c improves performance
There are a couple of function call overheads I observed in pl/pgsql code : exec_stmt() and exec_cast_value(). Removing these overheads resulted in some performance gains. exec_stmt() : plpgsql_exec_function() and other toplevel block executors currently call exec_stmt(). But actually they don't need to do everything that exec_stmt() does. So they can call a new function instead of exec_stmt(), and all the exec_stmt() code can be moved to exec_stmts(). The things that exec_stmt() do, but are not necessary for a top level block stmt, are : 1. save_estmt = estate->err_stmt; estate->err_stmt = stmt; For top level blocks, saving the estate->err_stmt is not necessary, because there is no statement after this block statement. Anyways, plpgsql_exec_function() assigns estate.err_stmt just before calling exec_stmt so there is really no point in exec_stmt() setting it again. 2. CHECK_FOR_INTERRUPTS() This is not necessary for toplevel block callers. 3. exec_stmt_block() can be directly called rather than exec_stmt() because func->action is a block statement. So the switch statement is not necessary. But this one might be necessary for toplevel block statement: if (*plpgsql_plugin_ptr && (*plpgsql_plugin_ptr)->stmt_beg) ((*plpgsql_plugin_ptr)->stmt_beg) (estate, stmt); There was already a repetitive code in plpgsql_exec_function() and other functions around the exec_stmt() call. So in a separate patch 0001*.patch, I moved that code into a common function exec_toplevel_block(). In the main patch 0002-Get-rid-of-exec_stmt-function-call.patch, I additionally called plpgsql_plugin_ptr->stmt_beg() inside exec_toplevel_block(). And moved exec_stmt() code into exec_stmts(). exec_cast_value() : This function does not do the casting if not required. So moved the code that actually does the cast into a separate function, so as to reduce the exec_cast_value() code and make it inline. 
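A rough sketch of that exec_cast_value() restructuring, with simplified types rather than the actual pl_exec.c signatures: the cheap "is a cast needed?" test stays in a small wrapper that the compiler can inline at every call site, while the expensive conversion lives in a separate, rarely taken helper.

```c
#include <assert.h>

/* Placeholder for the real conversion work (the cold path that was split
 * out so the wrapper stays small enough to inline). */
static int do_cast_value(int value, int totype)
{
    return value * totype;   /* made-up "conversion" for illustration */
}

/* Small inlinable wrapper: the common no-cast-needed case returns
 * immediately without a function call. */
static inline int cast_value(int value, int fromtype, int totype)
{
    if (fromtype == totype)
        return value;                    /* fast path: no cast required */
    return do_cast_value(value, totype); /* cold path, out of line */
}
```

The design point is that only the branch gets duplicated at each call site, not the conversion machinery, so the instruction-cache cost of inlining stays low.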
Attached is the 0003-Inline-exec_cast_value.patch Testing -- I used two available VMs (one x86_64 and the other arm64), and the benefit showed up on both of these machines. Attached patches 0001, 0002, 0003 are to be applied in that order. 0001 is just a preparatory patch. First I tried with a simple for loop with a single assignment (attached forcounter.sql) By inlining of the two functions, found noticeable reduction in execution time as shown (figures are in milliseconds, averaged over multiple runs; taken from 'explain analyze' execution times) : ARM VM : HEAD : 100 ; Patched : 88 => 13.6% improvement x86 VM : HEAD : 71 ; Patched : 66 => 7.63% improvement. Then I included many assignment statements as shown in attachment assignmany.sql. This showed further benefit : ARM VM : HEAD : 1820 ; Patched : 1549 => 17.5% improvement x86 VM : HEAD : 1020 ; Patched : 869 => 17.4% improvement Inlining just exec_stmt() showed the improvement mainly on the arm64 VM (7.4%). For x86, it was 2.7% But inlining exec_stmt() and exec_cast_value() together showed benefits on both machines, as can be seen above. -- Thanks, -Amit Khandekar Huawei Technologies From 66c607ef6f0b7b655819b4b19383e024c5f8788c Mon Sep 17 00:00:00 2001 From: Amit Khandekar Date: Sat, 23 May 2020 21:39:41 +0800 Subject: [PATCH 1/3] Modularize code in toplevel pl/pgsql block callers Functions that call exec_stmt() for executing the toplevel block have a repetitive code that is now moved into a common function exec_toplevel_block(). This in preparation for further changes in this part of code. 
---
 src/pl/plpgsql/src/pl_exec.c | 86
 1 file changed, 38 insertions(+), 48 deletions(-)

diff --git a/src/pl/plpgsql/src/pl_exec.c b/src/pl/plpgsql/src/pl_exec.c
index 9a87cd70f1..0a70ceddbb 100644
--- a/src/pl/plpgsql/src/pl_exec.c
+++ b/src/pl/plpgsql/src/pl_exec.c
@@ -267,6 +267,9 @@ static MemoryContext get_stmt_mcontext(PLpgSQL_execstate *estate);
 static void push_stmt_mcontext(PLpgSQL_execstate *estate);
 static void pop_stmt_mcontext(PLpgSQL_execstate *estate);
+static void exec_toplevel_block(PLpgSQL_execstate *estate,
+								PLpgSQL_stmt_block *block,
+								char *object_type, char *err_text);
 static int exec_stmt_block(PLpgSQL_execstate *estate,
 						   PLpgSQL_stmt_block *block);
 static int exec_stmts(PLpgSQL_execstate *estate,
@@ -475,7 +478,6 @@ plpgsql_exec_function(PLpgSQL_function *func, FunctionCallInfo fcinfo,
 	PLpgSQL_execstate estate;
 	ErrorContextCallback plerrcontext;
 	int			i;
-	int			rc;

 	/*
 	 * Setup the execution state
@@ -605,23 +607,8 @@ plpgsql_exec_function(PLpgSQL_function *func, FunctionCallInfo fcinfo,
 	/*
 	 * Now call the toplevel block of statements
 	 */
-	estate.err_text = NULL;
-	estate.err_stmt = (PLpgSQL_stmt *) (func->action);
-	rc = exec_stmt(&estate, (PLpgSQL_stmt *) func->action);
-	if (rc != PLPGSQL_RC_RETURN)
-	{
-		estate.err_stmt = NULL;
-		estate.err_text = NULL;
-		ereport(ERROR,
-				(errcode(ERRC
pgbench testing with contention scenarios
Subject changed. Earlier it was: spin_delay() for ARM

On Fri, 17 Apr 2020 at 22:54, Robert Haas wrote:
>
> On Thu, Apr 16, 2020 at 3:18 AM Amit Khandekar wrote:
> > Not relevant to the PAUSE stuff: Note that when the parallel
> > clients reach from 24 to 32 (which equals the machine CPUs), the TPS
> > shoots from 454189 to 1097592, which is more than double the speed gain
> > with just a 30% increase in parallel sessions.

For reference, the TPS figures can be seen here:
https://www.postgresql.org/message-id/CAJ3gD9e86GY%3DQfyfZQkb11Z%2BCVWowDiGgGThzKKwHDGU9uA2yA%40mail.gmail.com

> I've seen stuff like this too. For instance, check out the graph from
> this 2012 blog post:
>
> http://rhaas.blogspot.com/2012/04/did-i-say-32-cores-how-about-64.html
>
> You can see that the performance growth is basically on a straight
> line up to about 16 cores, but then it kinks downward until about 28,
> after which it kinks sharply upward until about 36 cores.
>
> I think this has something to do with the process scheduling behavior
> of Linux, because I vaguely recall some discussion where somebody did
> benchmarking on the same hardware on both Linux and one of the BSD
> systems, and the effect didn't appear on BSD. They had other problems,
> like a huge drop-off at higher core counts, but they didn't have that
> effect.

Ah, I see. By the way, I have observed this behaviour on both x86 and ARM, regardless of whether the number of CPUs is as low as 8 or as high as 32. But in my case, I suspect it's a combination of the Linux scheduler and interactions between the backends and the pgbench clients. So I used a custom script: the same point query that is used with the -S option, but run again and again on the server side without the client having to send it repeatedly, so the pgbench clients are idle most of the time.
Query used : select foo(30);
where foo(int) is defined as :

create or replace function foo(iterations int) returns int as $$
declare
   id int;
   ret int;
   counter int = 0;
begin
   WHILE counter < iterations LOOP
      counter = counter + 1;
      id = random() * 300;
      select into ret aid from pgbench_accounts where aid = id;
   END LOOP;
   return ret;
end
$$ language plpgsql;

Below are results for 30 scale factor, with 8 CPUs :

Clients   TPS
2         1.255327
4         2.414139
6         3.532937
8         4.586583
10        4.557575
12        4.517226
14        4.551455
18        4.593271

You can see that the TPS rise is almost linearly proportional to the increase in clients, with no deviation in between, up to 8 clients, beyond which it does not rise because the CPUs are fully utilized. In this custom case as well, the behaviour is the same on both x86 and ARM, regardless of 8 CPUs or 32 CPUs.

--
Thanks,
-Amit Khandekar
Huawei Technologies
Re: spin_delay() for ARM
On Sat, 18 Apr 2020 at 03:30, Thomas Munro wrote: > > On Sat, Apr 18, 2020 at 2:00 AM Ants Aasma wrote: > > On Thu, 16 Apr 2020 at 10:33, Pavel Stehule wrote: > > > what I know, pgbench cannot be used for testing spinlocks problems. > > > > > > Maybe you can see this issue when a) use higher number clients - > > > hundreds, thousands. Decrease share memory, so there will be press on > > > related spin lock. > > > > There really aren't many spinlocks left that could be tickled by a > > normal workload. I looked for a way to trigger spinlock contention > > when I prototyped a patch to replace spinlocks with futexes. The only > > one that I could figure out a way to make contended was the lock > > protecting parallel btree scan. A highly parallel index only scan on a > > fully cached index should create at least some spinlock contention. > > I suspect the snapshot-too-old "mutex_threshold" spinlock can become > contended under workloads that generate a high rate of > heap_page_prune_opt() calls with old_snapshot_threshold enabled. One > way to do that is with a bunch of concurrent index scans that hit the > heap in random order. Some notes about that: > > https://www.postgresql.org/message-id/flat/CA%2BhUKGKT8oTkp5jw_U4p0S-7UG9zsvtw_M47Y285bER6a2gD%2Bg%40mail.gmail.com Thanks all for the inputs. Will keep these two particular scenarios in mind, and try to get some bandwidth on this soon. -- Thanks, -Amit Khandekar Huawei Technologies
Re: spin_delay() for ARM
On Mon, 13 Apr 2020 at 20:16, Amit Khandekar wrote:
> On Sat, 11 Apr 2020 at 04:18, Tom Lane wrote:
> >
> > I wrote:
> > > A more useful test would be to directly experiment with contended
> > > spinlocks. As I recall, we had some test cases laying about when
> > > we were fooling with the spin delay stuff on Intel --- maybe
> > > resurrecting one of those would be useful?
> >
> > The last really significant performance testing we did in this area
> > seems to have been in this thread:
> >
> > https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com
> >
> > A relevant point from that is Haas' comment
> >
> > I think optimizing spinlocks for machines with only a few CPUs is
> > probably pointless. Based on what I've seen so far, spinlock
> > contention even at 16 CPUs is negligible pretty much no matter what
> > you do. Whether your implementation is fast or slow isn't going to
> > matter, because even an inefficient implementation will account for
> > only a negligible percentage of the total CPU time - much less than 1%
> > - as opposed to a 64-core machine, where it's not that hard to find
> > cases where spin-waits consume the *majority* of available CPU time
> > (recall previous discussion of lseek).
>
> Yeah, will check if I find some machines with large cores.

I got hold of a 32-CPU VM (actually it was 16-core, but being hyperthreaded, there were 32 CPUs). It was an Intel Xeon, 3GHz CPU, with 15G available memory. Hypervisor: KVM. Single NUMA node.
PG parameters changed : shared_buffers: 8G ; max_connections : 1000

I compared pgbench results on HEAD versus PAUSE removed like this :

perform_spin_delay(SpinDelayStatus *status)
{
-	/* CPU-specific delay each time through the loop */
-	SPIN_DELAY();

Ran with increasing number of parallel clients :
pgbench -S -c $num -j $num -T 60 -M prepared

But couldn't find any significant change in the TPS numbers with or without PAUSE :

Clients   HEAD      Without_PAUSE
8         26 247264
16        399939    399549
24        454189    453244
32        1097592   1098844
40        1090424   1087984
48        1068645   1075173
64        1035035   1039973
96        976578    970699

Maybe it will indeed show some difference only with around 64 cores, or perhaps a bare-metal machine will help; but as of now I didn't get such a machine. Anyway, I thought why not archive the results with whatever I have.

Not relevant to the PAUSE stuff: Note that when the parallel clients reach from 24 to 32 (which equals the machine CPUs), the TPS shoots from 454189 to 1097592, which is more than double the speed gain with just a 30% increase in parallel sessions. I was not expecting this much speed gain, because in the contended scenario the pgbench processes are already taking around 20% of the total CPU time of the pgbench run. Maybe later on I will get a chance to run with a customized pgbench script that runs a server function which keeps on running an index scan on pgbench_accounts, so as to make the pgbench clients almost idle.

Thanks
-Amit Khandekar
Re: spin_delay() for ARM
On Sat, 11 Apr 2020 at 04:18, Tom Lane wrote: > > I wrote: > > A more useful test would be to directly experiment with contended > > spinlocks. As I recall, we had some test cases laying about when > > we were fooling with the spin delay stuff on Intel --- maybe > > resurrecting one of those would be useful? > > The last really significant performance testing we did in this area > seems to have been in this thread: > > https://www.postgresql.org/message-id/flat/CA%2BTgmoZvATZV%2BeLh3U35jaNnwwzLL5ewUU_-t0X%3DT0Qwas%2BZdA%40mail.gmail.com > > A relevant point from that is Haas' comment > > I think optimizing spinlocks for machines with only a few CPUs is > probably pointless. Based on what I've seen so far, spinlock > contention even at 16 CPUs is negligible pretty much no matter what > you do. Whether your implementation is fast or slow isn't going to > matter, because even an inefficient implementation will account for > only a negligible percentage of the total CPU time - much less than 1% > - as opposed to a 64-core machine, where it's not that hard to find > cases where spin-waits consume the *majority* of available CPU time > (recall previous discussion of lseek). Yeah, will check if I find some machines with large cores. > So I wonder whether this patch is getting ahead of the game. It does > seem that ARM systems with a couple dozen cores exist, but are they > common enough to optimize for yet? Can we even find *one* to test on > and verify that this is a win and not a loss? (Also, seeing that > there are so many different ARM vendors, results from just one > chipset might not be too trustworthy ...) Ok. Yes, it would be worth waiting to see if there are others in the community with ARM systems that have implemented YIELD. May be after that we might gain some confidence. I myself also hope that I will get one soon to test, but right now I have one that does not support it, so it will be just a no-op. 
--
Thanks,
-Amit Khandekar
Huawei Technologies
Re: spin_delay() for ARM
On Sat, 11 Apr 2020 at 00:47, Andres Freund wrote:
>
> Hi,
>
> On 2020-04-10 13:09:13 +0530, Amit Khandekar wrote:
> > On my Intel Xeon machine with 8 cores, I tried to test PAUSE also
> > using a sample C program (attached spin.c). Here, many child processes
> > (much more than CPUs) wait in a tight loop for a shared variable to
> > become 0, while the parent process continuously increments a sequence
> > number for a fixed amount of time, after which, it sets the shared
> > variable to 0. The child's tight loop calls PAUSE in each iteration.
> > What I hoped was that because of PAUSE in children, the parent process
> > would get more share of the CPU, due to which, in a given time, the
> > sequence number will reach a higher value. Also, I expected the CPU
> > cycles spent by child processes to drop down, thanks to PAUSE. None of
> > these happened. There was no change.
>
> > Possibly, this testcase is not right. Probably the process preemption
> > occurs only within the set of hyperthreads attached to a single core.
> > And in my testcase, the parent process is the only one who is ready to
> > run. Still, I have anyway attached the program (spin.c) for archival;
> > in case somebody with a YIELD-supporting ARM machine wants to use it
> > to test YIELD.
>
> PAUSE doesn't operate on the level of the CPU scheduler. So the OS won't
> just schedule another process - you won't see different CPU usage if you
> measure it purely as the time running.

Yeah, I thought that the OS scheduling would be an *indirect* consequence of the PAUSE because of its slowing down the CPU, but it looks like that does not happen.

> You should be able to see a
> difference if you measure with a profiler that shows you data from the
> CPU's performance monitoring unit.

Hmm, I tried with perf and could see the pause itself consuming 5% CPU. But I haven't yet played with per-process figures.

--
Thanks,
-Amit Khandekar
Huawei Technologies
spin_delay() for ARM
Hi,

We use (an equivalent of) the PAUSE instruction in spin_delay() for Intel architectures. The goal is to slow down the spinlock tight loop and thus prevent it from eating CPU and causing CPU starvation, so that other processes get their fair share of the CPU time. Intel documentation [1] clearly mentions this, along with other benefits of PAUSE, like low power consumption and avoidance of memory order violations when exiting the loop.

Similar to PAUSE, the ARM architecture has the YIELD instruction, which is also clearly documented [2]. It explicitly says that it is a way to hint to the CPU that it is being called in a spinlock loop and that this process can be preempted. But for ARM, we are not using any kind of spin delay. For PG spinlocks, the goal of both of these instructions is the same, and both architectures recommend using them in spinlock loops. Also, I found multiple places where YIELD is already used in the same situations: the Linux kernel [3]; OpenJDK [4],[5].

On ARM implementations that don't implement YIELD, it runs as a no-op. Unfortunately, the ARM machine I have does not implement YIELD. But recently there have been some ARM implementations that are hyperthreaded, so they are expected to actually do the YIELD, although the docs do not explicitly say that YIELD has to be implemented only by hyperthreaded implementations.

I ran some pgbench tests to test PAUSE/YIELD on the respective architectures, once with the instruction present and once with it removed. I didn't see a change in the TPS numbers; they were more or less the same. For ARM, this was expected because my ARM machine does not implement it.

On my Intel Xeon machine with 8 cores, I tried to test PAUSE also using a sample C program (attached spin.c).
Here, many child processes (many more than CPUs) wait in a tight loop for a shared variable to become 0, while the parent process continuously increments a sequence number for a fixed amount of time, after which it sets the shared variable to 0. The child's tight loop calls PAUSE in each iteration. What I hoped was that because of the PAUSE in the children, the parent process would get a larger share of the CPU, due to which, in a given time, the sequence number would reach a higher value. Also, I expected the CPU cycles spent by the child processes to drop, thanks to PAUSE. Neither of these happened. There was no change.

Possibly this testcase is not right. Probably the process preemption occurs only within the set of hyperthreads attached to a single core, and in my testcase the parent process is the only one that is ready to run. Still, I have attached the program (spin.c) for archival, in case somebody with a YIELD-supporting ARM machine wants to use it to test YIELD.

Nevertheless, I think that because we have clear documentation that strongly recommends using it, and because it has been used in other use-cases such as the Linux kernel and the JDK, we should start using YIELD for spin_delay() on ARM. Attached is the trivial patch (spin_delay_for_arm.patch). To start with, it contains changes only for aarch64. I haven't yet added changes in configure[.in] for making sure yield compiles successfully (YIELD is present in manuals from ARMv6 onwards). Before that, I thought of getting some comments, so I didn't do the configure changes yet.
[1] https://c9x.me/x86/html/file_module_x86_id_232.html
[2] https://developer.arm.com/docs/100076/0100/instruction-set-reference/a64-general-instructions/yield
[3] https://elixir.bootlin.com/linux/latest/source/arch/arm64/include/asm/processor.h#L259
[4] http://cr.openjdk.java.net/~dchuyko/8186670/yield/spinwait.html
[5] http://mail.openjdk.java.net/pipermail/aarch64-port-dev/2017-August/004880.html

--
Thanks,
-Amit Khandekar
Huawei Technologies

/*
 * Sample program to test the effect of PAUSE/YIELD instruction in a highly
 * contended scenario. The Intel and ARM docs recommend the use of PAUSE and
 * YIELD respectively, in spinlock tight loops.
 *
 * This program can be run with :
 * gcc -O3 -o spin spin.c -lrt ; ./spin [number_of_processes]
 * By default, 4 processes are spawned.
 *
 * Child processes wait in a tight loop for a shared variable to become 0,
 * while the parent process continuously increments a sequence number for a
 * fixed amount of time, after which, it sets the shared variable to 0. The
 * child tight loop calls YIELD/PAUSE in each iteration.
 *
 * The intention is to create a number of processes much larger than the
 * available CPUs, so that the scheduler hopefully pre-empts the processes
 * because of the PAUSE, and the main process gets more CPU share because of
 * which it will increment its sequence number more number of times. So the
 * expectation is that with PAUSE, the program will end up with a much higher
 * sequence number than without PAUSE. Similarly, the child processes should
 * have lesser CPU cycles with PAUSE than without PAUSE.
 *
 * Author: Amit Khandekar
 */

#include
#include
#include
#include
Re: Minimal logical decoding on standbys
On Fri, 17 Jan 2020 at 13:20, Andreas Joseph Krogh wrote:
>
> På torsdag 16. januar 2020 kl. 05:42:24, skrev Amit Khandekar:
>
> On Fri, 10 Jan 2020 at 17:50, Rahila Syed wrote:
> >
> > Hi Amit,
> >
> > Can you please rebase the patches as they don't apply on latest master?
>
> Thanks for notifying. Attached is the rebased version.
>
> Will this patch enable logical replication from a standby-server?

Sorry for the late reply. This patch only supports logical *decoding* on a standby; it is just the infrastructure for supporting logical replication from a standby. We don't support creating a publication on a standby, but a publication created on the master is replicated to the standby, so we might be able to create subscription nodes that connect to those existing publications via the standby; basically, though, we haven't tested whether the publication/subscription model works with a publication on a physical standby. This patch is focussed on providing a way to continue logical replication *after* the standby is promoted to master.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: Minimal logical decoding on standbys
On Fri, 10 Jan 2020 at 17:50, Rahila Syed wrote: > > Hi Amit, > > Can you please rebase the patches as they don't apply on latest master? Thanks for notifying. Attached is the rebased version. logicaldecodng_standby_v5_rebased.tar.gz Description: GNU Zip compressed data
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Sun, 5 Jan 2020 at 00:21, Noah Misch wrote:
> The buildfarm client can capture stack traces, but it currently doesn't do so
> for TAP test suites (search the client code for get_stack_trace). If someone
> feels like writing a fix for that, it would be a nice improvement. Perhaps,
> rather than having the client code know all the locations where core files
> might appear, failed runs should walk the test directory tree for core files?

I think this might end up with the same directory-walking code spread across multiple files. Instead, I think that in the build script, in get_stack_trace(), we can do an equivalent of find -name "*core*", as against the current way in which it looks for core files only in the specific data directory. So get_stack_trace(bindir, datadir) would change to get_stack_trace(bindir, input_dir), where input_dir can be any directory that may contain multiple data directories. E.g. a recovery test can create multiple instances, so there would be multiple data directories inside the test directory.

Noah, is it possible to run a patched build script once I submit a patch, so that we can quickly get the stack trace? I mean, can we do this before getting the patch committed? I guess we can run the build script with a single branch specified, right?

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Fri, 3 Jan 2020 at 10:19, Amit Kapila wrote: > > On Fri, Jan 3, 2020 at 8:29 AM Amit Kapila wrote: >> >> On Thu, Jan 2, 2020 at 5:44 PM Amit Kapila wrote: >>> >>> On Tue, Dec 24, 2019 at 2:31 PM Amit Khandekar >>> wrote: >>> > >>> > >>> > Ok. I tested pg94_95_use_vfd_for_logrep.patch for 9.6 branch, and it >>> > works there. So please use this patch for all the three branches. >>> > >>> >>> Pushed! >> >> >> >> I see one failure in REL_10_STABLE [1] which seems to be due to this commit: >> > > I tried this test on my CentOs and Power8 machine more than 50 times, but > couldn't reproduce it. So, adding Noah to see if he can try this test [1] on > his machine (tern) and get stack track or some other information? > > [1] - make -C src/test/recovery/ check PROVE_TESTS=t/006_logical_decoding.pl I also tested multiple times using PG 10 branch; also tried to inject an error so that PG_CATCH related code also gets covered, but unfortunately didn't get the crash on my machine. I guess, we will have to somehow get the stacktrace. > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
On Tue, 24 Dec 2019 at 14:02, Amit Khandekar wrote:
>
> On Thu, 19 Dec 2019 at 01:02, Rahila Syed wrote:
> >
> > Hi,
> >
> >> Hi, do you consistently get this failure on your machine ? I am not
> >> able to get this failure, but I am going to analyze when/how this can
> >> fail. Thanks
> >
> > Yes, I am getting it each time I run make -C src/test/recovery/ check
> > PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl
> > Also, there aren't any errors in logs indicating the cause.
>
> Thanks for the reproduction. Finally I could reproduce the behaviour.
> It occurs once in 7-8 runs of the test on my machine. The issue is :
> on master, the catalog_xmin does not immediately get updated. It
> happens only after the hot standby feedback reaches the master. And I
> haven't used wait_for_xmins() for these failing cases. I should use
> that. Working on the same ...

As mentioned above, I have used wait_for_xmins() so that we can wait for the xmins to be updated after the hot standby feedback is processed. In one of the 3 scenarios where it failed for you, I removed the check at the second place because it was redundant. At the 3rd place, I made some appropriate changes with detailed comments; please check. Basically, we are checking that the master's physical-slot catalog_xmin has advanced, but not beyond the standby's logical-slot catalog_xmin. And to make sure the master's xmins are updated, I call txid_current() and then wait for the master's xmin to advance after hot standby feedback; this way I make sure the xmin/catalog_xmins are up-to-date because of hot standby feedback, so that we can check whether the master's physical slot catalog_xmin has reached the value of the standby's catalog_xmin but not gone past it.

I have also moved the "wal_receiver_status_interval = 1" setting from the master to the standby; it was wrongly kept on the master. This now reduces the test time by half on my machine.

Attached patch set v5 has only the test changes. Please check whether the test still fails for you.
> > -- > Thanks, > -Amit Khandekar > EnterpriseDB Corporation > The Postgres Database Company -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company logicaldecodng_standby_v5.tar.gz Description: GNU Zip compressed data
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Tue, 24 Dec 2019 at 10:41, Amit Kapila wrote: > > On Fri, Dec 20, 2019 at 9:31 AM Amit Khandekar wrote: > > Attached are the patches from master back up to 94 branch. > > > > PG 9.4 and 9.5 have a common patch to be applied : > > pg94_95_use_vfd_for_logrep.patch > > From PG 9.6 onwards, each version has a separate patch. > > > > For PG 9.6, there is no logical decoding perl test file. So I have > > made a new file 006_logical_decoding_spill.pl that has only the > > specific testcase. Also, for building the test_decoding.so, I had to > > add the EXTRA_INSTALL=contrib/test_decoding line in the > > src/test/recovery/Makefile, because this is the first time we are > > using the plugin in the 9.6 tap test. > > > > I am not sure if we need to go that far for 9.6 branch. If the other > tests for logical decoding are not present, then I don't see why we > need to create a new test file for this test only. Also, I think this > will make the patch the same for 9.4,9.5 and 9.6. Ok. I tested pg94_95_use_vfd_for_logrep.patch for 9.6 branch, and it works there. So please use this patch for all the three branches. > > > From PG 10 onwards, pgstat_report_*() calls around read() are removed > > in the patch, because FileRead() itself reports the wait events. > > > > Why there are different patches for 10 and 11? For PG10, OpenTransientFile() and PathNameOpenFile() each have an extra parameter for specifying file creation modes such as S_IRUSR or S_IWUSR. For 11, these functions don't accept the flags, rather the file is always opened with PG_FILE_MODE_OWNER. Because of these differences in the calls, the PG 10 patch does not apply on 11. > > We should try to minimize the difference between patches in different > branches. Ok. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
On Thu, 19 Dec 2019 at 01:02, Rahila Syed wrote: > > Hi, > >> Hi, do you consistently get this failure on your machine ? I am not >> able to get this failure, but I am going to analyze when/how this can >> fail. Thanks >> > Yes, I am getting it each time I run make -C src/test/recovery/ check > PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl > Also, there aren't any errors in logs indicating the cause. Thanks for the reproduction. Finally I could reproduce the behaviour. It occurs once in 7-8 runs of the test on my machine. The issue is : on master, the catalog_xmin does not immediately get updated. It happens only after the hot standby feedback reaches on master. And I haven't used wait_for_xmins() for these failing cases. I should use that. Working on the same ... -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Thu, 19 Dec 2019 at 11:59, Amit Kapila wrote:
>
> On Wed, Dec 18, 2019 at 12:34 PM Amit Khandekar wrote:
> >
> > On Tue, 17 Dec 2019 at 17:40, Amit Khandekar wrote:
> > > By the way, the backport patch is turning out to be simpler. It's
> > > because in pre-12 versions, the file offset is part of the Vfd
> > > structure, so all the offset handling is not required.
> >
> > Please have a look at the attached backport patch for the PG 11 branch.
> > Once you are ok with the patch, I will port it to the other branches.
> > Note that in the patch, wherever applicable I have renamed the fd
> > variable to vfd to signify that it is a vfd, and not the kernel fd. If
> > we don't do the renaming, the patch would be still smaller, but I
> > think the renaming makes sense.
>
> The other usage of PathNameOpenFile in md.c is already using 'fd' as a
> variable name (also, if you see the example in fd.h, that also uses fd as
> the variable name), so I don't see any problem with using fd, especially if
> that leads to lesser changes.

Ok. I have retained the fd name.

> Apart from that, your patch LGTM.

Attached are the patches from master back up to the 9.4 branch.

PG 9.4 and 9.5 have a common patch to be applied :
pg94_95_use_vfd_for_logrep.patch

From PG 9.6 onwards, each version has a separate patch.

For PG 9.6, there is no logical decoding perl test file. So I have made a new file 006_logical_decoding_spill.pl that has only the specific testcase. Also, for building the test_decoding.so, I had to add the EXTRA_INSTALL=contrib/test_decoding line in src/test/recovery/Makefile, because this is the first time we are using the plugin in the 9.6 tap tests.

From PG 10 onwards, the pgstat_report_*() calls around read() are removed in the patch, because FileRead() itself reports the wait events.

From PG 12 onwards, the vfd offset handling had to be added, because the offset is not present in the Vfd structure. In master, logical_decoding_work_mem is used in the test file.
-- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company use_vfd_for_logrep_patches.tar.gz Description: GNU Zip compressed data
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Tue, 17 Dec 2019 at 17:40, Amit Khandekar wrote:
> By the way, the backport patch is turning out to be simpler. It's
> because in pre-12 versions, the file offset is part of the Vfd
> structure, so all the offset handling is not required.

Please have a look at the attached backport patch for the PG 11 branch. Once you are ok with the patch, I will port it to the other branches. Note that in the patch, wherever applicable, I have renamed the fd variable to vfd to signify that it is a vfd and not the kernel fd. If we don't do the renaming, the patch would be still smaller, but I think the renaming makes sense.

The recovery TAP tests don't seem to be there on the 9.4 and 9.5 branches, so I think it's ok to not have any tests with the patches on those branches.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

pg11_use_vfd_for_logrep.patch
Description: Binary data
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Mon, 16 Dec 2019 at 16:52, Amit Kapila wrote: > > On Mon, Dec 16, 2019 at 3:26 PM Amit Khandekar wrote: > > > > On Sat, 14 Dec 2019 at 11:59, Amit Kapila wrote: > > > > > > I have also made minor changes related to below code in patch: > > > - else if (readBytes != sizeof(ReorderBufferDiskChange)) > > > + > > > + file->curOffset += readBytes; > > > + > > > + if (readBytes != > > > sizeof(ReorderBufferDiskChange)) > > > > > > Why the size is added before the error check? > > The logic was : even though it's an error that the readBytes does not > > match the expected size, the file read is successful so update the vfd > > offset as early as possible. In our case, this might not matter much, > > but who knows, in the future, in the exception block (say, in > > ReorderBufferIterTXNFinish, someone assumes that the file offset is > > correct and does something with that, then we will get in trouble, > > although I agree that it's very unlikely. > > > > I am not sure if there is any such need, but even if it is there, I > think updating after a *short* read (read less than expected) doesn't > seem like a good idea because there is clearly some problem with the > read call. Also, in the case below that case where we read the actual > change data, the offset is updated after the check of *short* read. I > don't see any advantage in such an inconsistency. I still feel it is > better to update the offset after all error checks. Ok, no problem; I don't see any harm in doing the updates after the size checks. By the way, the backport patch is turning out to be simpler. It's because in pre-12 versions, the file offset is part of the Vfd structure, so all the offset handling is not required. > > > > > > and see if you can run perltidy for the test file. > > Hmm, I tried perltidy, and it seems to mostly add a space after ( and > > a space before ) if there's already; so "('postgres'," is replaced by > > "( 'postgres',". 
And this is going to be inconsistent with > > other places. And it replaces tab with spaces. Do you think we should > > try perltidy, or have we before been using this tool for the tap tests > > ? > > > > See text in src/test/perl/README (Note that all tests and test tools > should have perltidy run on them before patches are submitted, using > perltidy - profile=src/tools/pgindent/perltidyrc). It is recommended > to use perltidy. > > Now, if it is making the added code inconsistent with nearby code, > then I suggest to leave it. In many places, it is becoming inconsistent, but will see if there are some places where it does make sense and does not break consistency. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Sat, 14 Dec 2019 at 11:59, Amit Kapila wrote: > > On Thu, Dec 12, 2019 at 9:50 PM Amit Khandekar wrote: > > > > Attached is a v4 patch that also addresses your code comments so far. > > I have included the test case in 006_logical_decoding.pl. I observed > > that the test case just adds only about 0.5 to 1 sec time. Please > > verify on your env also, and also whether the test reproduces the > > issue without the code changes. > > > > It takes roughly the same time on my machine as well. I have checked > on Windows as well, it increases the time from 14 to 16 (17) seconds > for this test. I don't think this is any big increase considering the > timing of other tests and it would be good to have a test for such > boundary conditions. I have slightly change the comments in the patch > and ran pgindent. Attached, find the patch with a proposed commit > message. > > I have also made minor changes related to below code in patch: > - else if (readBytes != sizeof(ReorderBufferDiskChange)) > + > + file->curOffset += readBytes; > + > + if (readBytes != > sizeof(ReorderBufferDiskChange)) > > Why the size is added before the error check? The logic was : even though it's an error that the readBytes does not match the expected size, the file read is successful so update the vfd offset as early as possible. In our case, this might not matter much, but who knows, in the future, in the exception block (say, in ReorderBufferIterTXNFinish, someone assumes that the file offset is correct and does something with that, then we will get in trouble, although I agree that it's very unlikely. But IMO, because we want to simulate the file offset support in vfd, we should update the file offset immediately after a file read is known to have succeeded. > I think it should be > after that check, so changed accordingly. Similarly, I don't see why > we need to change 'else if' to 'if' in this code, so changed back. 
Since adding the size before the error check required removing the else-if, I removed the else-if at the surrounding places as well, for consistency. > > I think we need to change/tweak the test for back branches as there we > don't have logical_decoding_work_mem. Can you please look into that Yeah, I believe we need to backport up to PG 9.4 where logical decoding was introduced, so I am first trying out with the 9.4 branch. > and see if you can run perltidy for the test file. Hmm, I tried perltidy, and it seems to mostly add a space after ( and a space before ) if there's already one; so "('postgres'," is replaced by "( 'postgres',". And this is going to be inconsistent with other places. And it replaces tabs with spaces. Do you think we should try perltidy, or have we been using this tool for the TAP tests before? -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
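The ordering argument in the message above — advance the tracked offset as soon as the read itself succeeds, and only then validate the size of what was read — can be sketched in miniature. This is illustrative C, not the actual PostgreSQL code; the struct and function names here (SpillFile, spill_read, read_record) are made up for the example:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Stand-in for a spill file plus a userspace-tracked offset,
 * the way the patch's TXNEntryFile tracks curOffset for a vfd. */
typedef struct SpillFile
{
    const char *data;       /* stand-in for on-disk contents */
    size_t      len;
    size_t      curOffset;  /* tracked offset, advanced after each read */
} SpillFile;

/* Read up to nbytes from the file; returns the number of bytes read. */
static size_t
spill_read(SpillFile *file, char *buf, size_t nbytes)
{
    size_t  avail = file->len - file->curOffset;
    size_t  readBytes = nbytes < avail ? nbytes : avail;

    memcpy(buf, file->data + file->curOffset, readBytes);

    /*
     * Advance the offset as soon as the read itself is known to have
     * succeeded.  Whether the caller later decides readBytes is the
     * "wrong" size is a separate question; the offset should reflect
     * what was actually consumed.
     */
    file->curOffset += readBytes;
    return readBytes;
}

/* Caller-side check, mirroring the short-read validation discussed above. */
static bool
read_record(SpillFile *file, char *buf, size_t expected)
{
    size_t  readBytes = spill_read(file, buf, expected);

    /* The offset is already up to date even if this check fails. */
    return readBytes == expected;
}
```

Even on the failing (short-read) path, the tracked offset matches what was actually consumed, which is the property the email argues an exception handler might one day rely on.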
Re: Minimal logical decoding on standbys
On Thu, 12 Dec 2019 at 15:28, Rahila Syed wrote: > > Hi Amit, > > > 2. Following test fails with error. > make -C src/test/recovery/ check > PROVE_TESTS=t/018_standby_logical_decoding_xmins.pl > # Failed test 'physical catalog_xmin not null' > # at t/018_standby_logical_decoding_xmins.pl line 120. > # got: '' > # expected: anything else > > # Failed test 'physical catalog_xmin not null' > # at t/018_standby_logical_decoding_xmins.pl line 141. > # got: '' > # expected: anything else > > # Failed test 'physical catalog_xmin not null' > # at t/018_standby_logical_decoding_xmins.pl line 159. > # got: '' > # expected: anything else > t/018_standby_logical_decoding_xmins.pl .. 20/27 # poll_query_until timed out > executing this query: > # > > Physical catalog_xmin is NULL on master after logical slot creation on > standby. Hi, do you consistently get this failure on your machine? I am not able to get this failure, but I am going to analyze when/how this can fail. Thanks -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Thu, 12 Dec 2019 at 14:18, Amit Kapila wrote: > > On Thu, Dec 12, 2019 at 11:53 AM Amit Khandekar > wrote: > > > > On Thu, 12 Dec 2019 at 11:34, Amit Khandekar wrote: > > > > > So max_changes_in_memory is the one > > > that allows us to reduce the number of transactions required, so we > > > can cut down on the outer loop iterations and make the test finish > > > much earlier. > > > > > > > > But also note that, we can't use the test suite in > > > contrib/test_decoding, because max_changes_in_memory needs server > > > restart. So we need to shift this test to src/test/recovery. And > > > there, I guess it is not that critical for the testcase to be very > > > quick because the tests in general are much slower than the ones in > > > contrib/test_decoding, although it would be nice to make it fast. What > > > I propose is to modify max_changes_in_memory, do a server restart > > > (which takes hardly a sec), run the testcase (3.5 sec) and then > > > restart after resetting the guc. So totally it will be around 4-5 > > > seconds. > > > > Sorry I meant max_files_per_process. > > > > Okay, how much time do other individual tests take in that directory on your > machine? For the src/test/recovery directory, a test takes about 4-5 seconds on average. > How about providing a separate test patch for this case so > that I can also test it? Attached is a v4 patch that also addresses your code comments so far. I have included the test case in 006_logical_decoding.pl. I observed that the test case adds only about 0.5 to 1 second. Please verify on your env too, and also check whether the test reproduces the issue without the code changes. > I think we can take the opinion of others as > well if they are fine with adding this test, otherwise, we can go > ahead with the main patch. Sure. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company use_vfd_for_logrep_v4.patch Description: Binary data
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Thu, 12 Dec 2019 at 11:34, Amit Khandekar wrote: > So max_changes_in_memory is the one > that allows us to reduce the number of transactions required, so we > can cut down on the outer loop iterations and make the test finish > much earlier. > > But also note that, we can't use the test suite in > contrib/test_decoding, because max_changes_in_memory needs server > restart. So we need to shift this test to src/test/recovery. And > there, I guess it is not that critical for the testcase to be very > quick because the tests in general are much slower than the ones in > contrib/test_decoding, although it would be nice to make it fast. What > I propose is to modify max_changes_in_memory, do a server restart > (which takes hardly a sec), run the testcase (3.5 sec) and then > restart after resetting the guc. So totally it will be around 4-5 > seconds. Sorry I meant max_files_per_process. We need to reduce max_files_per_process, so that it causes max_safe_fds to be reduced, and so only a few transactions are sufficient to reproduce the problem, because the reserveAllocatedDesc() will return false much sooner due to low max_safe_fds. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
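The mechanism exploited in the message above — lowering max_files_per_process so that max_safe_fds shrinks and the descriptor reservation fails after only a handful of spill files — can be illustrated with a toy version of the reservation check. The constants and the exact rule here are simplified stand-ins; PostgreSQL's real logic in fd.c differs in detail:

```c
#include <assert.h>
#include <stdbool.h>

/* Pretend max_files_per_process was lowered, shrinking max_safe_fds. */
static int max_safe_fds = 26;
static int numAllocatedDescs = 0;

/* Roughly the decision reserveAllocatedDesc() makes: is there room for
 * one more allocated descriptor, keeping some headroom for normal
 * backend use?  (The real headroom rule in fd.c is different.) */
static bool
reserve_allocated_desc(void)
{
    if (numAllocatedDescs >= max_safe_fds - 10)
        return false;           /* budget exhausted: open would fail */
    numAllocatedDescs++;
    return true;
}
```

With a budget this small, only a few open spill files are needed before the reservation starts failing, which is why lowering the GUC lets the test reproduce the "exceeded maxAllocatedDescs" symptom with ~20 sub-transactions instead of 600.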
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Thu, 12 Dec 2019 at 09:49, Amit Kapila wrote: > > On Wed, Dec 11, 2019 at 4:17 PM Amit Khandekar wrote: > > > > On Sat, 7 Dec 2019 at 11:37, Amit Kapila wrote: > > > > > > On Fri, Dec 6, 2019 at 5:00 PM Amit Khandekar > > > wrote: > > > > > > > > On Fri, 6 Dec 2019 at 15:40, Amit Kapila > > > > wrote: > > > > > > > > > > 1. > > > > > + /* Now that the state fields are initialized, it is safe to return > > > > > it. */ > > > > > + *iter_state = state; > > > > > + > > > > > /* allocate heap */ > > > > > state->heap = > > > > > binaryheap_allocate(state->nr_txns, > > > > > ReorderBufferIterCompare, > > > > > > > > > > Is there a reason for not initializing iter_state after > > > > > binaryheap_allocate? If we do so, then we don't need the additional check > > > > > you have added in ReorderBufferIterTXNFinish. > > > > > > > > If iter_state is initialized *after* binaryheap_allocate, then we > > > > won't be able to close the vfds if binaryheap_allocate() ereports(). > > > > > > > > > > Is it possible to have vfds opened before binaryheap_allocate(), if so > > > how? > > No, it does not look possible for the vfds to be opened before > > binaryheap_allocate(). But actually, the idea behind placing the > > iter_state at the place where I put it is that we should return the > > iter_state at the *earliest* place in the code where it is safe to > > return. > > > > Sure, I get that point, but it seems it is equally good to do this > after binaryheap_allocate(). It will be slightly better because if > there is any error in binaryheap_allocate, then we don't need to even > call ReorderBufferIterTXNFinish(). All right. Will do that. > > > > > >> 4. I think we should also check how much time increase will happen for > > >> test_decoding regression test after the test added by this patch? > > Amit Khandekar wrote: > > > Yeah, it currently takes noticeably longer compared to the others. 
> > > Let's see if setting logical_decoding_work_mem to a min value allows > > > us to reproduce the test with much lesser number of inserts. > > > > The test in the patch takes around 20 seconds, as compared to the max > > time of 2 seconds any of the other tests take in that test suite. > > > > But if we set the max_files_per_process to a very low value (say 26) > > as against the default 1000, we can reproduce the issue with as low as > > 20 sub-transactions as against the 600 that I used in spill.sql test. > > And with this, the test runs in around 4 seconds, so this is good. > > > > Do you get 4s even after setting the minimum value of > logical_decoding_work_mem? I think 4s is also too much for this test > which is going to test one extreme scenario. Can you try with some > bigger rows or something else to reduce this time? I think if we can > get it close to 2s or whatever maximum time taken by any other logical > decoding tests, then good, otherwise, it doesn't seem like a good idea > to add this test. > > > But > > the problem is : max_files_per_process needs server restart. So either > > we have to shift this test to src/test/recovery in one of the > > logical_decoding test, or retain it in contrib/test_decoding and let > > it run for 20 seconds. Let me know if you figure out any other > > approach. > > > > I don't think 20s for one test is acceptable. So, we should just give > up that path, you can try what I suggested above, if we can reduce the > test time, then good, otherwise, I suggest to drop the test from the > patch. Having said that, we should try our best to reduce the test > time as it will be good if we can have such a test. I tried; it is actually roughly 3.4 seconds. Note that reducing logical_decoding_work_mem does not reduce the test time. It causes the serialization to start early, and so increases the chance of reproducing the problem. During restore of the serialized data, we still use max_changes_in_memory. 
So max_changes_in_memory is the one that allows us to reduce the number of transactions required, so we can cut down on the outer loop iterations and make the test finish much earlier. But also note that, we can't use the test suite in contrib/test_decoding, because max_changes_in_memory needs server restart. So we need to shift this test to src/test/recovery. And there, I guess it is not that critical for the testcase to be very quick because the tests in general are much slower than the ones in contrib/test_decoding, although it would be nice to make it fast. What I propose is to modify max_changes_in_memory, do a server restart (which takes hardly a sec), run the testcase (3.5 sec) and then restart after resetting the guc. So totally it will be around 4-5 seconds. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Sat, 7 Dec 2019 at 11:37, Amit Kapila wrote: > > On Fri, Dec 6, 2019 at 5:00 PM Amit Khandekar wrote: > > > > On Fri, 6 Dec 2019 at 15:40, Amit Kapila wrote: > > > > > > > > > Few comments: > > > -- > > > > > > 1. > > > + /* Now that the state fields are initialized, it is safe to return it. > > > */ > > > + *iter_state = state; > > > + > > > /* allocate heap */ > > > state->heap = > > > binaryheap_allocate(state->nr_txns, > > > ReorderBufferIterCompare, > > > > > > Is there a reason for not initializing iter_state after > > > binaryheap_allocate? If we do so, then we don't need additional check > > > you have added in ReorderBufferIterTXNFinish. > > > > If iter_state is initialized *after* binaryheap_allocate, then we > > won't be able to close the vfds if binaryheap_allocate() ereports(). > > > > Is it possible to have vfds opened before binaryheap_allocate(), if so how? No, it does not look possible for the vfds to be opened before binaryheap_allocate(). But actually, the idea behind placing the iter_state at the place where I put it is that we should return the iter_state at the *earliest* place in the code where it is safe to return. > > I couldn't reproduce the original problem (on HEAD) reported with the > > test case in the patch. So, I can't verify the fix. I think it is > > because of recent commits cec2edfa7859279f36d2374770ca920c59c73dd8 and > > 9290ad198b15d6b986b855d2a58d087a54777e87. It seems you need to either > > change the value of logical_decoding_work_mem or change the test in > > some way so that the original problem can be reproduced. Amit Khandekar wrote: > Yeah, it does seem like the commit cec2edfa78592 must have caused the > test to not reproduce on your env, although the test does fail for me > still. Setting logical_decoding_work_mem to a low value does sound > like a good idea. Will work on it. I checked that setting logical_decoding_work_mem to its min value (64KB) causes early serialization. 
So I think if you set this, you should be able to reproduce with the spill.sql test that has the new testcase. I will anyway set this value in the test. Also check below ... >> 4. I think we should also check how much time increase will happen for >> test_decoding regression test after the test added by this patch? Amit Khandekar wrote: > Yeah, it currently takes noticeably longer compared to the others. > Let's see if setting logical_decoding_work_mem to a min value allows > us to reproduce the test with much lesser number of inserts. The test in the patch takes around 20 seconds, as compared to the max time of 2 seconds any of the other tests take in that test suite. But if we set the max_files_per_process to a very low value (say 26) as against the default 1000, we can reproduce the issue with as low as 20 sub-transactions as against the 600 that I used in spill.sql test. And with this, the test runs in around 4 seconds, so this is good. But the problem is : max_files_per_process needs server restart. So either we have to shift this test to src/test/recovery in one of the logical_decoding test, or retain it in contrib/test_decoding and let it run for 20 seconds. Let me know if you figure out any other approach. > > > > > > > 5. One naive question about the usage of PathNameOpenFile(). When it > > > reaches the max limit, it will automatically close one of the files, > > > but how will that be reflected in the data structure (TXNEntryFile) > > > you are managing. Basically, when PathNameOpenFile closes some file, > > > how will the corresponding vfd in TXNEntryFile be changed. Because if > > > it is not changed, then won't it start pointing to some wrong > > > filehandle? > > > > In PathNameOpenFile(), excess kernel fds could be closed > > (ReleaseLruFiles). But with that, the vfds themselves don't get > > invalidated. Only the underlying kernel fd gets closed, and the > > vfd->fd is marked VFD_CLOSED. 
The vfd array element remains valid (a > > non-null vfd->fileName means the vfd slot is valid; check > > FileIsValid). So later, when FileRead(vfd1) is called and that vfd1 > > happens to be the one that had got its kernel fd closed, it gets > > opened again through FileAccess(). > > > > I was under the impression that once the fd is closed due to excess kernel > fds that are opened, the slot in the VfdCache array could be reused by > someone else, but on closer inspection that is not true. It will be > only available for reuse after we explicitly call FileClose, right? Yes, that's right. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Fri, 6 Dec 2019 at 15:40, Amit Kapila wrote: > > On Thu, Dec 5, 2019 at 4:20 PM Amit Kapila wrote: > > > > On Tue, Dec 3, 2019 at 11:10 AM Amit Khandekar > > wrote: > > > > > > > > > Done as stated above; attached v3 patch. I have verified that the file > > > handles do get closed in PG_CATCH block via > > > ReorderBufferIterTXNFinish(). > > > > > > > I couldn't reproduce the original problem (on HEAD) reported with the > > test case in the patch. So, I can't verify the fix. I think it is > > because of recent commits cec2edfa7859279f36d2374770ca920c59c73dd8 and > > 9290ad198b15d6b986b855d2a58d087a54777e87. It seems you need to either > > change the value of logical_decoding_work_mem or change the test in > > some way so that the original problem can be reproduced. Yeah, it does seem like the commit cec2edfa78592 must have caused the test to not reproduce on your env, although the test does fail for me still. Setting logical_decoding_work_mem to a low value does sound like a good idea. Will work on it. > > > > Few comments: > -- > > 1. > + /* Now that the state fields are initialized, it is safe to return it. */ > + *iter_state = state; > + > /* allocate heap */ > state->heap = > binaryheap_allocate(state->nr_txns, > ReorderBufferIterCompare, > > Is there a reason for not initializing iter_state after > binaryheap_allocate? If we do so, then we don't need additional check > you have added in ReorderBufferIterTXNFinish. If iter_state is initialized *after* binaryheap_allocate, then we won't be able to close the vfds if binaryheap_allocate() ereports(). > > 2. > /* No harm in resetting the offset even in case of failure */ > file->curOffset = 0; > > The above comment is not clear because you are not setting it in case > of error rather this is a success path. I meant, even if PathNameOpenFile() failed, it is ok to set file->curOffset to 0, so need not set it only in case of *fd >= 0. In most of the cases, fd would be valid, so just set file->curOffset to 0 always. 
> > 3. > + * > + * Note: The iterator state is returned through iter_state parameter rather > + * than the function's return value. This is because the state gets cleaned > up > + * in a PG_CATCH block, so we want to make sure the caller gets back the > state > + * even if this function throws an exception, so that the state resources can > + * be cleaned up. > > How about changing it slightly as follows to make it more clear. > "Note: The iterator state is returned through iter_state parameter > rather than the function's return value. This is because the state > gets cleaned up in a PG_CATCH block in the caller, so we want to make > sure the caller gets back the state even if this function throws an > exception." Agreed. Will do that in the next patch version. > > 4. I think we should also check how much time increase will happen for > test_decoding regression test after the test added by this patch? Yeah, it currently takes noticeably longer compared to the others. Let's see if setting logical_decoding_work_mem to a min value allows us to reproduce the test with much lesser number of inserts. > > 5. One naive question about the usage of PathNameOpenFile(). When it > reaches the max limit, it will automatically close one of the files, > but how will that be reflected in the data structure (TXNEntryFile) > you are managing. Basically, when PathNameOpenFile closes some file, > how will the corresponding vfd in TXNEntryFile be changed. Because if > it is not changed, then won't it start pointing to some wrong > filehandle? In PathNameOpenFile(), excess kernel fds could be closed (ReleaseLruFiles). But with that, the vfds themselves don't get invalidated. Only the underlying kernel fd gets closed, and the vfd->fd is marked VFD_CLOSED. The vfd array element remains valid (a non-null vfd->fileName means the vfd slot is valid; check FileIsValid). 
So later, when FileRead(vfd1) is called and that vfd1 happens to be the one that had got its kernel fd closed, it gets opened again through FileAccess(). -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
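The vfd lifecycle described in the exchange above — a slot stays valid as long as its fileName is set, even after the LRU mechanism closes its kernel fd, and access reopens it on demand — can be modeled in a few lines. This is a toy model with invented names mirroring fd.c's FileIsValid / ReleaseLruFiles / FileAccess roles, not the real implementation:

```c
#include <assert.h>
#include <stddef.h>

#define VFD_CLOSED (-1)

/* Toy vfd: validity depends on fileName, not on the kernel fd. */
typedef struct Vfd
{
    int         fd;         /* kernel fd, or VFD_CLOSED */
    const char *fileName;   /* non-NULL means the slot is in use */
} Vfd;

static int next_kernel_fd = 200;    /* fake kernel fd allocator */

/* Like FileIsValid: the slot is valid while fileName is set. */
static int
file_is_valid(const Vfd *v)
{
    return v->fileName != NULL;
}

/* Like ReleaseLruFiles closing an excess kernel fd: the vfd survives,
 * only the underlying kernel fd is released. */
static void
lru_close(Vfd *v)
{
    v->fd = VFD_CLOSED;
}

/* Like FileAccess: reopen the kernel fd on demand before any I/O. */
static int
file_access(Vfd *v)
{
    if (v->fd == VFD_CLOSED)
        v->fd = next_kernel_fd++;   /* pretend open(v->fileName) */
    return v->fd;
}
```

The point the emails make falls out of the model: a caller holding a vfd never sees the LRU close, because the next FileRead-style operation transparently reopens the kernel fd; only an explicit FileClose frees the slot for reuse.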
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Wed, 27 Nov 2019 at 14:16, Amit Khandekar wrote: > What I found was: We do attempt to close the opened vfds in the > PG_CATCH block. In ReorderBufferCommit(), ReorderBufferIterTXNFinish > is called both in PG_TRY and PG_CATCH. This closes all the opened > vfds. But the issue is: if the ereport() occurs inside > ReorderBufferIterTXNInit(), then iterstate is still NULL. So in > PG_CATCH, ReorderBufferIterTXNFinish() is not called, so the vfds in > state->entries[] remain open. > > We can have iterstate passed to ReorderBufferIterTXNInit() as another > argument, and initialize it first thing inside the function. This way, > it will never be NULL. But need to be careful about the possibility of > having an iterstate in a half-cooked state, so cleanup might use some > uninitialized handles. Will work on it. At least, we can make sure the > iterstate->entries handle doesn't have junk values. Done as stated above; attached v3 patch. I have verified that the file handles do get closed in the PG_CATCH block via ReorderBufferIterTXNFinish(). -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company use_vfd_for_logrep_v3.patch Description: Binary data
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Tue, 26 Nov 2019 at 12:10, Amit Kapila wrote: > > On Tue, Nov 26, 2019 at 11:19 AM Amit Khandekar > wrote: > > > > On Tue, 26 Nov 2019 at 10:49, Amit Kapila wrote: > > > > > > > > > So, what is the next step here? How about if we somehow check whether > > > the file exists before doing unlink, say by using stat? > > But the thing is, the behaviour is so much in a grey area, that we > > cannot reliably say for instance that when stat() says there is no > > such file, there is indeed no such file, > > > > Why so? This was just an example. What I meant was, we are really not sure of the behaviour of file operations when the file is in this state. unlink() returning "Permission denied" when called twice is itself weird enough. > > > and if we re-create the same > > file when it is still open, it is always going to open a new file, > > etc. > > > > Yeah, or maybe even if we don't create with the same name, there is > always be some dangling file which again doesn't sound like a good > thing. Right. > > > > If that doesn't work, I think we might need to go in the direction of > > > tracking > > > file handles in some way, so that they can be closed during an abort. > > Yeah, that is one way. I am still working on different approaches. > > WIll get back with proposals. > > > > Fair enough. See, if you can also consider an approach that is local > to ReorderBuffer module wherein we can track those handles in > ReorderBufferTxn or some other place local to that module. What I found was : We do attempt to close the opened vfds in the PG_CATCH block. In ReorderBufferCommit(), ReorderBufferIterTXNFinish is called both in PG_TRY and PG_CATCH. This closes all the opened vfds. But the issue is : if the ereport() occurs inside ReorderBufferIterTXNInit(), then iterstate is still NULL. So in PG_CATCH, ReorderBufferIterTXNFinish() is not called, so the vfds in state->entries[] remain open. 
We can have iterstate passed to ReorderBufferIterTXNInit() as another argument, and initialize it first thing inside the function. This way, it will never be NULL. But need to be careful about the possibility of having an iterstate in a half-cooked state, so cleanup might use some uninitialized handles. Will work on it. At least, we can make sure the iterstate->entries handle doesn't have junk values. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
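The fix proposed above — hand the state back through an out-parameter *before* any step that can throw, so the catch block always has something to clean up — can be sketched outside PostgreSQL using setjmp/longjmp in place of PG_TRY/PG_CATCH/ereport. All names here (IterState, iter_init, iter_finish) are illustrative stand-ins, not the real ReorderBuffer API:

```c
#include <assert.h>
#include <setjmp.h>
#include <stdbool.h>
#include <stdlib.h>

typedef struct IterState
{
    int *entries;       /* resource that must be released on error */
} IterState;

static jmp_buf catch_buf;   /* stands in for the PG_CATCH target */

static void
iter_init(IterState **iter_state, bool fail_later)
{
    IterState *state = calloc(1, sizeof(IterState));

    state->entries = calloc(4, sizeof(int));

    /* Now that the fields are initialized, it is safe to return it:
     * even if a later step throws, the caller holds the state. */
    *iter_state = state;

    if (fail_later)
        longjmp(catch_buf, 1);  /* emulate ereport() after this point */
}

/* Stand-in for ReorderBufferIterTXNFinish: release the resources. */
static void
iter_finish(IterState *state)
{
    free(state->entries);
    free(state);
}
```

Had the assignment come after the failing step instead, the "catch" side would see a NULL pointer and the entries allocation would leak, which is exactly the open-vfds symptom the email describes.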
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Tue, 26 Nov 2019 at 10:49, Amit Kapila wrote: > > On Fri, Nov 22, 2019 at 7:38 PM Amit Khandekar wrote: > > > > On Fri, 22 Nov 2019 at 4:26 PM, Amit Kapila wrote: > >> > >> I think this is exactly the reason for the problem. In my test [1], > >> the error "permission denied" occurred when I executed > >> pg_logical_slot_get_changes() the second time, which means on first execution the > >> unlink would have been successful but the files are still not removed > >> as they were not closed. Then on second execution, it gets an error > >> "Permission denied" when it again tries to unlink files via > >> ReorderBufferCleanupSerializedTXNs(). > >> > >> > >> . > >> > But what you are seeing is "Permission denied" errors. Not sure why > >> > unlink() is failing. > >> > > >> > >> In your test program, if you try to unlink the file a second time, you > >> should see the error "Permission denied". > > > > I tested using the sample program and indeed I got the error 5 (access > > denied) when I called unlink the second time. > > > > So, what is the next step here? How about if we somehow check whether > the file exists before doing unlink, say by using stat? But the thing is, the behaviour is so much in a grey area that we cannot reliably say, for instance, that when stat() says there is no such file, there is indeed no such file, and if we re-create the same file when it is still open, it is always going to open a new file, etc. > If that doesn't work, I think we might need to go in the direction of tracking > file handles in some way, so that they can be closed during an abort. Yeah, that is one way. I am still working on different approaches. Will get back with proposals. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Fri, 22 Nov 2019 at 4:26 PM, Amit Kapila wrote: > On Fri, Nov 22, 2019 at 11:00 AM Amit Khandekar > wrote: > > > > On Fri, 22 Nov 2019 at 09:08, Amit Kapila > wrote: > > > Have you tried before that fix , if not, can you once try by > > > temporarily reverting that fix in your environment and share the > > > output of each step? After you get the error due to EOF, check that > > > you have .spill files in pg_replslot// and then again try > > > to get changes by pg_logical_slot_get_changes(). If you want, you > > > can use the test provided in Amit Khandekar's patch. > > > > On my Linux machine, I added elog() in ReorderBufferRestoreChanges(), > > just after FileRead() returns 0. This results in error. But the thing > is, in > > ReorderBufferCommit(), the error is already handled using PG_CATCH : > > > > PG_CATCH(); > > { > > . > >AbortCurrentTransaction(); > > ... > >if (using_subtxn) > > RollbackAndReleaseCurrentSubTransaction(); > > > > > >/* remove potential on-disk data, and deallocate */ > >ReorderBufferCleanupTXN(rb, txn); > > } > > > > So ReorderBufferCleanupTXN() removes all the .spill files using unlink(). > > > > And on Windows, what should happen is : unlink() should succeed > > because the file is opened using FILE_SHARE_DELETE. But the files > > should still remain there because these are still open. It is just > > marked for deletion until there is no one having opened the file. That > > is what is my conclusion from running a sample attached program test.c > > > > I think this is exactly the reason for the problem. In my test [1], > the error "permission denied" occurred when I second time executed > pg_logical_slot_get_changes() which means on first execution the > unlink would have been successful but the files are still not removed > as they were not closed. Then on second execution, it gets an error > "Permission denied" when it again tries to unlink files via > ReorderBufferCleanupSerializedTXNs(). > > > . 
> > But what you are seeing is "Permission denied" errors. Not sure why > > unlink() is failing. > > > > In your test program, if you try to unlink the file second time, you > should see the error "Permission denied". I tested using the sample program and indeed I got the error 5 (access denied) when I called unlink the second time. > > > [1] - > https://www.postgresql.org/message-id/CAA4eK1%2Bcey6i6a0zD9kk_eaDXb4RPNZqu4UwXO9LbHAgMpMBkg%40mail.gmail.com > > > > -- > With Regards, > Amit Kapila. > EnterpriseDB: http://www.enterprisedb.com > -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Fri, 22 Nov 2019 at 09:08, Amit Kapila wrote: > Have you tried before that fix; if not, can you once try by > temporarily reverting that fix in your environment and share the > output of each step? After you get the error due to EOF, check that > you have .spill files in pg_replslot// and then again try > to get changes by pg_logical_slot_get_changes(). If you want, you > can use the test provided in Amit Khandekar's patch. On my Linux machine, I added elog() in ReorderBufferRestoreChanges(), just after FileRead() returns 0. This results in an error. But the thing is, in ReorderBufferCommit(), the error is already handled using PG_CATCH:

PG_CATCH();
{
	...
	AbortCurrentTransaction();
	...
	if (using_subtxn)
		RollbackAndReleaseCurrentSubTransaction();
	...
	/* remove potential on-disk data, and deallocate */
	ReorderBufferCleanupTXN(rb, txn);
}

So ReorderBufferCleanupTXN() removes all the .spill files using unlink(). And on Windows, what should happen is: unlink() should succeed because the file is opened using FILE_SHARE_DELETE. But the files should still remain there because they are still open. The file is just marked for deletion until no one has it open. That is my conclusion from running the attached sample program, test.c. But what you are seeing is "Permission denied" errors. Not sure why unlink() is failing. The thing that is still a problem is: On Windows, if the file remains open, even when the unlink() later succeeds, the file will be left there until it is closed. So subsequent operations will open the same old file. Not sure what happens if we open a file that is marked for deletion. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company

#include <windows.h>
#include <io.h>
#include <stdio.h>
#include <stdlib.h>

void pgwin32_open(const char *fileName, int fileFlags,...)
{
	HANDLE h;
	SECURITY_ATTRIBUTES sa;

	sa.nLength = sizeof(sa);
	sa.bInheritHandle = TRUE;
	sa.lpSecurityDescriptor = NULL;

	h = CreateFile(fileName, GENERIC_READ,
				   /* These flags allow concurrent rename/unlink */
				   (FILE_SHARE_READ | FILE_SHARE_WRITE | FILE_SHARE_DELETE),
				   &sa, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
	if (h == INVALID_HANDLE_VALUE)
	{
		printf("File could not be opened. Error %d\n", GetLastError());
		exit(1);
	}

	/* CloseHandle(h); */

	if (_unlink(fileName) != 0)
	{
		printf("File could not be deleted. Error %d\n", GetLastError());
		exit(1);
	}
}

int main(void)
{
	pgwin32_open("c:/temp/amit/file", 0 /* O_RDONLY | PG_BINARY */);
	return 0;
}
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Wed, 20 Nov 2019 at 19:24, Alvaro Herrera wrote: > > On 2019-Nov-20, Juan José Santamaría Flecha wrote: > > > I was not able to reproduce the Permission denied error with current HEAD, > > until I opened another CMD inside the "pg_replslot/regression_slot" folder. > > This will be problematic, is the deletion of the folder actually needed? > > Yes :-( The code assumes that if the directory is there, then it's > valid. Trying to remove that assumption is probably a more invasive > fix. > > I think ReplicationSlotDropAcquired is too pessimistic (no recourse if > the rename fails) and too optimistic (this will almost never happen). > We could change it so that the rename is retried a few times, and avoid > the failure. (Naturally, the rmtree should also be retried.) The code > seems written with the POSIX semantics in mind, but it seems easy to > improve. Just to be clear, there are two issues being discussed here : 1. Issue with the patch, where pg_replslot/slotname/xid-*.spill files can't be removed because the same backend process has left these files opened because of an abort. This is happening despite the file being opened using FILE_SHARE_DELETE flag. I am going to investigate (possibly the flag is not applicable in case a single process is involved) 2. This existing issue where pg_replslot/slotname directory removal will fail if someone else is accessing this directory. This has nothing to do with the patch. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Wed, 20 Nov 2019 at 13:10, Amit Kapila wrote: > See comment in pgunlink() "We need to loop because even though > PostgreSQL uses flags that allow unlink while the file is open, other > applications might have the file > open without those flags.". Can you once see if there is any flag > that you have missed to pass to allow this? > If there is nothing we > can do about it, then we might need to use some different API or maybe > define a new API that can handle this. There were objections against modifying the vfd API only for this replication-related use case. A new API would require all the changes needed to get the virtual-FD behaviour that we want from vfd. If nothing works out from the FILE_SHARE_DELETE thing, I am thinking we can use vfds, keep track of per-subtransaction vfd handles, and do something similar to AtEOSubXact_Files(). -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
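The per-subtransaction tracking floated in the message above — remember which vfds each subtransaction opened, so that ending (or aborting) that subtransaction closes exactly those, the way AtEOSubXact_Files() cleans up allocated descriptors — can be sketched with a simple per-level table. The names and the fixed-size arrays here are hypothetical, chosen only to show the bookkeeping shape:

```c
#include <assert.h>

#define MAX_SUBXACT_LEVELS 8
#define MAX_TRACKED_FDS    16

/* vfds opened at each subtransaction nesting level */
static int tracked[MAX_SUBXACT_LEVELS][MAX_TRACKED_FDS];
static int ntracked[MAX_SUBXACT_LEVELS];

/* Record that this subtransaction level opened a vfd. */
static void
track_open(int subxact_level, int vfd)
{
    tracked[subxact_level][ntracked[subxact_level]++] = vfd;
}

/* On subtransaction end (commit or abort), close what it opened;
 * returns how many vfds were closed.  A real version would call
 * FileClose() on each tracked vfd here. */
static int
at_eosubxact_close(int subxact_level)
{
    int closed = ntracked[subxact_level];

    ntracked[subxact_level] = 0;
    return closed;
}
```

The payoff, in the context of this thread, is that an aborted subtransaction could no longer leave .spill file handles open: the abort path would walk its own level's list and close them, independent of what outer levels hold.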
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Wed, 20 Nov 2019 at 13:10, Amit Kapila wrote: > > On Tue, Nov 19, 2019 at 4:58 PM Amit Khandekar wrote: > > > > On Tue, 19 Nov 2019 at 14:07, Amit Kapila wrote: > > > > > > > > > Have you tried by injecting some error? After getting the error > > > mentioned above in email, when I retried the same query, I got the > > > below message. > > > > > > postgres=# SELECT 1 from > > > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1; > > > ERROR: could not remove file > > > "pg_replslot/regression_slot/xid-1693-lsn-0-1800.spill" during > > > removal of pg_replslot/regression_slot/xid*: Permission denied > > > > > > And, then I tried to drop the replication slot and I got below error. > > > postgres=# SELECT * FROM pg_drop_replication_slot('regression_slot'); > > > ERROR: could not rename file "pg_replslot/regression_slot" to > > > "pg_replslot/regression_slot.tmp": Permission denied > > > > > > It might be something related to Windows > > > > Oh ok, I missed the fact that on Windows we can't delete the files > > that are already open, unlike Linux/Unix. > > > > See comment in pgunlink() "We need to loop because even though > PostgreSQL uses flags that allow unlink while the file is open, other > applications might have the file > open without those flags.". Can you once see if there is any flag > that you have missed to pass to allow this? If there is nothing we > can do about it, then we might need to use some different API or maybe > define a new API that can handle this. Hmm, looks like there is one such flag: FILE_SHARE_DELETE. When file is opened with this flag, other processes can delete as well as rename the file. But it turns out that in pgwin32_open(), we already use FILE_SHARE_DELETE. So, this is again confusing. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Wed, 20 Nov 2019 at 01:05, Thomas Munro wrote: > > On Wed, Nov 20, 2019 at 7:58 AM Thomas Munro wrote: > > On Wed, Nov 20, 2019 at 1:14 AM Juan José Santamaría Flecha > > > https://devblogs.microsoft.com/oldnewthing/20150121-00/?p=44863 > > > > !?! Thanks Juan and Thomas for pointing to these links where this was already discussed. > > One thing I don't understand (besides, apparently, the documentation): > how did this problem escape detection by check-world for such a long > time? Surely we expect to hit the end of various temporary files in > various tests. Is it intermittent, or dependent on Windows version, > or something like that? Possibly there aren't any callers who try to pread() at end-of-file using FileRead/pg_pread : - mdread() seems to read from an offset that it knows is within the end of the file, including the whole BLCKSZ. - BufFileLoadBuffer() seems to deliberately ignore FileRead()'s return value if it is -1: if (file->nbytes < 0) file->nbytes = 0; - XLogPageRead() also seems to know that the offset is a valid offset. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Tue, 19 Nov 2019 at 14:07, Amit Kapila wrote: > > On Mon, Nov 18, 2019 at 5:50 PM Amit Khandekar wrote: > > > > On Mon, 18 Nov 2019 at 17:20, Amit Kapila wrote: > > > I see that you have made changes in ReorderBufferRestoreChanges to use > > > PathNameOpenFile, but not in ReorderBufferSerializeTXN. Is there a > > > reason for the same? In my test environment, with the test provided > > > by you, I got the error (reported in this thread) via > > > ReorderBufferSerializeTXN. > > > > You didn't get this error with the patch applied, did you ? > > > > No, I got this before applying the patch. However, after applying the > patch, I got below error in the same test: > > postgres=# SELECT 1 from > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1; > ERROR: could not read from reorderbuffer spill file: Invalid argument > > It seems to me that FileRead API used in the patch can return value < > 0 on EOF. See the API usage in BufFileLoadBuffer. I got this error > on a windows machine and in the server log the message was "LOG: > unrecognized win32 error code: 38" which indicates "Reached the end of > the file." On Windows, it is documented that ReadFile() (which is called by pg_pread) will return false on EOF but only when the file is open for asynchronous reads/writes. But here we are just dealing with usual synchronous reads. So pg_pread() code should indeed return 0 on EOF on Windows. Not yet able to figure out how FileRead() managed to return this error on Windows. But from your symptoms, it does look like pg_pread()=>ReadFile() returned false (despite doing synchronous reads), and so _dosmaperr() gets called, and then it does not find the eof error in doserrors[], so the "unrecognized win32 error code" message is printed. May have to dig up more on this. 
> > > If you were debugging this without the patch applied, I suspect that > > the reason why ReorderBufferSerializeTXN() => OpenTransientFile() is > > generating this error is because the max limit must be already crossed > > because of earlier calls to ReorderBufferRestoreChanges(). > > > > Note that in ReorderBufferSerializeTXN(), OpenTransientFile() is > > sufficient because the code in that function has made sure the fd gets > > closed there itself. > > > > Okay, then we might not need it there, but we should at least add a > comment in ReorderBufferRestoreChanges to explain why we have used a > different function to operate on the file at that place. Yeah, that might make sense. > > > > > For the API's that use VFDs (like PathNameOpenFile), the files opened > > are always recorded in the VfdCache array. So it is not required to do > > the cleanup at (sub)transaction end, because the kernel fds get closed > > dynamically in ReleaseLruFiles() whenever they reach max_safe_fds > > limit. So if a transaction aborts, the fds might remain open, but > > those will get cleaned up whenever we require more fds, through > > ReleaseLruFiles(). Whereas, for files opened through > > OpenTransientFile(), VfdCache is not involved, so this needs > > transaction end cleanup. > > > > Have you tried by injecting some error? After getting the error > mentioned above in email, when I retried the same query, I got the > below message. > > postgres=# SELECT 1 from > pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1; > ERROR: could not remove file > "pg_replslot/regression_slot/xid-1693-lsn-0-1800.spill" during > removal of pg_replslot/regression_slot/xid*: Permission denied > > And, then I tried to drop the replication slot and I got below error. 
> postgres=# SELECT * FROM pg_drop_replication_slot('regression_slot'); > ERROR: could not rename file "pg_replslot/regression_slot" to > "pg_replslot/regression_slot.tmp": Permission denied > > It might be something related to Windows Oh ok, I missed the fact that on Windows we can't delete the files that are already open, unlike Linux/Unix. I guess, I may have to use FD_CLOSE_AT_EOXACT flags; or simply use OpenTemporaryFile(). I wonder though if this same issue might come up for the other use-case of PathNameOpenFile() : logical_rewrite_log_mapping(). > but you can once try by > injecting some error after reading a few files in the code path and > see the behavior. Yeah, will check the behaviour, although on Linux, I think I won't get this error. But yes, like I mentioned above, I think we might have to arrange for something. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Mon, 18 Nov 2019 at 17:52, Amit Kapila wrote: > I have one more question regarding this patch. It seems to me that > the files opened via OpenTransientFile or OpenTemporaryFile are > automatically closed at transaction end(abort), but that doesn't seem > to be the case for files opened with PathNameOpenFile. See > AtEOXact_Files and AtEOSubXact_Files. So, now with the change > proposed by this patch, don't we need to deal it in some other way? For the API's that use VFDs (like PathNameOpenFile), the files opened are always recorded in the VfdCache array. So it is not required to do the cleanup at (sub)transaction end, because the kernel fds get closed dynamically in ReleaseLruFiles() whenever they reach max_safe_fds limit. So if a transaction aborts, the fds might remain open, but those will get cleaned up whenever we require more fds, through ReleaseLruFiles(). Whereas, for files opened through OpenTransientFile(), VfdCache is not involved, so this needs transaction end cleanup. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Mon, 18 Nov 2019 at 17:20, Amit Kapila wrote: > I see that you have made changes in ReorderBufferRestoreChanges to use > PathNameOpenFile, but not in ReorderBufferSerializeTXN. Is there a > reason for the same? In my test environment, with the test provided > by you, I got the error (reported in this thread) via > ReorderBufferSerializeTXN. You didn't get this error with the patch applied, did you ? If you were debugging this without the patch applied, I suspect that the reason why ReorderBufferSerializeTXN() => OpenTransientFile() is generating this error is because the max limit must be already crossed because of earlier calls to ReorderBufferRestoreChanges(). Note that in ReorderBufferSerializeTXN(), OpenTransientFile() is sufficient because the code in that function has made sure the fd gets closed there itself. If you are getting this error even with the patch applied, then this needs investigation. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
On Thu, 7 Nov 2019 at 14:02, Rahila Syed wrote: > > Hi Amit, > > I am reading about this feature and reviewing it. > To start with, I reviewed the patch: > 0005-Doc-changes-describing-details-about-logical-decodin.patch. Thanks for picking up the patch review. Your reply somehow spawned a new mail thread, so I reverted back to this thread for replying. > > >prevent VACUUM from removing required rows from the system catalogs, > >hot_standby_feedback should be set on the standby. In spite of that, > >if any required rows get removed on standby, the slot gets dropped. > IIUC, you mean `if any required rows get removed on *the master* the slot gets > dropped`, right? Yes, you are right. In fact, I think it is not necessary to explicitly mention where the rows get removed. So I have just omitted "on standby". Will include this change in the next patch versions. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
On Thu, 10 Oct 2019 at 05:49, Craig Ringer wrote: > > On Tue, 1 Oct 2019 at 02:08, Robert Haas wrote: >> >> >> Why does create_logical_slot_on_standby include sleep(1)? > > > Yeah, we really need to avoid sleeps in regression tests. Yeah, have already got rid of the sleeps from the patch-series version 4 onwards. By the way, a couple of patches in the patch series had bitrotted. Attached is the rebased version. Thanks -Amit Khandekar logicaldecodng_standby_v4_rebased.tar.gz Description: application/gzip
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Wed, 18 Sep 2019 at 12:24, Amit Khandekar wrote: > Probably, for now at least, what everyone seems to agree is to take my > earlier attached patch forward. > > I am going to see if I can add a TAP test for the patch, and will add > the patch into the commitfest soon. Attached is an updated patch v2. Has a new test scenario added in contrib/test_decoding/sql/spill test, and some minor code cleanup. Going to add this into Nov commitfest. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company diff --git a/contrib/test_decoding/expected/spill.out b/contrib/test_decoding/expected/spill.out index 10734bd..e1d150b6 100644 --- a/contrib/test_decoding/expected/spill.out +++ b/contrib/test_decoding/expected/spill.out @@ -247,6 +247,25 @@ GROUP BY 1 ORDER BY 1; 'serialize-nested-subbig-subbigabort-subbig-3 | 5000 | table public.spill_test: INSERT: data[text]:'serialize-nested-subbig-subbigabort-subbig-3:5001' | table public.spill_test: INSERT: data[text]:'serialize-nested-subbig-subbigabort-subbig-3:1' (2 rows) +-- large number of spilling subxacts. Should not exceed maxAllocatedDescs +BEGIN; +DO $$ +BEGIN +FOR i IN 1..600 LOOP +BEGIN +INSERT INTO spill_test select generate_series(1, 5000) ; +EXCEPTION +when division_by_zero then perform 'dummy'; +END; +END LOOP; +END $$; +COMMIT; +SELECT 1 from pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1; + ?column? +-- +1 +(1 row) + DROP TABLE spill_test; SELECT pg_drop_replication_slot('regression_slot'); pg_drop_replication_slot diff --git a/contrib/test_decoding/sql/spill.sql b/contrib/test_decoding/sql/spill.sql index e638cac..715d1f9 100644 --- a/contrib/test_decoding/sql/spill.sql +++ b/contrib/test_decoding/sql/spill.sql @@ -174,6 +174,21 @@ SELECT (regexp_split_to_array(data, ':'))[4] COLLATE "C", COUNT(*), (array_agg(d FROM pg_logical_slot_get_changes('regression_slot', NULL,NULL) WHERE data ~ 'INSERT' GROUP BY 1 ORDER BY 1; +-- large number of spilling subxacts. 
Should not exceed maxAllocatedDescs +BEGIN; +DO $$ +BEGIN +FOR i IN 1..600 LOOP +BEGIN +INSERT INTO spill_test select generate_series(1, 5000) ; +EXCEPTION +when division_by_zero then perform 'dummy'; +END; +END LOOP; +END $$; +COMMIT; +SELECT 1 from pg_logical_slot_get_changes('regression_slot', NULL,NULL) LIMIT 1; + DROP TABLE spill_test; SELECT pg_drop_replication_slot('regression_slot'); diff --git a/src/backend/replication/logical/reorderbuffer.c b/src/backend/replication/logical/reorderbuffer.c index 8ce28ad..1e5a725 100644 --- a/src/backend/replication/logical/reorderbuffer.c +++ b/src/backend/replication/logical/reorderbuffer.c @@ -103,13 +103,21 @@ typedef struct ReorderBufferTupleCidEnt CommandId combocid; /* just for debugging */ } ReorderBufferTupleCidEnt; +/* Virtual file descriptor with file offset tracking */ +typedef struct TXNEntryFile +{ + File vfd; /* -1 when the file is closed */ + off_t curOffset; /* offset for next write or read. Reset to 0 + * when vfd is opened. */ +} TXNEntryFile; + /* k-way in-order change iteration support structures */ typedef struct ReorderBufferIterTXNEntry { XLogRecPtr lsn; ReorderBufferChange *change; ReorderBufferTXN *txn; - int fd; + TXNEntryFile file; XLogSegNo segno; } ReorderBufferIterTXNEntry; @@ -194,7 +202,7 @@ static void ReorderBufferSerializeTXN(ReorderBuffer *rb, ReorderBufferTXN *txn); static void ReorderBufferSerializeChange(ReorderBuffer *rb, ReorderBufferTXN *txn, int fd, ReorderBufferChange *change); static Size ReorderBufferRestoreChanges(ReorderBuffer *rb, ReorderBufferTXN *txn, - int *fd, XLogSegNo *segno); + TXNEntryFile *file, XLogSegNo *segno); static void ReorderBufferRestoreChange(ReorderBuffer *rb, ReorderBufferTXN *txn, char *change); static void ReorderBufferRestoreCleanup(ReorderBuffer *rb, ReorderBufferTXN *txn); @@ -988,7 +996,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn) for (off = 0; off < state->nr_txns; off++) { - state->entries[off].fd = -1; + 
state->entries[off].file.vfd = -1; state->entries[off].segno = 0; } @@ -1013,7 +1021,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn) { /* serialize remaining changes */ ReorderBufferSerializeTXN(rb, txn); - ReorderBufferRestoreChanges(rb, txn, &state->entries[off].fd, + ReorderBufferRestoreChanges(rb, txn, &state->entries[off].file, &state->entries[off].segno); } @@ -1043,7 +1051,7 @@ ReorderBufferIterTXNInit(ReorderBuffer *rb, ReorderBufferTXN *txn) /* serialize remaining changes */ ReorderBufferSerializeTXN(rb, cur_txn); ReorderBufferRestoreChanges(rb, cur_txn, - &state->entries[off].fd, + &state->entries[off].file, &state->entr
Re: Minimal logical decoding on standbys
On Fri, 27 Sep 2019 at 23:21, Robert Haas wrote: > > On Fri, Sep 27, 2019 at 12:41 PM Amit Khandekar > wrote: > > Preferably I want wait_for_xmins() to get rid of the $node parameter, > > because we can deduce it using slot name. But that requires having > > get_node_from_slotname(). Your suggestion was to remove > > get_node_from_slotname() and add back the $node param so as to reduce > > duplicate code. Instead, how about keeping wait_for_xmins() in the > > PostgresNode.pm() ? This way, we won't have duplication, and also we > > can get rid of param $node. This is just my preference; if you are > > quite inclined to not have get_node_from_slotname(), I will go with > > your suggestion. > > I'd be inclined not to have it. I think having a lookup function to > go from slot name -> node is strange; it doesn't really simplify > things that much for the caller, and it makes the logic harder to > follow. It would break outright if you had the same slot name on > multiple nodes, which is a perfectly reasonable scenario. Alright. Attached is the updated patch that splits the file into two files, one that does only xmin related testing, and the other test file that tests conflict recovery scenarios, and also one scenario where drop-database drops the slots on the database on standby. Removed get_slot_xmins() and get_node_from_slotname(). Renamed 'replica' to 'standby'. Used node->backup() function instead of pg_basebackup command. Renamed $master_slot to $master_slotname, similarly for $standby_slot. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company logicaldecodng_standby_v3.tar.gz Description: GNU Zip compressed data
Re: Minimal logical decoding on standbys
On Fri, 27 Sep 2019 at 01:57, Robert Haas wrote: > > On Thu, Sep 26, 2019 at 5:14 AM Amit Khandekar wrote: > > Actually in some of the conflict-recovery testcases, I am still using > > wait_for_xmins() so that we could test the xmin values back after we > > drop the slots. So xmin-related testing is embedded in these recovery > > tests as well. We can move the wait_for_xmins() function to some > > common file and then do the split of this file, but then effectively > > some of the xmin-testing would go into the recovery-related test file, > > which did not sound sensible to me. What do you say ? > > I agree we don't want code duplication, but I think we could reduce > the code duplication to a pretty small amount with a few cleanups. > > I don't think wait_for_xmins() looks very well-designed. It goes to > trouble of returning a value, but only 2 of the 6 call sites pay > attention to the returned value. I think we should change the > function so that it doesn't return anything and have the callers that > want a return value call get_slot_xmins() after wait_for_xmins(). Yeah, that can be done. > > And then I think we should turn around and get rid of get_slot_xmins() > altogether. Instead of: > > my ($xmin, $catalog_xmin) = get_slot_xmins($master_slot); > is($xmin, '', "xmin null"); > is($catalog_xmin, '', "catalog_xmin null"); > > We can write: > > my $slot = $node_master->slot($master_slot); > is($slot->{'xmin'}, '', "xmin null"); > is($slot->{'catalog_xmin'}, '', "catalog xmin null"); > > ...which is not really any longer or harder to read, but does > eliminate the need for one function definition. Agreed. > > Then I think we should change wait_for_xmins so that it takes three > arguments rather than two: $node, $slotname, $check_expr. With that > and the previous change, we can get rid of get_node_from_slotname(). 
> > At that point, the body of wait_for_xmins() would consist of a single > call to $node->poll_query_until() or die(), which doesn't seem like > too much code to duplicate into a new file. Earlier it used to have 3 params, the same ones you mentioned. I removed $node for caller convenience. > > Looking at it at a bit more, though, I wonder why the recovery > conflict scenario is even using wait_for_xmins(). It's hard-coded to > check the state of the master_physical slot, which isn't otherwise > manipulated by the recovery conflict tests. What's the point of > testing that a slot which had xmin and catalog_xmin NULL before the > test started (line 414) and which we haven't changed since still has > those values at two different points during the test (lines 432, 452)? > Perhaps I'm missing something here, but it seems like this is just an > inadvertent entangling of these scenarios with the previous scenarios, > rather than anything that necessarily needs to be connected together. In the "Drop slot" test scenario, we are testing that after we manually drop the slot on standby, the master catalog_xmin should be back to NULL. Hence, the call to wait_for_xmins(). And in the "Scenario 1 : hot_standby_feedback off", wait_for_xmins() is called the first time only as a mechanism to ensure that "hot_standby_feedback = off" has taken effect. At the end of this test, wait_for_xmins() again is called only to ensure that hot_standby_feedback = on has taken effect. Preferably I want wait_for_xmins() to get rid of the $node parameter, because we can deduce it using slot name. But that requires having get_node_from_slotname(). Your suggestion was to remove get_node_from_slotname() and add back the $node param so as to reduce duplicate code. Instead, how about keeping wait_for_xmins() in the PostgresNode.pm() ? This way, we won't have duplication, and also we can get rid of param $node. 
This is just my preference; if you are quite inclined to not have get_node_from_slotname(), I will go with your suggestion. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
do the split of this file, but then effectively some of the xmin-testing would go into the recovery-related test file, which did not sound sensible to me. What do you say ? Attached patch series has the test changes addressed. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company logicaldecodng_standby_v2.tar.gz Description: GNU Zip compressed data
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Tue, 17 Sep 2019 at 21:19, Andres Freund wrote: > On 2019-09-14 14:34:21 -0400, Tom Lane wrote: > > Amit Khandekar writes: > > > Yeah, something like the attached patch. I think this tracking of > > > offsets would have been cleaner if we add in-built support in VFD. But > > > yeah, for back branches at least, we need to handle it outside of VFD. > > > Or maybe we would add it if we find one more use-case. > > > > Again, we had that and removed it, for what seem to me to be solid > > reasons. It adds cycles when we're forced to close/reopen a file, > > and it also adds failure modes that we could do without (ie, failure > > of either the ftell or the lseek, which are particularly nasty because > > they shouldn't happen according to the VFD abstraction). Ok. So you mean, when the caller would call FileRead() for sequential reading, underneath VFD would do a pread(), but if pread() returns an error, the errno can belong to read() or it might as well belong to lseek(). If it's due to lseek(), it's not expected by the caller because for the caller it's just a sequential read. Yeah, makes sense. >> I do not > > think there is going to be any argument strong enough to make us > > put it back, especially not for non-mainstream callers like logical > > decoding. Ok. Also, more about putting back is in the below comments ... > > Yea, I think that's the right call. Avoiding kernel seeks is quite > worthwhile, and we shouldn't undo it just because of this usecase. And > that'll become more and more important performance-wise (and has already > done so, with all the intel fixes making syscalls much slower). By the way, I was not thinking about adding back the read() and lseek() calls. I was saying we continue to use the pread() call, so it's just a single system call. FileReadAt(..., offset) would do pread() with a user-supplied offset, and FileRead() would do pread() using an internally tracked offset. 
So for the user, FileReadAt() is like pread(), and FileRead() would be like read(). But I agree with Tom's objection about having to unnecessarily handle lseek error codes. > > I could see an argument for adding a separate generic layer providing > position tracking ontop of the VFD abstraction however. Seems quite > possible that there's some other parts of the system that could benefit > from using VFDs rather than plain fds. And they'd probably also need the > positional tracking. Yeah, that also could be done. Probably, for now at least, what everyone seems to agree is to take my earlier attached patch forward. I am going to see if I can add a TAP test for the patch, and will add the patch into the commitfest soon. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Fri, 13 Sep 2019 at 22:01, Robert Haas wrote: > On Fri, Sep 13, 2019 at 12:14 PM Tom Lane wrote: > > Again, though, the advice that's been given here is that we should > > fix logical decoding to use the VFD API as it stands, not change > > that API. I concur with that. > > A reasonable position. So I guess logical decoding has to track the > file position itself, but perhaps use the VFD layer for managing FD > pooling. Yeah, something like the attached patch. I think this tracking of offsets would have been cleaner if we add in-built support in VFD. But yeah, for back branches at least, we need to handle it outside of VFD. Or maybe we would add it if we find one more use-case. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: logical decoding : exceeded maxAllocatedDescs for .spill files
On Thu, 12 Sep 2019 at 19:11, Robert Haas wrote: > > On Thu, Sep 12, 2019 at 5:31 AM Tomas Vondra > wrote: > > I don't see how the current API could do that transparently - it does > > track the files, but the user only gets a file descriptor. With just a > > file descriptor, how could the code know to do reopen/seek when it's going > > just through the regular fopen/fclose? > > > > Anyway, I agree we need to do something, to fix this corner case (many > > serialized in-progress transactions). ISTM we have two options - either do > > something in the context of reorderbuffer.c, or extend the transient file > > API somehow. I'd say the second option is the right thing going forward, > > because it does allow doing it transparently and without leaking details > > about maxAllocatedDescs etc. There are two issues, though - it does > > require changes / extensions to the API, and it's not backpatchable. > > > > So maybe we should start with the localized fix in reorderbuffer, and I > > agree tracking offset seems reasonable. > > We've already got code that knows how to track this sort of thing. You mean tracking excess kernel fds right ? Yeah, we can use VFDs so that excess fds are automatically closed. But Alvaro seems to be talking in context of tracking of file seek position. VFD does not have a mechanism to track file offsets if one of the vfd cached file is closed and reopened. Robert, are you suggesting to add this capability to VFD ? I agree that we could do it, but for back-patching, offhand I couldn't think of a simpler way. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
On Mon, 9 Sep 2019 at 16:06, Amit Khandekar wrote: > > On Tue, 3 Sep 2019 at 23:10, Alvaro Herrera wrote: > > > > On 2019-Jul-19, Amit Khandekar wrote: > > > > > Attached are the split patches. Included is an additional patch that > > > has doc changes. Here is what I have added in the docs. Pasting it > > > here so that all can easily spot how it is supposed to behave, and to > > > confirm that we are all on the same page : > > > > ... Apparently, this patch was not added to the commitfest for some > > reason; and another patch that *is* in the commitfest has been said to > > depend on this one (Petr's https://commitfest.postgresql.org/24/1961/ > > which hasn't been updated in quite a while and has received no feedback > > at all, not even from the listed reviewer Shaun Thomas). To make > > matters worse, Amit's patchset no longer applies. > > > > What I would like to do is add a link to this thread to CF's 1961 entry > > and then update all these patches, in order to get things moving. > > Hi Alvaro, > > Thanks for notifying about this. Will work this week on rebasing this > patchset and putting it into the 2019-11 commit fest. Rebased patch set attached. Added in the Nov commitfest : https://commitfest.postgresql.org/25/2283/ -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company logicaldecodng_standby_v1_rebased.tar.gz Description: GNU Zip compressed data
logical decoding : exceeded maxAllocatedDescs for .spill files
Hi, I reproduced the error "exceeded maxAllocatedDescs (492) while trying to open file ...", which was also discussed about in the thread [1]. This issue is similar but not exactly the same as [1]. In [1], the file for which this error used to show up was "pg_logical/mappings/map" , while here it's the .spill file. And here the issue , in short, seems to be : The .spill file does not get closed there and then, unlike in [1] where there was a file descriptor leak. I could reproduce it using a transaction containing a long series of sub-transactions (possibly could be reproduced without sub-transactions, but looking at the code I could come up with this script using sub-transactions easily) : create table tab(id int); -- Function that does huge changes in a single transaction create or replace function f(id int) returns int as $$ begin -- Iterate this more than 492 times (max transient file descriptors PG would typically allow) -- This will create that many sub-transactions due to presence of exception block. FOR i IN 1..600 LOOP BEGIN -- Iterate more than 4096 times (so that changes spill to disk: max_changes_in_memory) FOR j IN 1..5000 LOOP insert into tab values (1); END LOOP; EXCEPTION when division_by_zero then perform 'dummy'; END; END LOOP; return id; end $$ language plpgsql; SELECT * FROM pg_create_logical_replication_slot('logical', 'test_decoding'); begin; select f(1); -- Do huge changes in a single transaction commit; \! 
pg_recvlogical -d postgres --slot=logical --verbose --start -f - pg_recvlogical: starting log streaming at 0/0 (slot logical) pg_recvlogical: streaming initiated pg_recvlogical: confirming write up to 0/0, flush to 0/0 (slot logical) BEGIN 1869 pg_recvlogical: confirming write up to 0/1B6D6E38, flush to 0/1B6D6E38 (slot logical) pg_recvlogical: error: unexpected termination of replication stream: ERROR: exceeded maxAllocatedDescs (492) while trying to open file "pg_replslot/logical/xid-2362-lsn-0-2400.spill" pg_recvlogical: disconnected; waiting 5 seconds to try again Looking at the code, what might be happening is, ReorderBufferIterTXNInit()=>ReorderBufferRestoreChanges() opens the files, but leaves them open if end of file is not reached. Eventually if end of file is reached, it gets closed. The function returns back without closing the file descriptor if reorder buffer changes being restored are more than max_changes_in_memory. Probably later on, the rest of the changes get restored in another ReorderBufferRestoreChanges() call. But meanwhile, if there are a lot of such files opened, we can run out of the max files that PG decides to keep open (it has some logic that takes into account system files allowed to be open, and already opened). Offhand, what I am thinking is, we need to close the file descriptor before returning from ReorderBufferRestoreChanges(), and keep track of the file offset and file path, so that next time we can resume reading from there. Comments ? [1] https://www.postgresql.org/message-id/flat/738a590a-2ce5-9394-2bef-7b1caad89b37%402ndquadrant.com -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
On Tue, 3 Sep 2019 at 23:10, Alvaro Herrera wrote: > > On 2019-Jul-19, Amit Khandekar wrote: > > > Attached are the split patches. Included is an additional patch that > > has doc changes. Here is what I have added in the docs. Pasting it > > here so that all can easily spot how it is supposed to behave, and to > > confirm that we are all on the same page : > > ... Apparently, this patch was not added to the commitfest for some > reason; and another patch that *is* in the commitfest has been said to > depend on this one (Petr's https://commitfest.postgresql.org/24/1961/ > which hasn't been updated in quite a while and has received no feedback > at all, not even from the listed reviewer Shaun Thomas). To make > matters worse, Amit's patchset no longer applies. > > What I would like to do is add a link to this thread to CF's 1961 entry > and then update all these patches, in order to get things moving. Hi Alvaro, Thanks for notifying about this. Will work this week on rebasing this patchset and putting it into the 2019-11 commit fest.
Re: POC: Cleaning up orphaned files using undo logs
ssCompleted(uur->uur_txn->urec_progress))
{
    if (dbid_exists(uur->uur_txn->urec_dbid))
    {
        (void) RegisterRollbackReq(InvalidUndoRecPtr, undo_recptr,
                                   uur->uur_txn->urec_dbid, uur->uur_fxid);
        pending_abort = true;
    }
}
-
+UndoRecordRelease(uur);
+uur = NULL;
+}
.
.
+Assert(uur == NULL);
+
+/* If we reach here, this means there is something to discard. */
+need_discard = true;
+} while (true);

It looks like it is neither necessary to set uur to NULL, nor necessary to have the Assert(uur == NULL): at the start of each iteration, uur is assigned a fresh value anyway, which may or may not be NULL.

-
+ * over undo logs is complete, new undo can is allowed to be written in the

new undo can is allowed => new undo is allowed

+ * hash table size. So before start allowing any new transaction to write the

before start allowing => before allowing any new transactions to start writing the

-
+/* Get the smallest of 'xid having pending undo' and 'oldestXmin' */
+oldestXidHavingUndo = RollbackHTGetOldestFullXid(oldestXidHavingUndo);
+
+
+if (FullTransactionIdIsValid(oldestXidHavingUndo))
+pg_atomic_write_u64(>oldestFullXidHavingUnappliedUndo,
+U64FromFullTransactionId(oldestXidHavingUndo));

Is it possible that the FullTransactionId returned by RollbackHTGetOldestFullXid() could be invalid ? If not, then the if condition above can be changed to an Assert().

-
+ * If the log is already discarded, then we are done. It is important
+ * to first check this to ensure that tablespace containing this log
+ * doesn't get dropped concurrently.
+ */
+LWLockAcquire(>mutex, LW_SHARED);
+/*
+ * We don't have to worry about slot recycling and check the logno
+ * here, since we don't care about the identity of this slot, we're
+ * visiting all of them.

I guess it's accidental that the LWLockAcquire() call is *between* the two comments ?
---
+if (UndoRecPtrGetCategory(undo_recptr) == UNDO_SHARED)
+{
+/*
+ * For the "shared" category, we only discard when the
+ * rm_undo_status callback tells us we can.
+ */
+status = RmgrTable[uur->uur_rmid].rm_undo_status(uur, _xid);

The status variable could be declared in this block itself.

-
Some variable declaration alignments and comment spacing need changes as per pgindent.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
Re: POC: Cleaning up orphaned files using undo logs
On Tue, 23 Jul 2019 at 08:48, Amit Kapila wrote: > > On Mon, Jul 22, 2019 at 8:39 PM Amit Khandekar wrote: > > > > On Mon, 22 Jul 2019 at 14:21, Amit Kapila wrote: > > > > - > > > > +UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo) > > +{ > > + /* Block concurrent access. */ > > + LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE); > > + > > + MyUndoWorker = >workers[slot]; > > Not sure why MyUndoWorker is used here. Can't we use a local variable > > ? Or do we intentionally attach to the slot as a side-operation ? > > > > - > > > > I think here, we can use a local variable as well. Do you see any > problem with the current code or do you think it is better to use a > local variable here? I think, even though there might not be a correctness issue with the current code as it stands, we should still use a local variable. Updating MyUndoWorker is a big side-effect, which the caller is not supposed to be aware of, because all that function should do is just get the slot info. > > > -- > > > > + if (!InsertRequestIntoErrorUndoQueue(urinfo)) > > I was thinking what happens if for some reason > > InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the > > entry will not be marked invalid, and so there will be no undo action > > carried out because I think the undo worker will exit. What happens > > next with this entry ? > > The same entry is present in two queues xid and size, so next time it > will be executed from the second queue based on it's priority in that > queue. However, if it fails again a second time in the same way, then > we will be in trouble because now the hash table has entry, but none > of the queues has entry, so none of the workers will attempt to > execute again. Also, when discard worker again tries to register it, > we won't allow adding the entry to queue thinking either some backend > is executing the same or it must be part of some queue. 
> > The one possibility to deal with this could be that we somehow allow > discard worker to register it again in the queue or we can do this in > critical section so that it allows system restart on error. However, > the main thing is it possible that InsertRequestIntoErrorUndoQueue > will fail unless there is some bug in the code? If so, we might want > to have an Assert for this rather than handling this condition.

Yes, I also think that the function would error out only because of can't-happen cases, like "too many locks taken" or "out of binary heap slots" or "out of memory" (although this last one is not really a can't-happen case). These cases would probably be caused by some bug, I suppose. But I was wondering: generally, when the code errors out with such can't-happen elog() calls, the worst thing that happens is that the transaction gets aborted. Whereas in this case, the worst thing that could happen is that the undo action would never get executed, which means selects for this tuple will keep on accessing the undo log. That does not sound like a data consistency issue, so we should be fine after all ?

Some further review comments for undoworker.c :

+/* Sets the worker's lingering status. */
+static void
+UndoWorkerIsLingering(bool sleep)

The function name sounds like "is the worker lingering ?". Can we rename it to something like "UndoWorkerSetLingering" ?

-
+ errmsg("undo worker slot %d is empty, cannot attach",
+ slot)));
+ }
+
+ if (MyUndoWorker->proc)
+ {
+ LWLockRelease(UndoWorkerLock);
+ ereport(ERROR,
+ (errcode(ERRCODE_OBJECT_NOT_IN_PREREQUISITE_STATE),
+ errmsg("undo worker slot %d is already used by "
+ "another worker, cannot attach", slot)));

These two error messages can share a common error message "could not attach to worker slot", with a separate errdetail for each of them :
slot %d is empty.
slot %d is already used by another worker.
--
+static int
+IsUndoWorkerAvailable(void)
+{
+ int i;
+ int alive_workers = 0;
+
+ LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);

The return value should be bool. Also, why does it keep track of the number of alive workers ? It sounds like it used to return the number of alive workers earlier. If it indeed only needs to return true/false, we can do away with alive_workers. Also, the *exclusive* lock is unnecessary.

--
+if (UndoGetWork(false, false, , NULL) &&
+IsUndoWorkerAvailable())
+UndoWorkerLaunch(urinfo);

There is no lock acquired between IsUndoWorkerAvailable() and UndoWorkerLaunch(); that means even though IsUndoWorkerAvailable() returns true, there is a small window where UndoWorkerLaunch() does not find any worker slot with in_use false
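To illustrate the race pointed out above: the availability check and the slot claim need to happen inside a single lock acquisition, otherwise another launcher can grab the slot in the window between them. Below is a minimal standalone sketch of that pattern; all names (claim_worker_slot, slot_in_use, N_SLOTS) are invented for illustration, and a pthread mutex stands in for what would be UndoWorkerLock in the patch.

```c
#include <pthread.h>
#include <stdbool.h>

#define N_SLOTS 1

/* Toy worker-slot pool guarded by a single lock. */
static bool slot_in_use[N_SLOTS];
static pthread_mutex_t slot_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Check availability and claim a slot in ONE critical section, so no
 * other launcher can take the slot between the check and the claim.
 * Returns the claimed slot index, or -1 if none is free.
 */
int
claim_worker_slot(void)
{
    int result = -1;

    pthread_mutex_lock(&slot_lock);
    for (int i = 0; i < N_SLOTS; i++)
    {
        if (!slot_in_use[i])
        {
            slot_in_use[i] = true;  /* claim while still holding the lock */
            result = i;
            break;
        }
    }
    pthread_mutex_unlock(&slot_lock);
    return result;
}
```

With this shape, the caller never sees "available" and then fails to launch: either it gets a slot index it already owns, or it gets -1 and can back off.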
Re: POC: Cleaning up orphaned files using undo logs
On Fri, 19 Jul 2019 at 17:24, Amit Khandekar wrote:
>
> On Thu, 9 May 2019 at 12:04, Dilip Kumar wrote:
>
> > Patches can be applied on top of undo branch [1] commit:
> > (cb777466d008e656f03771cf16ec7ef9d6f2778b)
> >
> > [1] https://github.com/EnterpriseDB/zheap/tree/undo
>
> Below are some review points for 0009-undo-page-consistency-checker.patch :

Another point that I missed :

+* Process the undo record of the page and mask their cid filed.
+*/
+ while (next_record < page_end)
+ {
+ UndoRecordHeader *header = (UndoRecordHeader *) next_record;
+
+ /* If this undo record has cid present, then mask it */
+ if ((header->urec_info & UREC_INFO_CID) != 0)

Here, even though the next record starts in the current page, the urec_info itself may or may not lie on this page. I hope this possibility is also considered when populating the partial-record-specific details in the page header.

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
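To make the concern concrete, here is a toy version of the guard the masking loop would need. DemoRecordHeader and urec_info_on_page are invented names and the real UndoRecordHeader layout differs; the point is only that before dereferencing header->urec_info, the code should verify that those bytes actually lie on the current page, since a record can start near page_end with the rest of its header continuing on the next page.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for UndoRecordHeader; the field layout is invented. */
typedef struct DemoRecordHeader
{
    uint16_t    urec_info;
    uint16_t    other;
} DemoRecordHeader;

/*
 * Copy out urec_info only if every byte of that field is on this page.
 * Returns false when the field continues on the next page.
 */
bool
urec_info_on_page(const char *next_record, const char *page_end,
                  uint16_t *info_out)
{
    size_t  avail = (size_t) (page_end - next_record);

    if (avail < offsetof(DemoRecordHeader, urec_info) + sizeof(uint16_t))
        return false;           /* field spills onto the next page */
    memcpy(info_out, next_record + offsetof(DemoRecordHeader, urec_info),
           sizeof(uint16_t));
    return true;
}
```

The memcpy also sidesteps any alignment hazard from casting an arbitrary page offset to a struct pointer, which the patch's direct cast would not.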
Re: POC: Cleaning up orphaned files using undo logs
On Mon, 22 Jul 2019 at 14:21, Amit Kapila wrote:

I have started review of 0014-Allow-execution-and-discard-of-undo-by-background-wo.patch. Below are some quick comments to start with:

+++ b/src/backend/access/undo/undoworker.c

+#include "access/xact.h"
+#include "access/undorequest.h"

Order is not alphabetical.

+ * Each undo worker then start reading from one of the queue the requests for

start => starts
queue => queues

-
+ rc = WaitLatch(MyLatch,
+ WL_LATCH_SET | WL_TIMEOUT | WL_POSTMASTER_DEATH,
+ 10L, WAIT_EVENT_BGWORKER_STARTUP);
+
+ /* emergency bailout if postmaster has died */
+ if (rc & WL_POSTMASTER_DEATH)
+ proc_exit(1);

I think now, thanks to commit cfdf4dc4fc9635a, you don't have to explicitly handle postmaster death; instead you can use WL_EXIT_ON_PM_DEATH. Please check all such places where this is done in this patch.

-
+UndoWorkerGetSlotInfo(int slot, UndoRequestInfo *urinfo)
+{
+ /* Block concurrent access. */
+ LWLockAcquire(UndoWorkerLock, LW_EXCLUSIVE);
+
+ MyUndoWorker = >workers[slot];

Not sure why MyUndoWorker is used here. Can't we use a local variable ? Or do we intentionally attach to the slot as a side-operation ?

-
+ * Get the dbid where the wroker should connect to and get the worker

wroker => worker

-
+ BackgroundWorkerInitializeConnectionByOid(urinfo.dbid, 0, 0);

0, 0 => InvalidOid, 0

+ * Set the undo worker request queue from which the undo worker start
+ * looking for a work.

start => should start
a work => work

--
+ if (!InsertRequestIntoErrorUndoQueue(urinfo))

I was wondering what happens if for some reason InsertRequestIntoErrorUndoQueue() itself errors out. In that case, the entry will not be marked invalid, and so there will be no undo action carried out, because I think the undo worker will exit. What happens next with this entry ?
Re: POC: Cleaning up orphaned files using undo logs
On Thu, 9 May 2019 at 12:04, Dilip Kumar wrote:

> Patches can be applied on top of undo branch [1] commit:
> (cb777466d008e656f03771cf16ec7ef9d6f2778b)
>
> [1] https://github.com/EnterpriseDB/zheap/tree/undo

Below are some review points for 0009-undo-page-consistency-checker.patch :

+ /* Calculate the size of the partial record. */
+ partial_rec_size = UndoRecordHeaderSize(phdr->uur_info) +
+ phdr->tuple_len + phdr->payload_len -
+ phdr->record_offset;

There is already an UndoPagePartialRecSize() function which calculates the size of a partial record, and which seems to do the same as the above. If it is indeed the same, you can omit the above code, and instead, down below where you increment next_record, you can do "next_record += UndoPagePartialRecSize()". Also, I see an extra sizeof(uint16) added in UndoPagePartialRecSize(). Not sure which one is correct and which one wrong, unless I am wrong in assuming that the above calculation and the function definition do the same thing.

--
+ * We just want to mask the cid in the undo record header. So
+ * only if the partial record in the current page include the undo
+ * record header then we need to mask the cid bytes in this page.
+ * Otherwise, directly jump to the next record.

Here, I think you mean: "So only if the partial record in the current page includes the *cid* bytes", rather than "includes the undo record header". Maybe we can say: "We just want to mask the cid. So do the partial record masking only if the current page includes the cid bytes from the partial record header."

+ if (phdr->record_offset < (cid_offset + sizeof(CommandId)))
+ {
+ char *cid_data;
+ Size mask_size;
+
+ mask_size = Min(cid_offset - phdr->record_offset,
+ sizeof(CommandId));
+
+ cid_data = next_record + cid_offset - phdr->record_offset;
+ memset(_data, MASK_MARKER, mask_size);
+

Here, if record_offset lies *between* the cid start and the cid end, then cid_offset - phdr->record_offset will be negative, and so will be mask_size. Probably abs() should do the work.
Also, an Assert(cid_data + mask_size <= page_end) would be nice. I know the cid position of a partial record cannot go beyond the page boundary, but it's better to have this Assert as a sanity check.

+ * Process the undo record of the page and mask their cid filed.

filed => field

--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company
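For what it's worth, computing the mask as the intersection of the cid byte range with the byte range actually present on the page avoids the negative-size case entirely, rather than patching it up with abs(). A standalone sketch with invented names (mask_cid_on_page, DemoCommandId); the real code would work in terms of next_record/phdr rather than these parameters:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MASK_MARKER 0xFF

typedef uint32_t DemoCommandId;     /* stand-in for CommandId */

/*
 * Mask whichever bytes of the cid field fall on the current page.
 * record_offset: bytes of the record already emitted on earlier pages.
 * cid_offset: offset of the cid field from the start of the record.
 * page_start points at the first record byte on this page; on_page is
 * how many record bytes this page holds. The mask is the intersection
 * of [cid_offset, cid_offset + sizeof(cid)) with
 * [record_offset, record_offset + on_page), so it can never go
 * negative. Returns the number of bytes masked.
 */
size_t
mask_cid_on_page(char *page_start, size_t on_page,
                 size_t record_offset, size_t cid_offset)
{
    size_t  cid_start = cid_offset;
    size_t  cid_end = cid_offset + sizeof(DemoCommandId);
    size_t  page_lo = record_offset;
    size_t  page_hi = record_offset + on_page;
    size_t  lo = cid_start > page_lo ? cid_start : page_lo;
    size_t  hi = cid_end < page_hi ? cid_end : page_hi;

    if (lo >= hi)
        return 0;               /* no cid bytes on this page */
    memset(page_start + (lo - record_offset), MASK_MARKER, hi - lo);
    return hi - lo;
}
```

With the cid at record offset 8 and the page starting at record offset 10 (the "between cid start and cid end" case from the review), only the two trailing cid bytes get masked; when record_offset is past the cid end, nothing is masked.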
Re: Minimal logical decoding on standbys
On Wed, 10 Jul 2019 at 17:12, Amit Khandekar wrote:
>
> On Wed, 10 Jul 2019 at 08:44, Andres Freund wrote:
> >
> > Hi,
> >
> > Thanks for the new version! Looks like we're making progress towards
> > something committable here.
> >
> > I think it'd be good to split the patch into a few pieces. I'd maybe do
> > that like:
> > 1) WAL format changes (plus required other changes)
> > 2) Recovery conflicts with slots
> > 3) logical decoding on standby
> > 4) tests
>
> All right. Will do that in the next patch set. For now, I have quickly
> done the below changes in a single patch again (attached), in order to
> get early comments if any.

Attached are the split patches. Included is an additional patch with doc changes. Here is what I have added in the docs; pasting it here so that everyone can easily spot how it is supposed to behave, and to confirm that we are all on the same page :

"A logical replication slot can also be created on a hot standby. To prevent VACUUM from removing required rows from the system catalogs, hot_standby_feedback should be set on the standby. In spite of that, if any required rows get removed on the standby, the slot gets dropped. Existing logical slots on the standby also get dropped if wal_level on the primary is reduced to less than 'logical'. To create a logical slot, a historic snapshot has to be built, for which information about all currently running transactions is essential. On the primary, this information is readily available, but on a standby it has to be obtained from the primary. So slot creation may wait for some activity to happen on the primary. If the primary is idle, creating a logical slot on the standby may take a noticeable amount of time."

logicaldecodng_standby.tar.gz
Description: GNU Zip compressed data
Re: Minimal logical decoding on standbys
On Tue, 16 Jul 2019 at 22:56, Andres Freund wrote: > > Hi, > > On 2019-07-12 14:53:21 +0530, tushar wrote: > > On 07/10/2019 05:12 PM, Amit Khandekar wrote: > > > All right. Will do that in the next patch set. For now, I have quickly > > > done the below changes in a single patch again (attached), in order to > > > get early comments if any. > > Thanks Amit for your patch. i am able to see 1 issues on Standby server - > > (where logical replication slot created ) , > > a)size of pg_wal folder is NOT decreasing even after firing get_changes > > function > > Even after calling pg_logical_slot_get_changes() multiple times? What > does > SELECT * FROM pg_replication_slots; before and after multiple calls return? > > Does manually forcing a checkpoint with CHECKPOINT; first on the primary > and then the standby "fix" the issue? I independently tried to reproduce this issue on my machine yesterday. I observed that : sometimes, the files get cleaned up after two or more pg_logical_slot_get_changes(). Sometimes, I have to restart the server to see the pg_wal files cleaned up. This happens more or less the same even for logical slot on *primary*. Will investigate further with Tushar. > > > > b)pg_wal files are not recycling and every time it is creating new files > > after firing get_changes function > > I'm not sure what you mean by this. Are you saying that > pg_logical_slot_get_changes() causes WAL to be written? > > Greetings, > > Andres Freund -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
;will". > > > > > @@ -2879,6 +2882,25 @@ RecoveryConflictInterrupt(ProcSignalReason reason) > > case PROCSIG_RECOVERY_CONFLICT_LOCK: > > case PROCSIG_RECOVERY_CONFLICT_TABLESPACE: > > case PROCSIG_RECOVERY_CONFLICT_SNAPSHOT: > > + case PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT: > > + /* > > + * For conflicts that require a logical slot > > to be dropped, the > > + * requirement is for the signal receiver to > > release the slot, > > + * so that it could be dropped by the signal > > sender. So for > > + * normal backends, the transaction should be > > aborted, just > > + * like for other recovery conflicts. But if > > it's walsender on > > + * standby, then it has to be killed so as to > > release an > > + * acquired logical slot. > > + */ > > + if (am_cascading_walsender && > > + reason == > > PROCSIG_RECOVERY_CONFLICT_LOGICALSLOT && > > + MyReplicationSlot && > > SlotIsLogical(MyReplicationSlot)) > > + { > > + RecoveryConflictPending = true; > > + QueryCancelPending = true; > > + InterruptPending = true; > > + break; > > + } > > Huh, I'm not following as to why that's needed for walsenders? For normal backends, we ignore this signal if we aren't in a transaction (block). But for walsender, there is no transaction, but we cannot ignore the signal. This is because walsender can keep a logical slot acquired when it was spawned by "pg_recvlogical --start". So we can't ignore the signal. So the only way that we can make it release the acquired slot is to kill it. > > > > @@ -1499,6 +1499,7 @@ pg_stat_get_db_conflict_all(PG_FUNCTION_ARGS) > > > > dbentry->n_conflict_tablespace + > > dbentry->n_conflict_lock + > > > > dbentry->n_conflict_snapshot + > > + > > dbentry->n_conflict_logicalslot + > > > > dbentry->n_conflict_bufferpin + > > > > dbentry->n_conflict_startup_deadlock); > > I think this probably needs adjustments in a few more places, > e.g. monitoring.sgml... Oops, yeah, to search for similar additions, I had looked for "conflict_snapshot" using cscope. 
I should have done the same using "git grep". Done now. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company logical-decoding-on-standby_v12.patch Description: Binary data
Re: Minimal logical decoding on standbys
On Thu, 4 Jul 2019 at 17:21, Amit Khandekar wrote: > > On Thu, 4 Jul 2019 at 15:52, tushar wrote: > > > > On 07/01/2019 11:04 AM, Amit Khandekar wrote: > > > > Also, in the updated patch (v11), I have added some scenarios that > > verify that slot is dropped when either master wal_level is > > insufficient, or when slot is conflicting. Also organized the test > > file a bit. > > > > One scenario where replication slot removed even after fixing the problem > > (which Error message suggested to do) > > Which specific problem are you referring to ? Removing a conflicting > slot, itself is the part of the fix for the conflicting slot problem. > > > > > Please refer this below scenario > > > > Master cluster- > > postgresql,conf file > > wal_level=logical > > hot_standby_feedback = on > > port=5432 > > > > Standby cluster- > > postgresql,conf file > > wal_level=logical > > hot_standby_feedback = on > > port=5433 > > > > both Master/Slave cluster are up and running and are in SYNC with each other > > Create a logical replication slot on SLAVE ( SELECT * from > > pg_create_logical_replication_slot('m', 'test_decoding'); ) > > > > change wal_level='hot_standby' on Master postgresql.conf file / restart the > > server > > Run get_changes function on Standby - > > postgres=# select * from pg_logical_slot_get_changes('m',null,null); > > ERROR: logical decoding on standby requires wal_level >= logical on master > > > > Correct it on Master postgresql.conf file ,i.e set wal_level='logical' > > again / restart the server > > and again fire get_changes function on Standby - > > postgres=# select * from pg_logical_slot_get_changes('m',null,null); > > ERROR: replication slot "m" does not exist > > > > This looks little weird as slot got dropped/removed internally . i guess it > > should get invalid rather than removed automatically. > > Lets user's delete the slot themself rather than automatically removed as > > a surprise. 
> > It was earlier discussed about what action should be taken when we > find conflicting slots. Out of the options, one was to drop the slot, > and we chose that because that was simple. See this : > https://www.postgresql.org/message-id/flat/20181212204154.nsxf3gzqv3gesl32%40alap3.anarazel.de Sorry, the above link is not the one I wanted to refer to. Correct one is this : https://www.postgresql.org/message-id/20181214005521.jaty2d24lz4nroil%40alap3.anarazel.de -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company
Re: Minimal logical decoding on standbys
On Thu, 4 Jul 2019 at 15:52, tushar wrote: > > On 07/01/2019 11:04 AM, Amit Khandekar wrote: > > Also, in the updated patch (v11), I have added some scenarios that > verify that slot is dropped when either master wal_level is > insufficient, or when slot is conflicting. Also organized the test > file a bit. > > One scenario where replication slot removed even after fixing the problem > (which Error message suggested to do) Which specific problem are you referring to ? Removing a conflicting slot, itself is the part of the fix for the conflicting slot problem. > > Please refer this below scenario > > Master cluster- > postgresql,conf file > wal_level=logical > hot_standby_feedback = on > port=5432 > > Standby cluster- > postgresql,conf file > wal_level=logical > hot_standby_feedback = on > port=5433 > > both Master/Slave cluster are up and running and are in SYNC with each other > Create a logical replication slot on SLAVE ( SELECT * from > pg_create_logical_replication_slot('m', 'test_decoding'); ) > > change wal_level='hot_standby' on Master postgresql.conf file / restart the > server > Run get_changes function on Standby - > postgres=# select * from pg_logical_slot_get_changes('m',null,null); > ERROR: logical decoding on standby requires wal_level >= logical on master > > Correct it on Master postgresql.conf file ,i.e set wal_level='logical' > again / restart the server > and again fire get_changes function on Standby - > postgres=# select * from pg_logical_slot_get_changes('m',null,null); > ERROR: replication slot "m" does not exist > > This looks little weird as slot got dropped/removed internally . i guess it > should get invalid rather than removed automatically. > Lets user's delete the slot themself rather than automatically removed as a > surprise. It was earlier discussed about what action should be taken when we find conflicting slots. Out of the options, one was to drop the slot, and we chose that because that was simple. 
See this : https://www.postgresql.org/message-id/flat/20181212204154.nsxf3gzqv3gesl32%40alap3.anarazel.de By the way, you are getting the "logical decoding on standby requires wal_level >= logical on master" error while using the slot, which is because we reject the command even before checking the existence of the slot. Actually the slot is already dropped due to master wal_level. Then when you correct the master wal_level, the command is not rejected, and proceeds to use the slot and then it is found that the slot does not exist. -- Thanks, -Amit Khandekar EnterpriseDB Corporation The Postgres Database Company