Re: Probable memory leak with ECPG and AIX

2021-12-31 Thread Noah Misch
On Wed, Dec 15, 2021 at 04:20:42PM +0100, Benoit Lobréau wrote:
> * with LDR_CNTRL=MAXDATA=0x1000, we reach 256MB but there is no
> segfault, the program just continues running;
> * with LDR_CNTRL=MAXDATA=0x8000, we reach 2GB and there is no segfault
> either, the program just continues running.

I get the same results.  The leak arises because AIX freelocale() doesn't free
all memory allocated in newlocale().  The following program uses trivial
memory on GNU/Linux, but it leaks like you're seeing on AIX:

#include <locale.h>
int main(int argc, char **argv)
{
while (1)
freelocale(newlocale(LC_NUMERIC_MASK, "C", (locale_t) 0));
return 0;
}

If you have access to file an AIX bug, I recommend doing so.  If we want
PostgreSQL to work around this, one idea is to have ECPG do this newlocale()
less often.  For example, do it once per process or once per connection
instead of once per ecpg_do_prologue().




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2021-12-31 Thread Zhihong Yu
On Fri, Dec 31, 2021 at 5:45 PM David Rowley  wrote:

> On Fri, 3 Dec 2021 at 20:36, Michael Paquier  wrote:
> > Two months later, this has been switched to RwF.
>
> I was discussing this patch with Andres. He's not very keen on my
> densehash hash table idea and suggested that instead of relying on
> trying to make the hash table iteration faster, why don't we just
> ditch the lossiness of the resource owner code which only records the
> first 16 locks, and just make it have a linked list of all locks.
>
> This is a little more along the lines of the original patch, however,
> it does not increase the size of the LOCALLOCK struct.
>
> I've attached a patch which does this.  This was mostly written
> (quickly) by Andres, I just did a cleanup, fixed up a few mistakes and
> fixed a few bugs.
>
> I ran the same performance tests on this patch as I did back in [1]:
>
> -- Test 1. Select 1 record from a 140 partitioned table. Tests
> creating a large number of locks with a fast query.
>
> create table hp (a int, b int) partition by hash(a);
> select 'create table hp'||x||' partition of hp for values with
> (modulus 140, remainder ' || x || ');' from generate_series(0,139)x;
> \gexec
> create index on hp (b);
> insert into hp select x,x from generate_series(1, 14) x;
> analyze hp;
>
> select3.sql: select * from hp where b = 1
>
> select3.sql master
>
> drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres
> tps = 2099.708748 (without initial connection time)
> tps = 2100.398516 (without initial connection time)
> tps = 2094.882341 (without initial connection time)
> tps = 2113.218820 (without initial connection time)
> tps = 2104.717597 (without initial connection time)
>
> select3.sql patched
>
> drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres
> tps = 2010.070738 (without initial connection time)
> tps = 1994.963606 (without initial connection time)
> tps = 1994.668849 (without initial connection time)
> tps = 1995.948168 (without initial connection time)
> tps = 1985.650945 (without initial connection time)
>
> You can see that there's a performance regression here. I've not yet
> studied why this appears.
>
> -- Test 2. Tests a prepared query which will perform a generic plan on
> the 6th execution, then fall back to a custom plan due to it pruning all
> but one partition.  Master suffers from the lock table becoming
> bloated after locking all partitions when planning the generic plan.
>
> create table ht (a int primary key, b int, c int) partition by hash (a);
> select 'create table ht' || x::text || ' partition of ht for values
> with (modulus 8192, remainder ' || (x)::text || ');' from
> generate_series(0,8191) x;
> \gexec
>
> select.sql:
> \set p 1
> select * from ht where a = :p
>
> select.sql master
> drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres
> tps = 18014.460090 (without initial connection time)
> tps = 17973.358889 (without initial connection time)
> tps = 17847.480647 (without initial connection time)
> tps = 18038.332507 (without initial connection time)
> tps = 17776.143206 (without initial connection time)
>
> select.sql patched
> drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres
> tps = 32393.457106 (without initial connection time)
> tps = 32277.204349 (without initial connection time)
> tps = 32160.719830 (without initial connection time)
> tps = 32530.038130 (without initial connection time)
> tps = 32299.019657 (without initial connection time)
>
> You can see that there are some quite good performance gains with this
> test.
>
> I'm going to add this to the January commitfest.
>
> David
>
> [1]
> https://www.postgresql.org/message-id/CAKJS1f8Lt0kS4bb5EH%3DhV%2BksqBZNnmVa8jujoYBYu5PVhWbZZg%40mail.gmail.com

Hi,

+   locallock->nLocks -= locallockowner->nLocks;
+   Assert(locallock->nLocks >= 0);

I think the assertion is not needed, since the above code is inside this if
block:

+   if (locallockowner->nLocks < locallock->nLocks)

so the condition locallock->nLocks >= 0 would always hold after the
subtraction.
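
A minimal standalone illustration of that invariant (the structs here are
simplified stand-ins borrowing the patch's field names, not the real
LOCALLOCK definitions):

```c
#include <assert.h>

/* Simplified stand-ins for the patch's structures. */
typedef struct { int nLocks; } LocalLockOwner;
typedef struct { int nLocks; } LocalLock;

/* Release the owner's share of the lock counts.  Inside the guarded
 * branch, the result of the subtraction is strictly positive, so an
 * Assert(locallock->nLocks >= 0) there can never fire. */
void
release_owner_locks(LocalLock *locallock, LocalLockOwner *owner)
{
	if (owner->nLocks < locallock->nLocks)
	{
		locallock->nLocks -= owner->nLocks;
		assert(locallock->nLocks > 0);	/* implied by the guard above */
	}
}
```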

Cheers


Re: Fix BUG #17335: Duplicate result rows in Gather node

2021-12-31 Thread David Rowley
On Fri, 31 Dec 2021 at 00:14, Yura Sokolov  wrote:
> Problem:
> - An Append path is created with explicitly parallel_aware = true
> - It has two children; one is trivial, the other is parallel_aware = false.
>   The trivial child is dropped.
> - A Gather/GatherMerge path takes the Append path as a child and thinks
>   its child is parallel_aware = true.
> - But the Append path is removed at the end since it has only one child.
> - Now Gather/GatherMerge thinks its child is parallel_aware, but it
>   is not.
>   Gather/GatherMerge runs its child twice, in a worker and in the leader,
>   and gathers the same rows twice.

Thanks for the report. I can confirm that I can recreate the problem
with your script.

I will look into this further later next week.

David




Re: Speed up transaction completion faster after many relations are accessed in a transaction

2021-12-31 Thread David Rowley
On Fri, 3 Dec 2021 at 20:36, Michael Paquier  wrote:
> Two months later, this has been switched to RwF.

I was discussing this patch with Andres. He's not very keen on my
densehash hash table idea and suggested that instead of relying on
trying to make the hash table iteration faster, why don't we just
ditch the lossiness of the resource owner code which only records the
first 16 locks, and just make it have a linked list of all locks.

This is a little more along the lines of the original patch, however,
it does not increase the size of the LOCALLOCK struct.

I've attached a patch which does this.  This was mostly written
(quickly) by Andres, I just did a cleanup, fixed up a few mistakes and
fixed a few bugs.

I ran the same performance tests on this patch as I did back in [1]:

-- Test 1. Select 1 record from a 140 partitioned table. Tests
creating a large number of locks with a fast query.

create table hp (a int, b int) partition by hash(a);
select 'create table hp'||x||' partition of hp for values with
(modulus 140, remainder ' || x || ');' from generate_series(0,139)x;
\gexec
create index on hp (b);
insert into hp select x,x from generate_series(1, 14) x;
analyze hp;

select3.sql: select * from hp where b = 1

select3.sql master

drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2099.708748 (without initial connection time)
tps = 2100.398516 (without initial connection time)
tps = 2094.882341 (without initial connection time)
tps = 2113.218820 (without initial connection time)
tps = 2104.717597 (without initial connection time)

select3.sql patched

drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2010.070738 (without initial connection time)
tps = 1994.963606 (without initial connection time)
tps = 1994.668849 (without initial connection time)
tps = 1995.948168 (without initial connection time)
tps = 1985.650945 (without initial connection time)

You can see that there's a performance regression here. I've not yet
studied why this appears.

-- Test 2. Tests a prepared query which will perform a generic plan on
the 6th execution, then fall back to a custom plan due to it pruning all
but one partition.  Master suffers from the lock table becoming
bloated after locking all partitions when planning the generic plan.

create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values
with (modulus 8192, remainder ' || (x)::text || ');' from
generate_series(0,8191) x;
\gexec

select.sql:
\set p 1
select * from ht where a = :p

select.sql master
drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 18014.460090 (without initial connection time)
tps = 17973.358889 (without initial connection time)
tps = 17847.480647 (without initial connection time)
tps = 18038.332507 (without initial connection time)
tps = 17776.143206 (without initial connection time)

select.sql patched
drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 32393.457106 (without initial connection time)
tps = 32277.204349 (without initial connection time)
tps = 32160.719830 (without initial connection time)
tps = 32530.038130 (without initial connection time)
tps = 32299.019657 (without initial connection time)

You can see that there are some quite good performance gains with this test.

I'm going to add this to the January commitfest.

David

[1] 
https://www.postgresql.org/message-id/CAKJS1f8Lt0kS4bb5EH%3DhV%2BksqBZNnmVa8jujoYBYu5PVhWbZZg%40mail.gmail.com


speedup_releasing_all_locks.patch
Description: Binary data


slowest tap tests - split or accelerate?

2021-12-31 Thread Andres Freund
Hi,

cfbot now runs most tests on windows; the windows task is by far the slowest,
and the one most limited in concurrency [2]. Running tap tests is the
biggest part of that. This is a bigger issue on windows because we don't have
infrastructure (yet) to run tests in parallel.

There are a few tests which stand out in their slowness and seem worth
addressing even if we tackle test parallelism on windows at some point. I
often find them to be the slowest tests on linux too.

Picking a random successful cfbot run [1] I see the following tap tests taking
more than 20 seconds:

67188 ms pg_basebackup t/010_pg_basebackup.pl
59710 ms recovery t/001_stream_rep.pl
57542 ms pg_rewind t/001_basic.pl
56179 ms subscription t/001_rep_changes.pl
42146 ms pgbench t/001_pgbench_with_server.pl
38264 ms recovery t/018_wal_optimize.pl
33642 ms subscription t/013_partition.pl
29129 ms pg_dump t/002_pg_dump.pl
25751 ms pg_verifybackup t/002_algorithm.pl
20628 ms subscription t/011_generated.pl

It would be good if we could make those tests faster, or if not easily
possible, at least split those tests into smaller tap tests.

Splitting a longer test into smaller ones is preferable even if they take the
same time, because we can use prove-level concurrency on windows to gain some
test parallelism. As a nice side-effect, it also makes it quicker to run a
split test in isolation during development.
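
For illustration, once a big test is split, prove-level parallelism can be
requested when running TAP tests from a configured PostgreSQL source tree (a
sketch; the target directory and job count are arbitrary):

```shell
# Run one directory's TAP tests with four parallel prove jobs
make -C src/bin/pg_basebackup check PROVE_FLAGS="-j4"

# or invoke prove directly on the split test files
prove -j4 t/*.pl
```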

Greetings,

Andres Freund

[1] https://cirrus-ci.com/task/5207126145499136
[2] https://cirrus-ci.org/faq/#are-there-any-limits




Re: Add header support to text format and matching feature

2021-12-31 Thread Rémi Lapeyre
Here’s an updated version of the patch that takes into account the changes in 
d1029bb5a2. The actual code is the same as v10 which was already marked as 
ready for committer.




v11-0001-Add-header-support-to-COPY-TO-text-format.patch
Description: Binary data


v11-0002-Add-header-matching-mode-to-COPY-FROM.patch
Description: Binary data


> On 11 Apr 2021, at 16:35, Rémi Lapeyre  wrote:
> 
>> 
>>> This now reads "Represents whether the header must be absent, present or
>>> match.".
>> 
>> Since match shouldn't be preceded with be, I think we can say:
>> 
>> Represents whether the header must match, be absent or be present.
> 
> Thanks, here’s a v10 version of the patch that fixes this.
> 
> 



Re: Column Filtering in Logical Replication

2021-12-31 Thread Justin Pryzby
> @@ -5963,8 +5967,20 @@ describePublications(const char *pattern)
>   {
>   /* Get the tables for the specified publication */
>   printfPQExpBuffer(,
> -   "SELECT n.nspname, 
> c.relname\n"
> -   "FROM 
> pg_catalog.pg_class c,\n"
> +   "SELECT n.nspname, 
> c.relname, \n");
> + if (pset.sversion >= 15)
> + appendPQExpBufferStr(,
> +  "  
>  CASE WHEN pr.prattrs IS NOT NULL THEN\n"
> +  "  
>  pg_catalog.array_to_string"
> +  
> "(ARRAY(SELECT attname\n"
> +  "  
>FROM pg_catalog.generate_series(0, 
> pg_catalog.array_upper(pr.prattrs::int[], 1)) s,\n"
> +  "  
> pg_catalog.pg_attribute\n"
> +  "  
>   WHERE attrelid = c.oid AND attnum = prattrs[s]), ', ')\n"
> +  "  
>  ELSE NULL END AS columns");
> + else
> + appendPQExpBufferStr(, "NULL as columns");
> + appendPQExpBuffer(,
> +   "\nFROM 
> pg_catalog.pg_class c,\n"
> " 
> pg_catalog.pg_namespace n,\n"
> " 
> pg_catalog.pg_publication_rel pr\n"
> "WHERE c.relnamespace 
> = n.oid\n"

I suppose this should use pr.prattrs::pg_catalog.int2[] ?
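
For illustration, the column-list expansion with the schema-qualified cast
would look roughly like this (a sketch assuming prattrs is an int2vector, as
in the patch; not the final committed query):

```sql
-- Expand a publication relation's column list, casting through
-- pg_catalog.int2[] rather than the unqualified int[]:
SELECT pg_catalog.array_to_string(
         ARRAY(SELECT attname
                 FROM pg_catalog.generate_series(0,
                        pg_catalog.array_upper(pr.prattrs::pg_catalog.int2[], 1)) s,
                      pg_catalog.pg_attribute
                WHERE attrelid = c.oid AND attnum = pr.prattrs[s]), ', ')
  FROM pg_catalog.pg_publication_rel pr
  JOIN pg_catalog.pg_class c ON c.oid = pr.prrelid;
```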

Did the DatumGetBool issue expose a deficiency in testing ?
I think the !am_partition path was never being hit.

-- 
Justin




Re: Extend compatibility of PostgreSQL::Test::Cluster

2021-12-31 Thread Andrew Dunstan


On 12/31/21 11:20, Dagfinn Ilmari Mannsåker wrote:
> Andrew Dunstan  writes:
>
>> +my $subclass = __PACKAGE__ . "::V_$maj";
>> +bless $node, $subclass;
>> +unless ($node->isa(__PACKAGE__))
>> +{
>> +# It's not a subclass, so re-bless back into the main 
>> package
>> +bless($node, __PACKAGE__);
>> +carp "PostgreSQL::Test::Cluster isn't fully compatible 
>> with version $ver";
>> +}
> The ->isa() method works on package names as well as blessed objects, so
> the back-and-forth blessing can be avoided.
>
>   my $subclass = __PACKAGE__ . "::V_$maj";
>   if ($subclass->isa(__PACKAGE__))
>   {
>   bless($node, $subclass);
>   }
>   else
>   {
>   carp "PostgreSQL::Test::Cluster isn't fully compatible with 
> version $ver";
>   }
>

OK, thanks, will fix in next version.


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com





Re: Extend compatibility of PostgreSQL::Test::Cluster

2021-12-31 Thread Dagfinn Ilmari Mannsåker
Andrew Dunstan  writes:

> + my $subclass = __PACKAGE__ . "::V_$maj";
> + bless $node, $subclass;
> + unless ($node->isa(__PACKAGE__))
> + {
> + # It's not a subclass, so re-bless back into the main 
> package
> + bless($node, __PACKAGE__);
> + carp "PostgreSQL::Test::Cluster isn't fully compatible 
> with version $ver";
> + }

The ->isa() method works on package names as well as blessed objects, so
the back-and-forth blessing can be avoided.

my $subclass = __PACKAGE__ . "::V_$maj";
if ($subclass->isa(__PACKAGE__))
{
bless($node, $subclass);
}
else
{
carp "PostgreSQL::Test::Cluster isn't fully compatible with 
version $ver";
}

- ilmari




Introducing PgVA aka PostgresVectorAcceleration using SIMD vector instructions starting with hex_encode

2021-12-31 Thread Hans Buschmann
INTENTION

Inspired by the effort to integrate JIT for executor acceleration, I thought
that selected simple algorithms working with array-oriented data could be
drastically accelerated by using SIMD instructions on modern hardware.

I want to introduce this style of programming with the example of hex_encode:
- operates on arrays (bytea)
- simple algorithm
- in some situations partially limits performance (e.g. pg_dump)
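
For reference, a minimal scalar implementation of the operation being
vectorized looks like this (a generic sketch of the per-byte loop that the
SIMD versions replace, not PostgreSQL's actual hex_encode):

```c
#include <stddef.h>

/* Lookup table mapping a nibble to its lowercase hex character. */
static const char hextbl[] = "0123456789abcdef";

/* Scalar baseline: encode len bytes from src as 2*len hex characters
 * in dst (dst must have room for 2*len bytes); returns bytes written. */
size_t
hex_encode_scalar(const char *src, size_t len, char *dst)
{
	const char *end = src + len;

	while (src < end)
	{
		unsigned char c = (unsigned char) *src++;

		*dst++ = hextbl[c >> 4];
		*dst++ = hextbl[c & 0xF];
	}
	return len * 2;
}
```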

IMPLEMENTATION GUIDELINES

The main goal is to accelerate common cases on the most common hardware by
exploiting all the resources the hardware delivers.
The following guidelines led me to a first implementation:

- restrict to 64-bit architectures
These are the dominant server architectures, and they have the necessary
data formats and corresponding registers and operating instructions
- start with Intel x86-64 SIMD instructions:
This is the vastly most used platform, available for development and in 
practical use
- don’t restrict the concept to only Intel x86-64, so that later people with 
more experience on other architectures can jump in and implement comparable 
algorithms
- fall back to the established implementation in postgres in inappropriate
cases or on user request (GUC)
- implementation of leaf function/procedures in assembly language
These consist mostly of a central loop without calling subroutines or 
doing additionally branching

- coding for maximum hardware usage instead of elegant programming
Once tested, the simple algorithm works as advertised and is used to
replace most execution parts of the standard implementation in C

- isolated footprint by integrating it only in the specific subroutine (here
hex_encode)
This ensures that the requirements for fast execution are met (e.g.
buffer sizes) and no repeated checks are needed as in a library use case.

- trying to keep both vector execution ports always doing useful work by 
avoiding waits for latencies

- trying to access memory in linear fashion (reading from input buffer, writing
to output buffer) to avoid internal cache problems
- focus optimization on the most advanced SIMD instruction set: AVX512
This provides the most advanced instructions and quite a lot of large
registers to help avoid latencies

- if possible, provide fallback implementations on older SIMD standards (e.g.
AVX2 or SSE2)
This is useful on many older servers and client processors, but due to
their smaller number of registers, latency avoidance or full execution queues
cannot be fully achieved.


IMPLEMENTATION DETAILS

- The loops implementing the algorithm are written in NASM assembler:
NASM is actively maintained, has many output formats, follows the Intel
style, has all current instructions implemented and is fast.

- The loops are mostly independent of the operating system, so all OSes for
which NASM has an object output format are supported:
This includes Linux and Windows as the most important ones

- The algorithms use advanced techniques (constant and temporary registers) to
avoid most unnecessary memory accesses:
The assembly implementation gives you full control over the
registers (unlike intrinsics)

- Multiple dependency chains work interleaved to minimize latencies:
Coding is often interspersed, using almost all available registers.

- Some instructions (moves, zeroing) are executed outside the processor
execution ports:
These don't consume execution cycles on a port, but their latency has to
be considered.

- Some vector instructions (multiply add) have latencies of 5 for example:
This means that after the instruction is issued, the processor has to 
wait 5 cycles until the result can be used in the same dependency chain. To 
avoid this and keep all vector execution ports (p0 and p5) busy you have to 
have 9 other instructions in between doing work on other streams of the 
algorithm to maximize hardware usage and overall performance.

- All loops are implemented as separate C-callable functions (according to the 
OS calling convention):
They are all leaf functions by calling no other subroutines.

- The decision which implementation is chosen is made at the caller side by a
special dispatcher routine:
The caller handles the architectural capabilities (instruction sets
available) and knows the required work: there is often a suitable minimum
amount of work required for efficiently calling a provided implementation.
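
The dispatcher idea can be sketched portably with a function pointer selected
once at startup. The capability check and the minimum-length threshold below
are stand-ins (a real build would use CPUID or compiler builtins such as
__builtin_cpu_supports, and a measured threshold):

```c
#include <stdbool.h>
#include <stddef.h>

/* Two interchangeable implementations with the same signature; the
 * bodies are trivial stand-ins so the dispatch mechanism is runnable. */
static size_t
encode_scalar(const char *src, size_t len, char *dst)
{
	(void) src; (void) dst;
	return len;					/* pretend to encode */
}

static size_t
encode_simd(const char *src, size_t len, char *dst)
{
	(void) src; (void) dst;
	return len;
}

/* Selected implementation; defaults to the scalar fallback. */
static size_t (*encode_impl)(const char *, size_t, char *) = encode_scalar;

/* Stand-in for a real CPUID / __builtin_cpu_supports("avx512bw") test. */
static bool
cpu_has_simd(void)
{
	return false;
}

/* Illustrative minimum work size below which SIMD setup cost is not
 * worth paying; the value is arbitrary here, not measured. */
#define SIMD_MIN_LEN 64

/* Called once at startup: pick the best available implementation. */
void
encode_dispatch_init(void)
{
	if (cpu_has_simd())
		encode_impl = encode_simd;
}

size_t
encode(const char *src, size_t len, char *dst)
{
	/* Small inputs always take the scalar path. */
	if (len < SIMD_MIN_LEN)
		return encode_scalar(src, len, dst);
	return encode_impl(src, len, dst);
}
```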

- Loops should run at least 2-4 times to compensate for initialization
overhead:
This implies a certain minimum work count based on the
specific SIMD implementation

- The loops terminate after detecting an error (e.g. wrong input data) and
return the successfully completed amount of work:
The standard linear implementation takes over with the already
established error handling.

- The loops work optimally with some extra output buffer space at the end to 

Re: Adding CI to our tree

2021-12-31 Thread Alvaro Herrera
On 2021-Dec-30, Andres Freund wrote:

> On 2021-12-30 17:46:52 -0800, Andres Freund wrote:
> > I plan to push this after another cycle of tests passing (and driving
> > over the bay bridge...) unless I hear protests.
> 
> Pushed.
> 
> Marked CF entry as committed.

I tried it using the column filter patch.  It worked on the first
attempt.

Thanks!

-- 
Álvaro Herrera  Valdivia, Chile  —  https://www.EnterpriseDB.com/




Re: Parameter for planner estimate of recursive queries

2021-12-31 Thread Simon Riggs
On Wed, 27 Oct 2021 at 15:58, Simon Riggs  wrote:

> The poor performance is traced to the planner cost estimates for
> recursive queries. Specifically, the cost of the recursive arm of the
> query is evaluated based upon both of these hardcoded assumptions:
>
> 1. The recursion will last for 10 loops
> 2. The average size of the worktable will be 10x the size of the
> initial query (non-recursive term).
>
> Taken together these assumptions lead to a very poor estimate of the
> worktable activity (in this case), which leads to the plan changing as
> a result.
>
> The factor 10 is a reasonably safe assumption and helps avoid worst
> case behavior in bigger graph queries. However, the factor 10 is way
> too large for many types of graph query, such as where the path
> through the data is tight, and/or the query is written to prune bushy
> graphs, e.g. shortest path queries. The factor 10 should not be
> hardcoded in the planner, but should be settable, just as
> cursor_tuple_fraction is.

If you think this should be derived without parameters, then we would
want a function that starts at 1 for 1 input row and gets much larger
for larger input. The thinking here is that Graph OLTP is often a
shortest path between two nodes, whereas with Graph Analytics the
worktable will get much bigger.

So I'm thinking we use

rel->tuples = Max(1, cte_rows * cte_rows/2);
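
A runnable sketch of that estimate (assuming the intended clamp is Max(1, ...),
so the estimate starts at 1 for a single input row and grows quadratically):

```c
/* Illustrative worktable-size estimate for a recursive CTE: clamps to
 * 1 for a single non-recursive input row and grows quadratically, in
 * place of the hardcoded 10x multiplier.  The formula is the one
 * proposed in this thread, not committed PostgreSQL behavior. */
static double
estimate_cte_worktable_rows(double cte_rows)
{
	double		est = cte_rows * cte_rows / 2.0;

	return (est < 1.0) ? 1.0 : est;
}
```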

-- 
Simon Riggshttp://www.EnterpriseDB.com/




Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work

2021-12-31 Thread Bharath Rupireddy
Hi,

Currently the server errors out when it is unable to remove or parse a
logical rewrite file in CheckPointLogicalRewriteHeap, wasting the
work the checkpoint has done and preventing the checkpoint
from finishing. This is unlike what CheckPointSnapBuild does for snapshot
files, i.e. it just emits a message at LOG level and continues if it is
unable to parse or remove the file. Attached is a small patch applying
the same idea to the mapping files.
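
The behavior change can be sketched like this (plain-C stand-ins for ereport
and the file operations; the real patch uses PostgreSQL's error machinery):

```c
#include <stdbool.h>
#include <stdio.h>

/* Trivial stand-in for a file operation that can fail; here, paths
 * starting with 'x' pretend to be unremovable. */
static bool
remove_mapping_file(const char *path)
{
	return path[0] != 'x';
}

/* Before: a single bad file ERRORs out, aborting the whole checkpoint.
 * After: log at LOG level and keep going, as CheckPointSnapBuild
 * already does for snapshot files.  Returns the number skipped. */
static int
checkpoint_mapping_files(const char **paths, int npaths)
{
	int			skipped = 0;

	for (int i = 0; i < npaths; i++)
	{
		if (!remove_mapping_file(paths[i]))
		{
			/* ereport(LOG, ...) in the real code: note the failure and
			 * continue instead of losing the checkpoint's work. */
			fprintf(stderr, "could not remove file \"%s\", continuing\n",
					paths[i]);
			skipped++;
		}
	}
	return skipped;
}
```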

Thoughts?

Regards,
Bharath Rupireddy.


v1-0001-Avoid-erroring-out-when-unable-to-remove-or-parse.patch
Description: Binary data