Re: Probable memory leak with ECPG and AIX
On Wed, Dec 15, 2021 at 04:20:42PM +0100, Benoit Lobréau wrote:
> * with LDR_CNTRL=MAXDATA=0x1000, we reach 256MB but there is no
>   segfault, the program just continues running;
> * with LDR_CNTRL=MAXDATA=0x8000, we reach 2GB and there is no segfault
>   either, the program just continues running.

I get the same results. The leak arises because AIX freelocale() doesn't
free all memory allocated in newlocale(). The following program uses
trivial memory on GNU/Linux, but it leaks like you're seeing on AIX:

#include <locale.h>

int
main(int argc, char **argv)
{
	while (1)
		freelocale(newlocale(LC_NUMERIC_MASK, "C", (locale_t) 0));
	return 0;
}

If you have access to file an AIX bug, I recommend doing so. If we want
PostgreSQL to work around this, one idea is to have ECPG do this
newlocale() less often. For example, do it once per process or once per
connection instead of once per ecpg_do_prologue().
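The "once per process" workaround could be sketched as follows — cache the locale_t the first time it is needed and never free it, so the AIX freelocale() leak cannot accumulate. The helper name here is illustrative only; ECPG's actual code is structured differently:

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <locale.h>
#include <stddef.h>

/*
 * Sketch of the "once per process" idea: create the C locale once and
 * reuse it for every statement. Hypothetical helper, not ECPG code.
 */
static locale_t cached_c_locale = (locale_t) 0;

locale_t
get_c_numeric_locale(void)
{
	if (cached_c_locale == (locale_t) 0)
		cached_c_locale = newlocale(LC_NUMERIC_MASK, "C", (locale_t) 0);
	return cached_c_locale;		/* intentionally never freed */
}
```

With this pattern the process performs at most one newlocale() over its lifetime, so even a leaky freelocale() is never reached.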
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Fri, Dec 31, 2021 at 5:45 PM David Rowley wrote:
> On Fri, 3 Dec 2021 at 20:36, Michael Paquier wrote:
> > Two months later, this has been switched to RwF.
>
> I was discussing this patch with Andres. He's not very keen on my
> densehash hash table idea and suggested that instead of relying on
> trying to make the hash table iteration faster, why don't we just
> ditch the lossiness of the resource owner code which only records the
> first 16 locks, and just make it have a linked list of all locks.
>
> This is a little more along the lines of the original patch, however,
> it does not increase the size of the LOCALLOCK struct.
>
> I've attached a patch which does this. This was mostly written
> (quickly) by Andres, I just did a cleanup, fixed up a few mistakes and
> fixed a few bugs.
>
> I ran the same performance tests on this patch as I did back in [1]:
>
> -- Test 1. Select 1 record from a 140 partitioned table. Tests
> creating a large number of locks with a fast query.
>
> create table hp (a int, b int) partition by hash(a);
> select 'create table hp'||x||' partition of hp for values with
> (modulus 140, remainder ' || x || ');' from generate_series(0,139)x;
> create index on hp (b);
> insert into hp select x,x from generate_series(1, 14) x;
> analyze hp;
>
> select3.sql: select * from hp where b = 1
>
> select3.sql master
>
> drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres
> tps = 2099.708748 (without initial connection time)
> tps = 2100.398516 (without initial connection time)
> tps = 2094.882341 (without initial connection time)
> tps = 2113.218820 (without initial connection time)
> tps = 2104.717597 (without initial connection time)
>
> select3.sql patched
>
> drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres
> tps = 2010.070738 (without initial connection time)
> tps = 1994.963606 (without initial connection time)
> tps = 1994.668849 (without initial connection time)
> tps = 1995.948168 (without initial connection time)
> tps = 1985.650945 (without initial connection time)
>
> You can see that there's a performance regression here. I've not yet
> studied why this appears.
>
> -- Test 2. Tests a prepared query which will perform a generic plan on
> the 6th execution then fall back on a custom plan due to it pruning all
> but one partition. Master suffers from the lock table becoming
> bloated after locking all partitions when planning the generic plan.
>
> create table ht (a int primary key, b int, c int) partition by hash (a);
> select 'create table ht' || x::text || ' partition of ht for values
> with (modulus 8192, remainder ' || (x)::text || ');' from
> generate_series(0,8191) x;
> \gexec
>
> select.sql:
> \set p 1
> select * from ht where a = :p
>
> select.sql master
> drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres
> tps = 18014.460090 (without initial connection time)
> tps = 17973.358889 (without initial connection time)
> tps = 17847.480647 (without initial connection time)
> tps = 18038.332507 (without initial connection time)
> tps = 17776.143206 (without initial connection time)
>
> select.sql patched
> drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres
> tps = 32393.457106 (without initial connection time)
> tps = 32277.204349 (without initial connection time)
> tps = 32160.719830 (without initial connection time)
> tps = 32530.038130 (without initial connection time)
> tps = 32299.019657 (without initial connection time)
>
> You can see that there are some quite good performance gains with this
> test.
>
> I'm going to add this to the January commitfest.
>
> David
>
> [1] https://www.postgresql.org/message-id/CAKJS1f8Lt0kS4bb5EH%3DhV%2BksqBZNnmVa8jujoYBYu5PVhWbZZg%40mail.gmail.com

Hi,

+		locallock->nLocks -= locallockowner->nLocks;
+		Assert(locallock->nLocks >= 0);

I think the assertion is not needed: since the above code is inside the
block guarded by

+		if (locallockowner->nLocks < locallock->nLocks)

the condition locallock->nLocks >= 0 would always hold after the
subtraction.

Cheers
Re: Fix BUG #17335: Duplicate result rows in Gather node
On Fri, 31 Dec 2021 at 00:14, Yura Sokolov wrote:
> Problem:
> - Append path is created with explicitly parallel_aware = true.
> - It has two children; one is trivial, the other is
>   parallel_aware = false. The trivial child is dropped.
> - Gather/GatherMerge path takes the Append path as a child and thinks
>   its child is parallel_aware = true.
> - But the Append path is removed at the end since it has only one child.
> - Now Gather/GatherMerge thinks its child is parallel_aware, but it
>   is not.
> Gather/GatherMerge runs its child twice: in a worker and in the leader,
> and gathers the same rows twice.

Thanks for the report. I can confirm that I can recreate the problem
with your script. I will look into this further later next week.

David
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Fri, 3 Dec 2021 at 20:36, Michael Paquier wrote:
> Two months later, this has been switched to RwF.

I was discussing this patch with Andres. He's not very keen on my
densehash hash table idea and suggested that instead of relying on
trying to make the hash table iteration faster, why don't we just
ditch the lossiness of the resource owner code which only records the
first 16 locks, and just make it have a linked list of all locks.

This is a little more along the lines of the original patch, however,
it does not increase the size of the LOCALLOCK struct.

I've attached a patch which does this. This was mostly written
(quickly) by Andres, I just did a cleanup, fixed up a few mistakes and
fixed a few bugs.

I ran the same performance tests on this patch as I did back in [1]:

-- Test 1. Select 1 record from a 140 partitioned table. Tests
creating a large number of locks with a fast query.

create table hp (a int, b int) partition by hash(a);
select 'create table hp'||x||' partition of hp for values with
(modulus 140, remainder ' || x || ');' from generate_series(0,139)x;
create index on hp (b);
insert into hp select x,x from generate_series(1, 14) x;
analyze hp;

select3.sql: select * from hp where b = 1

select3.sql master

drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2099.708748 (without initial connection time)
tps = 2100.398516 (without initial connection time)
tps = 2094.882341 (without initial connection time)
tps = 2113.218820 (without initial connection time)
tps = 2104.717597 (without initial connection time)

select3.sql patched

drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres
tps = 2010.070738 (without initial connection time)
tps = 1994.963606 (without initial connection time)
tps = 1994.668849 (without initial connection time)
tps = 1995.948168 (without initial connection time)
tps = 1985.650945 (without initial connection time)

You can see that there's a performance regression here. I've not yet
studied why this appears.

-- Test 2. Tests a prepared query which will perform a generic plan on
the 6th execution then fall back on a custom plan due to it pruning all
but one partition. Master suffers from the lock table becoming
bloated after locking all partitions when planning the generic plan.

create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values
with (modulus 8192, remainder ' || (x)::text || ');' from
generate_series(0,8191) x;
\gexec

select.sql:
\set p 1
select * from ht where a = :p

select.sql master
drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 18014.460090 (without initial connection time)
tps = 17973.358889 (without initial connection time)
tps = 17847.480647 (without initial connection time)
tps = 18038.332507 (without initial connection time)
tps = 17776.143206 (without initial connection time)

select.sql patched
drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres
tps = 32393.457106 (without initial connection time)
tps = 32277.204349 (without initial connection time)
tps = 32160.719830 (without initial connection time)
tps = 32530.038130 (without initial connection time)
tps = 32299.019657 (without initial connection time)

You can see that there are some quite good performance gains with this
test.

I'm going to add this to the January commitfest.

David

[1] https://www.postgresql.org/message-id/CAKJS1f8Lt0kS4bb5EH%3DhV%2BksqBZNnmVa8jujoYBYu5PVhWbZZg%40mail.gmail.com

speedup_releasing_all_locks.patch
Description: Binary data
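The core idea described above — replacing the lossy 16-entry array with a per-owner linked list, so that releasing all locks walks exactly the locks the owner took — can be sketched in miniature. The structs and function names below are stand-ins for illustration, not the patch's actual code:

```c
#include <stdlib.h>

/*
 * Illustrative miniature of the "linked list of all locks" idea; the
 * names and structs are hypothetical, not the patch's actual code.
 */
typedef struct LockEntry
{
	int			lock_id;		/* stand-in for a LOCALLOCK reference */
	int			nlocks;			/* times this owner acquired the lock */
	struct LockEntry *next;
} LockEntry;

typedef struct Owner
{
	LockEntry  *locks;			/* head of this owner's lock list */
} Owner;

/* Record an acquisition; repeated acquisitions just bump the count. */
void
owner_remember_lock(Owner *owner, int lock_id)
{
	LockEntry  *e;

	for (e = owner->locks; e != NULL; e = e->next)
	{
		if (e->lock_id == lock_id)
		{
			e->nlocks++;
			return;
		}
	}
	e = malloc(sizeof(LockEntry));
	e->lock_id = lock_id;
	e->nlocks = 1;
	e->next = owner->locks;
	owner->locks = e;
}

/* Release everything: walk exactly the locks this owner holds. */
int
owner_release_all(Owner *owner)
{
	int			released = 0;
	LockEntry  *e = owner->locks;

	while (e != NULL)
	{
		LockEntry  *next = e->next;

		released += e->nlocks;
		free(e);
		e = next;
	}
	owner->locks = NULL;
	return released;
}

/* Tiny self-check: two distinct locks, one taken twice. */
int
owner_demo(void)
{
	Owner		o = {NULL};

	owner_remember_lock(&o, 1);
	owner_remember_lock(&o, 2);
	owner_remember_lock(&o, 1);
	return owner_release_all(&o);
}
```

The point of the list (versus iterating the whole local lock hash table) is that release cost is proportional to the locks actually held by the owner, not to the high-water mark the hash table ever reached.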
slowest tap tests - split or accelerate?
Hi,

cfbot now runs most tests on windows. The windows task is by far the
slowest, and the task limited most in concurrency [2]. Running tap tests
is the biggest part of that. This is a bigger issue on windows because
we don't have infrastructure (yet) to run tests in parallel.

There are a few tests which stand out in their slowness, which seem
worth addressing even if we tackle test parallelism on windows at some
point. I often find them to be the slowest tests on linux too.

Picking a random successful cfbot run [1], I see the following tap tests
taking more than 20 seconds:

67188 ms pg_basebackup   t/010_pg_basebackup.pl
59710 ms recovery        t/001_stream_rep.pl
57542 ms pg_rewind       t/001_basic.pl
56179 ms subscription    t/001_rep_changes.pl
42146 ms pgbench         t/001_pgbench_with_server.pl
38264 ms recovery        t/018_wal_optimize.pl
33642 ms subscription    t/013_partition.pl
29129 ms pg_dump         t/002_pg_dump.pl
25751 ms pg_verifybackup t/002_algorithm.pl
20628 ms subscription    t/011_generated.pl

It would be good if we could make those tests faster, or if that is not
easily possible, at least split them into smaller tap tests. Splitting a
longer test into smaller ones is preferable even if they take the same
total time, because we can use prove-level concurrency on windows to
gain some test parallelism. As a nice side effect, it also makes it
quicker to run a split test in isolation during development.

Greetings,

Andres Freund

[1] https://cirrus-ci.com/task/5207126145499136
[2] https://cirrus-ci.org/faq/#are-there-any-limits
Re: Add header support to text format and matching feature
Here’s an updated version of the patch that takes into account the
changes in d1029bb5a2. The actual code is the same as v10, which was
already marked as ready for committer.

v11-0001-Add-header-support-to-COPY-TO-text-format.patch
Description: Binary data

v11-0002-Add-header-matching-mode-to-COPY-FROM.patch
Description: Binary data

> On 11 Apr 2021, at 16:35, Rémi Lapeyre wrote:
>
>> This now reads "Represents whether the header must be absent, present
>> or match."
>>
>> Since match shouldn't be preceded with be, I think we can say:
>>
>> Represents whether the header must match, be absent or be present.
>
> Thanks, here’s a v10 version of the patch that fixes this.
Re: Column Filtering in Logical Replication
> @@ -5963,8 +5967,20 @@ describePublications(const char *pattern)
> 	{
> 		/* Get the tables for the specified publication */
> 		printfPQExpBuffer(&buf,
> -						  "SELECT n.nspname, c.relname\n"
> -						  "FROM pg_catalog.pg_class c,\n"
> +						  "SELECT n.nspname, c.relname, \n");
> +		if (pset.sversion >= 15)
> +			appendPQExpBufferStr(&buf,
> +								 "  CASE WHEN pr.prattrs IS NOT NULL THEN\n"
> +								 "  pg_catalog.array_to_string"
> +								 "(ARRAY(SELECT attname\n"
> +								 "   FROM pg_catalog.generate_series(0, pg_catalog.array_upper(pr.prattrs::int[], 1)) s,\n"
> +								 "        pg_catalog.pg_attribute\n"
> +								 "  WHERE attrelid = c.oid AND attnum = prattrs[s]), ', ')\n"
> +								 "  ELSE NULL END AS columns");
> +		else
> +			appendPQExpBufferStr(&buf, "NULL as columns");
> +		appendPQExpBuffer(&buf,
> +						  "\nFROM pg_catalog.pg_class c,\n"
> 						  "     pg_catalog.pg_namespace n,\n"
> 						  "     pg_catalog.pg_publication_rel pr\n"
> 						  "WHERE c.relnamespace = n.oid\n"

I suppose this should use pr.prattrs::pg_catalog.int2[]?

Did the DatumGetBool issue expose a deficiency in testing? I think the
!am_partition path was never being hit.

-- 
Justin
Re: Extend compatibility of PostgreSQL::Test::Cluster
On 12/31/21 11:20, Dagfinn Ilmari Mannsåker wrote:
> Andrew Dunstan writes:
>
>> +	my $subclass = __PACKAGE__ . "::V_$maj";
>> +	bless $node, $subclass;
>> +	unless ($node->isa(__PACKAGE__))
>> +	{
>> +		# It's not a subclass, so re-bless back into the main package
>> +		bless($node, __PACKAGE__);
>> +		carp "PostgreSQL::Test::Cluster isn't fully compatible with version $ver";
>> +	}
>
> The ->isa() method works on package names as well as blessed objects, so
> the back-and-forth blessing can be avoided.
>
>     my $subclass = __PACKAGE__ . "::V_$maj";
>     if ($subclass->isa(__PACKAGE__))
>     {
>         bless($node, $subclass);
>     }
>     else
>     {
>         carp "PostgreSQL::Test::Cluster isn't fully compatible with version $ver";
>     }

OK, thanks, will fix in next version.

cheers

andrew

-- 
Andrew Dunstan
EDB: https://www.enterprisedb.com
Re: Extend compatibility of PostgreSQL::Test::Cluster
Andrew Dunstan writes:

> +	my $subclass = __PACKAGE__ . "::V_$maj";
> +	bless $node, $subclass;
> +	unless ($node->isa(__PACKAGE__))
> +	{
> +		# It's not a subclass, so re-bless back into the main package
> +		bless($node, __PACKAGE__);
> +		carp "PostgreSQL::Test::Cluster isn't fully compatible with version $ver";
> +	}

The ->isa() method works on package names as well as blessed objects, so
the back-and-forth blessing can be avoided.

    my $subclass = __PACKAGE__ . "::V_$maj";
    if ($subclass->isa(__PACKAGE__))
    {
        bless($node, $subclass);
    }
    else
    {
        carp "PostgreSQL::Test::Cluster isn't fully compatible with version $ver";
    }

- ilmari
Introducing PgVA aka PostgresVectorAcceleration using SIMD vector instructions starting with hex_encode
INTENTION

Inspired by the effort to integrate JIT for executor acceleration, I
thought selected simple algorithms working with array-oriented data
could be drastically accelerated by using SIMD instructions on modern
hardware. I want to introduce this style of programming with the example
of hex_encode, which:
- operates on arrays (bytea)
- is a simple algorithm
- in some situations partially limits performance (e.g. pg_dump)

IMPLEMENTATION GUIDELINES

The main goal is to accelerate common cases on the most common hardware
by exploiting all the resources the hardware delivers. The following
guidelines took me to a first implementation:

- restrict to 64-bit architectures: these are the dominant server
  architectures and have the necessary data formats, corresponding
  registers and operating instructions
- start with Intel x86-64 SIMD instructions: this is by far the most
  used platform, available for development and in practical use
- don't restrict the concept to Intel x86-64 only, so that later people
  with more experience on other architectures can jump in and implement
  comparable algorithms
- fall back to the established implementation in postgres in
  inappropriate cases or on user request (GUC)
- implement leaf functions/procedures in assembly language: these
  consist mostly of a central loop, without calling subroutines or doing
  additional branching
- code for maximum hardware usage instead of elegant programming: once
  tested, the simple algorithm works as advertised and is used to
  replace most execution parts of the standard implementation in C
- keep an isolated footprint by integrating it only in the specific
  subroutine (here hex_encode): this ensures that the requirements for
  fast execution are met (e.g. buffer sizes) and no repeated checks are
  needed, unlike in a library use case
- try to keep both vector execution ports always doing useful work by
  avoiding waits for latencies
- try to access memory in a linear fashion (reading from the input
  buffer, writing to the output buffer) to avoid internal cache problems
- focus optimization on the most advanced SIMD instruction set, AVX-512:
  this provides the most advanced instructions and quite a lot of large
  registers to aid in latency avoidance
- if possible, provide fallback implementations on older SIMD standards
  (e.g. AVX2 or SSE2): this is useful on many older servers and client
  processors, but due to their smaller number of registers, latency
  avoidance or full execution queues cannot be fully achieved

IMPLEMENTATION DETAILS

- The loops implementing the algorithm are written in NASM assembler:
  NASM is actively maintained, has many output formats, follows the
  Intel style, has all current instructions implemented, and is fast.
- The loops are mostly independent of the operating system, so all OSes
  based on a NASM object output format are supported; this includes
  Linux and Windows as the most important ones.
- The algorithms use advanced techniques (constant and temporary
  registers) to avoid most unnecessary memory accesses: the assembly
  implementation gives you full control over the registers (unlike
  intrinsics).
- Multiple dependency chains work interleaved to minimize latencies:
  coding is often interspersed and uses almost all available registers.
- Some instructions (moves, zeroing) are executed outside the processor
  execution ports: these don't consume execution cycles on a port, but
  their latency has to be considered.
- Some vector instructions (e.g. multiply-add) have latencies of 5:
  this means that after the instruction is issued, the processor has to
  wait 5 cycles until the result can be used in the same dependency
  chain.
  To avoid this and keep all vector execution ports (p0 and p5) busy,
  you have to have 9 other instructions in between, doing work on other
  streams of the algorithm, to maximize hardware usage and overall
  performance.
- All loops are implemented as separate C-callable functions (according
  to the OS calling convention): they are all leaf functions, calling no
  other subroutines.
- The decision which implementation is chosen is made at the caller side
  by a special dispatcher routine: the caller handles the architectural
  capabilities (instruction sets available) and knows the required work.
  There is often a suitable minimum amount of work required for
  efficiently calling a provided implementation.
- Loops should run at least 2-4 times to compensate for initialization
  overhead: this implies a certain minimum work count, based on the
  specific SIMD implementation.
- The loops terminate after detecting an error (e.g. wrong input data)
  and return the successfully completed amount of work: the standard
  linear implementation takes over, with its already established error
  handling.
- The loops work optimally with some extra output buffer space at the
  end to
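The caller-side dispatcher described above can be sketched in plain C: a scalar hex_encode loop (equivalent in spirit to the existing linear implementation) plus a dispatch stub that would hand large-enough inputs to a SIMD routine. The function names and the feature check are placeholders, not the actual proposal's interface:

```c
#include <stddef.h>

static const char hextbl[] = "0123456789abcdef";

/* Scalar baseline: 2 output bytes per input byte. */
static size_t
hex_encode_scalar(const unsigned char *src, size_t len, char *dst)
{
	size_t		i;

	for (i = 0; i < len; i++)
	{
		*dst++ = hextbl[src[i] >> 4];
		*dst++ = hextbl[src[i] & 0x0F];
	}
	return len * 2;
}

/* Placeholder for a real CPUID-based check (e.g. AVX-512 support). */
static int
simd_hex_available(void)
{
	return 0;					/* assume unavailable in this sketch */
}

/*
 * Caller-side dispatcher: use the SIMD loop only when the instruction
 * set is present and the input is large enough to amortize its
 * initialization overhead; otherwise fall back to the scalar loop.
 */
size_t
hex_encode(const unsigned char *src, size_t len, char *dst)
{
	if (simd_hex_available() && len >= 64)
	{
		/* would call the NASM-implemented leaf function here */
	}
	return hex_encode_scalar(src, len, dst);
}

/* Tiny self-check: {0x0f, 0xa0} must encode to "0fa0". */
int
hex_encode_check(void)
{
	static const unsigned char in[2] = {0x0f, 0xa0};
	char		buf[4];

	if (hex_encode(in, 2, buf) != 4)
		return 0;
	return buf[0] == '0' && buf[1] == 'f' && buf[2] == 'a' && buf[3] == '0';
}
```

Keeping the dispatch at the caller, as the post argues, lets the minimum-work threshold (here the illustrative `len >= 64`) be tuned per call site rather than rechecked inside a library routine.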
Re: Adding CI to our tree
On 2021-Dec-30, Andres Freund wrote:
> On 2021-12-30 17:46:52 -0800, Andres Freund wrote:
> > I plan to push this after another cycle of tests passing (and driving
> > over the bay bridge...) unless I hear protests.
>
> Pushed.
>
> Marked CF entry as committed.

I tried it using the column filter patch. It worked on the first
attempt. Thanks!

-- 
Álvaro Herrera — Valdivia, Chile — https://www.EnterpriseDB.com/
Re: Parameter for planner estimate of recursive queries
On Wed, 27 Oct 2021 at 15:58, Simon Riggs wrote:
> The poor performance is traced to the planner cost estimates for
> recursive queries. Specifically, the cost of the recursive arm of the
> query is evaluated based upon both of these hardcoded assumptions:
>
> 1. The recursion will last for 10 loops
> 2. The average size of the worktable will be 10x the size of the
>    initial query (non-recursive term).
>
> Taken together these assumptions lead to a very poor estimate of the
> worktable activity (in this case), which leads to the plan changing as
> a result.
>
> The factor 10 is a reasonably safe assumption and helps avoid worst
> case behavior in bigger graph queries. However, the factor 10 is way
> too large for many types of graph query, such as where the path
> through the data is tight, and/or the query is written to prune bushy
> graphs, e.g. shortest path queries. The factor 10 should not be
> hardcoded in the planner, but should be settable, just as
> cursor_tuple_fraction is.

If you think this should be derived without parameters, then we would
want a function that starts at 1 for 1 input row and gets much larger
for larger input. The thinking here is that Graph OLTP is often a
shortest path between two nodes, whereas for Graph Analytics the
worktable will get much bigger.

So I'm thinking we use

rel->tuples = min(1, cte_rows * cte_rows/2);

-- 
Simon Riggs
http://www.EnterpriseDB.com/
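The derivation suggested above would look something like the following. Note that the email's "min" is presumably meant to be "Max": min(1, x) would cap the estimate at 1 and could never "get much larger for larger input" as described. The function name is illustrative only, not the actual planner code:

```c
/*
 * Illustrative sketch of the suggested worktable size estimate: 1 input
 * row yields an estimate of 1, and the estimate grows quadratically with
 * the size of the non-recursive term. NOT the actual planner code; the
 * email's "min" is assumed here to mean "Max".
 */
double
worktable_rows_estimate(double cte_rows)
{
	double		est = cte_rows * cte_rows / 2.0;

	return (est < 1.0) ? 1.0 : est;		/* i.e. Max(1, cte_rows^2 / 2) */
}
```

For example, a 1-row non-recursive term gives an estimate of 1 (the shortest-path OLTP case), while a 10-row term gives 50, approximating the bigger worktables of analytic traversals.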
Avoid erroring out when unable to remove or parse logical rewrite files to save checkpoint work
Hi,

Currently the server errors out when it is unable to remove or parse a
logical rewrite mapping file in CheckPointLogicalRewriteHeap(), wasting
the work the checkpoint has already done and preventing the checkpoint
from finishing. This is unlike what CheckPointSnapBuild() does for
snapshot files: it just emits a message at LOG level and continues if it
is unable to parse or remove a file.

Attaching a small patch applying the same idea to the mapping files.

Thoughts?

Regards,
Bharath Rupireddy.

v1-0001-Avoid-erroring-out-when-unable-to-remove-or-parse.patch
Description: Binary data
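The behavioural change being proposed — log and keep going rather than abort the whole checkpoint — amounts to the following pattern, sketched here with plain libc calls and a hypothetical helper name (the patch itself would use PostgreSQL's ereport machinery, e.g. LOG instead of ERROR):

```c
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>

/*
 * Try to remove one mapping file. On failure, log and report it instead
 * of aborting, so the caller's checkpoint loop can continue with the
 * remaining files. Hypothetical sketch, not the patch code.
 */
int
remove_mapping_file(const char *path)
{
	if (unlink(path) != 0)
	{
		fprintf(stderr, "LOG: could not remove file \"%s\": %s\n",
				path, strerror(errno));
		return 0;				/* keep going, as CheckPointSnapBuild does */
	}
	return 1;
}
```

The caller would simply tally or ignore the return value while iterating over the mapping directory, so one bad file no longer discards the checkpoint's accumulated work.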