Re: [HACKERS] Sum aggregate calculation for single precision real
On 13.02.2017 19:20, Tom Lane wrote:
> Konstantin Knizhnik <k.knizh...@postgrespro.ru> writes:
>> I wonder why the SUM aggregate is calculated for the real (float4) type using a floating point accumulator?
> If you can't deal with the vagaries of floating-point arithmetic, you shouldn't be storing your data in float format. Use numeric.

4-byte floats are widely used, for example, in trading applications, simply because they are half the size of double and the range of the stored data is relatively small (not many significant digits are needed). At the same time the volume of stored data is very large, and switching from float4 to float8 would almost double it. It requires twice as much storage and almost twice the query execution time. So this is not an acceptable answer.

>> Are there any reasons for using the float4pl function for the SUM aggregate instead of float4_accum?
> The latter is probably a good two orders of magnitude slower, and it wouldn't really do much to solve the inherent accuracy problems of adding float4 values that have a wide dynamic range.

That is not true - please notice the execution times of these two queries:

postgres=# select sum(l_quantity) from lineitem where l_shipdate <= '1998-12-01';
     sum
-------------
 1.52688e+09
(1 row)

Time: 2858.852 ms

postgres=# select sum(l_quantity+0.0) from lineitem where l_shipdate <= '1998-12-01';
    sum
------------
 1529738036
(1 row)

Time: 3174.529 ms

It looks like the aggregate calculation itself is no longer a bottleneck in Postgres, compared with the tuple deform cost.

> The expectation for SUM(float4) is that you want speed and are prepared to cope with the consequences. It's easy enough to cast your input to float8 if you want a wider accumulator, or to numeric if you'd like more stable (not necessarily more accurate :-() results. I do not think it's the database's job to make those choices for you.
>
> regards, tom lane

From my point of view this is a strange and wrong expectation. I chose the "float4" type for a column just because it is enough to represent the range of data I have and I need to minimize the record size. But when I calculate a sum, I expect to receive a more or less precise result. Certainly I realize that even with double it is possible to lose precision during the calculation, and the result may depend on summation order (if we add very small and very large values). But in real use cases (for example trading data) such large differences in attribute values are very rare. If you have, for example, a stock price, then it is very unlikely that one company has the value 0.01 and another 1000.0. At least on the TPC-H example (which certainly deals with dummy generated data), the double type produces an almost precise result.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
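For what it's worth, Tom's cast suggestion can also be packaged as a custom aggregate, so queries don't need the explicit cast. A minimal sketch using the existing float84pl transition function (sum_f8 is an invented name; note that with the non-null INITCOND it returns 0 rather than NULL for empty input, unlike the built-in SUM):

CREATE AGGREGATE sum_f8 (float4) (
    SFUNC = float84pl,   -- float8 + float4, keeps the accumulator in float8
    STYPE = float8,
    INITCOND = '0'
);

-- usage:
SELECT sum_f8(l_quantity) FROM lineitem WHERE l_shipdate <= '1998-12-01';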
Re: [HACKERS] [WIP]Vertical Clustered Index (columnar store extension)
Hi, I wonder if it is possible to somehow benchmark your clustered index implementation. I tried to create a VCI index for the lineitem table from TPC-H and run query Q6. After index creation Postgres no longer uses a parallel execution plan, but the speed of the sequential plan is unchanged and nothing in the query execution plan indicates that the VCI index is used:

postgres=# explain select sum(l_extendedprice*l_discount) as revenue from lineitem_projection where l_shipdate between '1996-01-01' and '1997-01-01' and l_discount between 0.08 and 0.1 and l_quantity < 24;
                                               QUERY PLAN
------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=608333.85..608333.86 rows=1 width=4)
   ->  Gather  (cost=608333.23..608333.84 rows=6 width=4)
         Workers Planned: 6
         ->  Partial Aggregate  (cost=607333.23..607333.24 rows=1 width=4)
               ->  Parallel Seq Scan on lineitem_projection  (cost=0.00..607024.83 rows=61680 width=8)
                     Filter: ((l_shipdate >= '1996-01-01'::date) AND (l_shipdate <= '1997-01-01'::date) AND (l_discount >= '0.08'::double precision) AND (l_discount <= '0.1'::double precision) AND (l_quantity < '24'::double precision))
(6 rows)

postgres=# select sum(l_extendedprice*l_discount) as revenue from lineitem_projection where l_shipdate between '1996-01-01' and '1997-01-01' and l_discount between 0.08 and 0.1 and l_quantity < 24;
 revenue
---------
 6.2e+08
(1 row)

Time: 1171.324 ms (00:01.171)

postgres=# create index vci_idx on lineitem_projection using vci(l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus);
CREATE INDEX
Time: 4.705 ms

postgres=# explain select * from lineitem_projection where l_shipdate between '1996-01-01' and '1997-01-01' and l_discount between 0.08 and 0.1 and l_quantity < 24;
                                               QUERY PLAN
------------------------------------------------------------------------------------------------------
 Seq Scan on lineitem_projection  (cost=0.00..382077.00 rows=1 width=22)
   Filter: ((l_shipdate >= '1996-01-01'::date) AND (l_shipdate <= '1997-01-01'::date) AND (l_discount >= '0.08'::double precision) AND (l_discount <= '0.1'::double precision) AND (l_quantity < '24'::double precision))
(2 rows)

postgres=# select sum(l_extendedprice*l_discount) as revenue from lineitem_projection where l_shipdate between '1996-01-01' and '1997-01-01' and l_discount between 0.08 and 0.1 and l_quantity < 24;
  revenue
------------
 6.2112e+08
(1 row)

Time: 4304.355 ms (00:04.304)

I wonder if there is any query which can demonstrate the advantages of using a VCI index?

On 06.02.2017 04:26, Haribabu Kommi wrote:
On Fri, Feb 3, 2017 at 8:28 PM, Konstantin Knizhnik <k.knizh...@postgrespro.ru <mailto:k.knizh...@postgrespro.ru>> wrote:
On 30.12.2016 06:55, Haribabu Kommi wrote:
Hi All, Fujitsu was interested in developing a columnar storage extension with minimal changes to the server backend.

We in PostgresPro are also very interested in developing a vertical storage (VS) for Postgres. And after considering many alternatives, we came to the conclusion that the approach based on representing a columnar store as an access method (index) is the most promising one. It allows us to:
1. Implement VS as an extension without affecting the Postgres core.
2. Have both ROS and WOS.
3. Create multiple projections (as in Vertica).
4. Optimize insert speed by supporting batch inserts and using a flexible recovery model for VS.

So it is very similar to your approach. But there are a few differences:
1. Our intention is to completely eliminate changes in the Postgres core.

You wrote:
Yes, it is a mix of both index and table access methods. The current design of the Vertical Clustered Index needs both access methods, and because of this we used both access methods.
But I still do not completely understand why it is not possible to use VS in index-only scans with the standard Postgres executor and without any changes. Why is it not possible to rely on the standard rules for applying indexes in the Postgres optimizer, based on the costs provided by our AM implementation?

In our storage design, we used a TID-CRID map to identify a record in heap to columnar storage. Because of HOT updates, the new data will not be inserted into indexes, but this would give a problem to the columnar storage, so we added a hook to insert index data even if the update is HOT. And also we added another hook for initializing the
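One quick sanity check for the benchmark quoted above (a sketch; enable_seqscan is the standard planner GUC): disable sequential scans and see whether the planner picks the vci index at all:

SET enable_seqscan = off;
EXPLAIN SELECT sum(l_extendedprice*l_discount) AS revenue
FROM lineitem_projection
WHERE l_shipdate BETWEEN '1996-01-01' AND '1997-01-01'
  AND l_discount BETWEEN 0.08 AND 0.1
  AND l_quantity < 24;

If the plan still shows a (now heavily penalized) Seq Scan, the access method is not offering any usable path for this query.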
[HACKERS] VOPS: vectorized executor for Postgres: how to speed up OLAP queries more than 10 times without changing anything in the Postgres executor
Hello hackers,

There have been many discussions concerning possible ways of speeding up Postgres. Different approaches were suggested:
- JIT (now we have three different prototype implementations based on LLVM)
- Chunked (vectorized) executor
- Replacing pull with push
- Columnar store (cstore_fdw, IMCS)
- Optimizing and improving the current executor (reducing tuple deform overhead, function call overhead, ...)

Obviously the best result can be achieved by combining all these approaches. But actually they are more or less interchangeable: vectorized execution does not eliminate interpretation overhead, but that overhead is divided by the vector size and becomes less critical.

I decided to write a small prototype to estimate the possible speed improvement of a vectorized executor. I created special types representing a "tile" and implemented the standard SQL operators for them. So neither the Postgres planner, nor the Postgres executor, nor the Postgres heap manager is changed. But I was able to reach a more than 10 times speed improvement on TPC-H Q1/Q6 queries!

Please find more information here: https://cdn.rawgit.com/postgrespro/vops/ddcbfbe6/vops.html
The sources of the project can be found here: https://github.com/postgrespro/vops.git

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
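To give a flavor of the approach, here is roughly what using the extension looks like, based on the linked documentation (a sketch: the vops_* tile types and the populate() helper are taken from the project README and may differ in detail):

CREATE EXTENSION vops;

-- tile-based mirror of a few lineitem columns:
CREATE TABLE vops_lineitem (
    l_shipdate      vops_date,
    l_quantity      vops_float4,
    l_extendedprice vops_float4,
    l_discount      vops_float4
);

-- copy the data from the ordinary row-oriented table into tiles:
SELECT populate(destination := 'vops_lineitem'::regclass, source := 'lineitem'::regclass);

-- aggregates now process a whole tile per operator invocation:
SELECT sum(l_quantity) FROM vops_lineitem;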
[HACKERS] Sum aggregate calculation for single precision real
Hi hackers,

I wonder why the SUM aggregate is calculated for the real (float4) type using a floating point accumulator? It causes very confusing and unexpected behavior:

postgres=# select sum(l_quantity) from lineitem where l_shipdate <= '1998-12-01';
     sum
-------------
 1.52688e+09
(1 row)

postgres=# select sum(l_quantity+0.0) from lineitem where l_shipdate <= '1998-12-01';
    sum
------------
 1529738036

Is it specified in any SQL standard how aggregates should be calculated? At least Oracle and MS-SQL calculate SUM for the single precision type in a different (and more natural) way. Are there any reasons for using the float4pl function for the SUM aggregate instead of float4_accum?

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
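The effect is easy to reproduce from scratch (a sketch; the exact amount of drift depends on the data):

CREATE TABLE t(x float4);
INSERT INTO t SELECT 1.1 FROM generate_series(1, 10000000);
-- the float4 accumulator loses low-order digits once the running sum
-- grows large; casting to float8 keeps them:
SELECT sum(x) AS float4_sum, sum(x::float8) AS float8_sum FROM t;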
Re: [HACKERS] Active zombies at AIX
I tried to rebuild Postgres without mmap and the problem disappears (pgbench with -C doesn't exhaust the connection limit any more). Unfortunately there is no convenient way to configure PostgreSQL not to use mmap. I had to edit the sysv_shmem.c file, commenting out

#ifndef EXEC_BACKEND
#define USE_ANONYMOUS_SHMEM
#endif

I wonder why we now prohibit configuring Postgres without mmap?

On 06.02.2017 12:47, Konstantin Knizhnik wrote:
Last update on the problem. Using the kdb tool (thanks to Tony Reix for advice and help) we got the following stack trace of an exiting Postgres backend:

pvthread+073000 STACK:
[005E1958]slock+000578 (005E1958, 80001032 [??])
[9558].simple_lock+58 ()
[00651DBC]vm_relalias+00019C (??, ??, ??, ??, ??)
[006544AC]vm_map_entry_delete+00074C (??, ??, ??)
[00659C30]vm_map_delete+000150 (??, ??, ??, ??)
[00659D88]vm_map_deallocate+48 (??, ??)
[0011C588]kexitx+001408 (??)
[000BB08C]kexit+8C ()
___ Recovery (FFF9290) ___
WARNING: Eyecatcher/version mismatch in RWA

So there seems to be lock contention while unmapping memory segments. My assumption was that Postgres detaches all attached segments before exit (in the shmem_exit callback or earlier). I added logging around the proc_exit_prepare function (which is called by the atexit callback) and checked that it completes immediately. So I thought that this vm_map_deallocate could be related to deallocation of normal (malloced) memory, because on Linux the memory allocator may use mmap. But on AIX this is not true. Below is the report of Bergamini Demien (once again, many thanks for help with investigating the problem):

The memory allocator in AIX libc does not use mmap and vm_relalias() is only called for shared memory mappings. I talked with the AIX VMM expert at IBM and he said that what you hit is one of the most common performance bottlenecks in AIX memory management. He also said that SysV Shared Memory (shmget/shmat) performs better on AIX than mmap. Some improvements have been made in AIX 6.1 (see "perf suffers when procs sharing the same segs all exit at once": http://www-01.ibm.com/support/docview.wss?uid=isg1IZ83819) but it does not help in your case. In src/backend/port/sysv_shmem.c, it says that PostgreSQL 9.3 switched from using SysV Shared Memory to using mmap. Maybe you could try to switch back to using SysV Shared Memory on AIX to see if it helps performance-wise. Also, the good news is that there are some restricted tunables in AIX that can be tweaked to help different workloads which may have different demands. One of them is relalias_percentage, which works with force_relalias_lite:

# vmo -h relalias_percentage
Help for tunable relalias_percentage:
Purpose: If force_relalias_lite is set to 0, then this specifies the factor used in the heuristic to decide whether to avoid locking the source mmapped segment or not.
Values: Default: 0; Range: 0 - 32767; Type: Dynamic
Tuning: This is used when tearing down an mmapped region and is a scalability statement, where avoiding the lock may help system throughput, but, in some cases, at the cost of more compute time used. If the number of pages being unmapped is less than this value divided by 100 and multiplied by the total number of pages in memory in the source mmapped segment, then the source lock will be avoided. A value of 0 for relalias_percentage, with force_relalias_lite also set to 0, will cause the source segment lock to always be taken. Effective values for relalias_percentage will vary by workload; however, a suggested value is: 200.
You may also try to play with the munmap_npages vmo tunable. Your vmo settings for lgpg_size, lgpg_regions and v_pinshm already seem correct.

On 24.01.2017 18:08, Konstantin Knizhnik wrote:
Hi hackers,

Yet another story about AIX. For some reason AIX cleans up zombie processes very slowly. If we launch pgbench with the -C parameter then the limit on the maximal number of connections is exhausted very soon. If the maximal number of connections is set to 1000, then after ten seconds of pgbench activity we get about 900 zombie processes and it takes about 100 seconds (!) before all of them are terminated. proctree shows a lot of defunct processes:

[14:44:41]root@postgres:~ # proctree 26084446
26084446 /opt/postgresql/xlc/9.6/bin/postgres -D /postg_fs/postgresql/xlc
4784362
4980786
11403448
11468930
11993176
12189710
12517390
13238374
13565974
13893826 postgres: wal writer process
14024716
15401000
...
25691556

But ps shows that the status of the process is

[14:46:02]root@postgres:~ # ps -elk | grep 25691556
* A - 25691556 - - - - -

A breakpoint set in the reaper() function in the postmaster shows that each invocation of this function (called by the SIGCHLD handler) processes 5-10 PIDs per invocation. So there are two hypotheses: either AIX is delivering SIGCHLD to the parent very slowly, or exit
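For completeness: the restricted tunables quoted above would presumably be adjusted with something like "vmo -o force_relalias_lite=0 -o relalias_percentage=200" (the -o form is the usual way to change vmo tunables; we have not verified the effect here yet).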
Re: [HACKERS] Active zombies at AIX
Last update on the problem. Using the kdb tool (thanks to Tony Reix for advice and help) we got the following stack trace of an exiting Postgres backend:

pvthread+073000 STACK:
[005E1958]slock+000578 (005E1958, 80001032 [??])
[9558].simple_lock+58 ()
[00651DBC]vm_relalias+00019C (??, ??, ??, ??, ??)
[006544AC]vm_map_entry_delete+00074C (??, ??, ??)
[00659C30]vm_map_delete+000150 (??, ??, ??, ??)
[00659D88]vm_map_deallocate+48 (??, ??)
[0011C588]kexitx+001408 (??)
[000BB08C]kexit+8C ()
___ Recovery (FFF9290) ___
WARNING: Eyecatcher/version mismatch in RWA

So there seems to be lock contention while unmapping memory segments. My assumption was that Postgres detaches all attached segments before exit (in the shmem_exit callback or earlier). I added logging around the proc_exit_prepare function (which is called by the atexit callback) and checked that it completes immediately. So I thought that this vm_map_deallocate could be related to deallocation of normal (malloced) memory, because on Linux the memory allocator may use mmap. But on AIX this is not true. Below is the report of Bergamini Demien (once again, many thanks for help with investigating the problem):

The memory allocator in AIX libc does not use mmap and vm_relalias() is only called for shared memory mappings. I talked with the AIX VMM expert at IBM and he said that what you hit is one of the most common performance bottlenecks in AIX memory management. He also said that SysV Shared Memory (shmget/shmat) performs better on AIX than mmap. Some improvements have been made in AIX 6.1 (see "perf suffers when procs sharing the same segs all exit at once": http://www-01.ibm.com/support/docview.wss?uid=isg1IZ83819) but it does not help in your case. In src/backend/port/sysv_shmem.c, it says that PostgreSQL 9.3 switched from using SysV Shared Memory to using mmap. Maybe you could try to switch back to using SysV Shared Memory on AIX to see if it helps performance-wise. Also, the good news is that there are some restricted tunables in AIX that can be tweaked to help different workloads which may have different demands. One of them is relalias_percentage, which works with force_relalias_lite:

# vmo -h relalias_percentage
Help for tunable relalias_percentage:
Purpose: If force_relalias_lite is set to 0, then this specifies the factor used in the heuristic to decide whether to avoid locking the source mmapped segment or not.
Values: Default: 0; Range: 0 - 32767; Type: Dynamic
Tuning: This is used when tearing down an mmapped region and is a scalability statement, where avoiding the lock may help system throughput, but, in some cases, at the cost of more compute time used. If the number of pages being unmapped is less than this value divided by 100 and multiplied by the total number of pages in memory in the source mmapped segment, then the source lock will be avoided. A value of 0 for relalias_percentage, with force_relalias_lite also set to 0, will cause the source segment lock to always be taken. Effective values for relalias_percentage will vary by workload; however, a suggested value is: 200.

You may also try to play with the munmap_npages vmo tunable. Your vmo settings for lgpg_size, lgpg_regions and v_pinshm already seem correct.

On 24.01.2017 18:08, Konstantin Knizhnik wrote:
Hi hackers,

Yet another story about AIX. For some reason AIX cleans up zombie processes very slowly. If we launch pgbench with the -C parameter then the limit on the maximal number of connections is exhausted very soon.
If the maximal number of connections is set to 1000, then after ten seconds of pgbench activity we get about 900 zombie processes and it takes about 100 seconds (!) before all of them are terminated. proctree shows a lot of defunct processes:

[14:44:41]root@postgres:~ # proctree 26084446
26084446 /opt/postgresql/xlc/9.6/bin/postgres -D /postg_fs/postgresql/xlc
4784362
4980786
11403448
11468930
11993176
12189710
12517390
13238374
13565974
13893826 postgres: wal writer process
14024716
15401000
...
25691556

But ps shows that the status of the process is

[14:46:02]root@postgres:~ # ps -elk | grep 25691556
* A - 25691556 - - - - -

A breakpoint set in the reaper() function in the postmaster shows that each invocation of this function (called by the SIGCHLD handler) processes 5-10 PIDs per invocation. So there are two hypotheses: either AIX is delivering SIGCHLD to the parent very slowly, or process exit takes too much time. The fact that the backends are in exiting state makes the second hypothesis more plausible. We have tried different Postgres configurations with local and TCP sockets, with different amounts of shared buffers, and built both with gcc and xlc. In all cases the behavior is similar: zombies do not want to die. Since it is not possible to attach a debugger to a defunct process, it is not clear how to understand what's going on. I
Re: [HACKERS] logical decoding of two-phase transactions
On 02/04/2017 03:08 AM, Andres Freund wrote:
> On 2017-02-03 18:47:23 -0500, Robert Haas wrote:
>> On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund <and...@anarazel.de> wrote:
>>> I still haven't seen a credible model for being able to apply a stream
>>> of interleaved transactions that can roll back individually; I think we
>>> really need the ability to have multiple transactions alive in one
>>> backend for that.
>> Hmm, yeah, that's a problem. That smells like autonomous transactions.
> Unfortunately the last few proposals, like spawning backends, to deal
> with autonomous xacts aren't really suitable for replication, unless you
> only have very large ones. And it really needs to be an implementation
> where ATs can freely be switched inbetween. On the other hand, a good
> deal of problems (like locking) shouldn't be an issue, since there's
> obviously a possible execution schedule.
>
> I suspect this'd need some low-level implementation close to xact.c that'd
> allow switching between transactions.

Let me add my two cents here:

1. We are using logical decoding in our multimaster and applying transactions concurrently with a pool of workers. Unlike asynchronous replication, in multimaster we need to perform voting for each transaction commit, so if transactions are applied by a single worker, then performance will be awful and, moreover, there is a big chance to get a "deadlock" where none of the workers can complete voting because different nodes are voting on different transactions. I cannot say that there are no problems with this approach. There are definitely a lot of challenges. First of all we need to use a special DTM (distributed transaction manager) to provide consistent application of transactions at different nodes. The second problem is once again related to the kind of "deadlock" explained above. Even if we apply transactions concurrently, it is still possible to get such a deadlock if we do not have enough workers. This is why we allow launching extra workers dynamically (though ultimately this is limited by the maximal number of configured bgworkers). But in any case, I think that "parallel apply" is a must-have mode for logical replication.

2. We have implemented autonomous transactions in PgPro EE. Unlike the proposal currently present in the commitfest, we execute an autonomous transaction within the same backend. So we are just storing and restoring the transaction context. Unfortunately this is also not a cheap operation. An autonomous transaction should not see any changes made by the parent transaction (because the parent can be rolled back after commit of the autonomous transaction). But there are catalog and relation caches inside the backend, so we have to clean these caches before switching to the ATX. It is quite an expensive operation, and so the speed of execution of a PL/pgSQL function with an autonomous transaction is several orders of magnitude slower than without it. So autonomous transactions can be used for auditing (it's the primary goal of using ATX in Oracle PL/SQL applications), but this mechanism is not efficient for concurrent execution of multiple transactions in one backend.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] [WIP]Vertical Clustered Index (columnar store extension)
, during the select operation, the data in WOS is converted into Local ROS for the statement to be executed. The conversion cost depends upon the number of tuples present in the WOS file. This may add some performance overhead for select statements. The life of the Local ROS is till the end of the query context.

Delete: During the delete operation, whenever data is deleted in heap, the corresponding data in the WOS file is marked as deleted at the same time, similar to heap. But in case the data has already migrated from WOS to ROS, we maintain a delete vector storing the tuple id, transaction information, etc. During the data read from the ROS file, it is checked against the delete vector to confirm whether the record is visible. All the delete vector data is applied to ROS periodically.

More details of the internal relations and their usage are available in the README. It still needs more updates to explain the full details of the columnar index design. The concept of the Vertical Clustered Index columnar extension is from Fujitsu Labs, Japan.

Following is the brief schedule of patches that are required for a better performing columnar store:
1. Minimal server changes (new relkind "CSTORE" option)
2. Base storage patch
3. Support for moving data from WOS to ROS
4. Local ROS support
5. Custom scan support to read the data from ROS and Local ROS
6. Background worker support for data movement
7. Expression state support in VCI
8. Aggregation support in VCI
9. pg_dump support for the new type of relations
10. psql \d command support for CSTORE relations
11. Parallelism support
12. Compression support
13. In-memory support with dynamic shared memory

Currently I attached only the patches for 1 and 2. These provide the basic changes that are required in the PostgreSQL core for the extension to work. I have to rebase/rewrite the rest of the patches onto the latest master and share them with the community. Any comments on the approach?

Regards,
Hari Babu
Fujitsu Australia

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Deadlock in XLogInsert at AIX
Hi Tony,

On 02.02.2017 17:10, REIX, Tony wrote:
Hi Konstantin
I've discussed the "zombie/exit" issue with our expert here.
- He does not think that AIX has anything special here.
- If the process is marked in ps, this is because the flag SEXIT is set, thus the process is blocked somewhere in the kexitx() syscall, waiting for something.
- In order to know what it is waiting for, the best would be to have a look with *kdb*.

kdb shows the following stack:

pvthread+073000 STACK:
[005E1958]slock+000578 (005E1958, 80001032 [??])
[9558].simple_lock+58 ()
[00651DBC]vm_relalias+00019C (??, ??, ??, ??, ??)
[006544AC]vm_map_entry_delete+00074C (??, ??, ??)
[00659C30]vm_map_delete+000150 (??, ??, ??, ??)
[00659D88]vm_map_deallocate+48 (??, ??)
[0011C588]kexitx+001408 (??)
[000BB08C]kexit+8C ()
___ Recovery (FFF9290) ___
WARNING: Eyecatcher/version mismatch in RWA

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Deadlock in XLogInsert at AIX
-rw-------  1 pgstbf  staff       0 Feb  2 09:09 3256
-rw-------  1 pgstbf  staff       0 Feb  2 09:09 3256_vm
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3257
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3258
-rw-------  1 pgstbf  staff   16384 Feb  2 09:09 3394
-rw-------  1 pgstbf  staff   24576 Feb  2 09:09 3394_fsm
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3394_vm
-rw-------  1 pgstbf  staff   16384 Feb  2 09:09 3395
-rw-------  1 pgstbf  staff   32768 Feb  2 09:14 3455
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3456
-rw-------  1 pgstbf  staff   24576 Feb  2 09:09 3456_fsm
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3456_vm
-rw-------  1 pgstbf  staff       0 Feb  2 09:09 3466
-rw-------  1 pgstbf  staff       0 Feb  2 09:09 3466_vm
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3467
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3468
-rw-------  1 pgstbf  staff       0 Feb  2 09:09 3501
-rw-------  1 pgstbf  staff       0 Feb  2 09:09 3501_vm
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3502
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3503
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3534
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3541
-rw-------  1 pgstbf  staff   24576 Feb  2 09:09 3541_fsm
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3541_vm
-rw-------  1 pgstbf  staff   16384 Feb  2 09:09 3542
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3574
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3575
-rw-------  1 pgstbf  staff       0 Feb  2 09:09 3576
-rw-------  1 pgstbf  staff       0 Feb  2 09:09 3576_vm
-rw-------  1 pgstbf  staff       0 Feb  2 09:09 3596
-rw-------  1 pgstbf  staff       0 Feb  2 09:09 3596_vm
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3597
-rw-------  1 pgstbf  staff       0 Feb  2 09:09 3598
-rw-------  1 pgstbf  staff       0 Feb  2 09:09 3598_vm
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3599
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3600
-rw-------  1 pgstbf  staff   24576 Feb  2 09:09 3600_fsm
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3600_vm
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3601
-rw-------  1 pgstbf  staff   24576 Feb  2 09:09 3601_fsm
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3601_vm
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3602
-rw-------  1 pgstbf  staff   24576 Feb  2 09:09 3602_fsm
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3602_vm
-rw-------  1 pgstbf  staff   16384 Feb  2 09:09 3603
-rw-------  1 pgstbf  staff   24576 Feb  2 09:09 3603_fsm
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3603_vm
-rw-------  1 pgstbf  staff   16384 Feb  2 09:09 3604
-rw-------  1 pgstbf  staff   16384 Feb  2 09:09 3605
-rw-------  1 pgstbf  staff   16384 Feb  2 09:09 3606
-rw-------  1 pgstbf  staff   16384 Feb  2 09:09 3607
-rw-------  1 pgstbf  staff   16384 Feb  2 09:09 3608
-rw-------  1 pgstbf  staff   32768 Feb  2 09:09 3609
-rw-------  1 pgstbf  staff   16384 Feb  2 09:09 3712
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3764
-rw-------  1 pgstbf  staff   24576 Feb  2 09:09 3764_fsm
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 3764_vm
-rw-------  1 pgstbf  staff   16384 Feb  2 09:09 3766
-rw-------  1 pgstbf  staff   16384 Feb  2 09:09 3767
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 548
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 549
-rw-------  1 pgstbf  staff       0 Feb  2 09:09 826
-rw-------  1 pgstbf  staff       0 Feb  2 09:09 826_vm
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 827
-rw-------  1 pgstbf  staff    8192 Feb  2 09:09 828
-rw-------  1 pgstbf  staff       4 Feb  2 09:09 PG_VERSION
-rw-------  1 pgstbf  staff     512 Feb  2 09:09 pg_filenode.map
-rw-------  1 pgstbf  staff  112660 Feb  2 09:09 pg_internal.init

On 01/02/2017 at 21:26, Konstantin Knizhnik wrote:
On 02/01/2017 08:30 PM, REIX, Tony wrote:
Hi Konstantin,
If you could *share scripts or instructions about the tests you are doing with pgbench*, I would reproduce them here.

You do not need any script. Just two simple commands. One to initialize the database:
pgbench -i -s 1000
And another to run the benchmark itself:
pgbench -c 100 -j 20 -P 1 -T 10
...
Regards,
Tony

On 01/02/2017 at 16:59, Konstantin Knizhnik wrote:
Hi Tony,
On 01.02.2017 18:42, REIX, Tony wrote:
Hi Konstantin
*XLC.* I'm on AIX 7.1 for now.
I'm using this version of
Re: [HACKERS] Deadlock in XLogInsert at AIX
IPRA.$BackendRun() at 0x1001301ac
BackendStartup() at 0x10012f784
postmaster.IPRA.$do_start_bgworker.IPRA.$ServerLoop() at 0x10012fec4
PostmasterMain() at 0x100134760
main() at 0x106ac
(dbx) where
send(??, ??, ??, ??) at 0x910cfb4
secure_write() at 0x100141a94
IPRA.$internal_flush() at 0x10013e800
socket_flush() at 0x10013ee7c
ReadyForQuery@AF106_4() at 0x10033241c
PostgresMain() at 0x10006eb7c
IPRA.$BackendRun() at 0x1001301ac
BackendStartup() at 0x10012f784
postmaster.IPRA.$do_start_bgworker.IPRA.$ServerLoop() at 0x10012fec4
PostmasterMain() at 0x100134760
main() at 0x106ac
(dbx) cont
execution completed (exit code 1)
(dbx) quit
^C
(dbx) quit
libdebug assertion "(rc == DB_SUCCESS)" failed at line 162 in file ../../../../../../../../../../../src/bos/usr/ccs/lib/libdbx/libdebug/modules/procdebug/ptrace/procdb_PtraceSession.C

I have no idea what's going on. It is a release build without debug information, assert checks, or LWLock info, so it is hard to debug. Heikki, I would be pleased if you have a chance to log in to the system and look at it yourself. Maybe you will have some idea what's happening...

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Deadlock in XLogInsert at AIX
On 02/01/2017 08:28 PM, Heikki Linnakangas wrote:
>
> But if there's no pressing reason to change it, let's leave it alone. It's
> not related to the problem at hand, right?
>
Yes, I agree with you: we had better leave it as it is.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Deadlock in XLogInsert at AIX
On 02/01/2017 08:30 PM, REIX, Tony wrote:
> Hi Konstantin,
> Please run: /opt/IBM/xlc/13.1.3/bin/xlc -qversion so that I know your exact XLC v13 version.

IBM XL C/C++ for AIX, V13.1.3 (5725-C72, 5765-J07)

> I'm building on Power7 and not giving any architecture flag to XLC.
> I'm not using -qalign=natural. Thus, by default, XLC uses -qalign=power, which is close to natural, as explained at:
> https://www.ibm.com/support/knowledgecenter/SSGH2K_13.1.0/com.ibm.xlc131.aix.doc/compiler_ref/opt_align.html
> Why are you using this flag?

Because otherwise the double type is aligned on 4 bytes.

> Thanks for the info about pgbench. The PostgreSQL web-site contains a lot of old information...
> If you could share scripts or instructions about the tests you are doing with pgbench, I would reproduce them here.

You do not need any script. Just two simple commands. One to initialize the database:
pgbench -i -s 1000
And another to run the benchmark itself:
pgbench -c 100 -j 20 -P 1 -T 10

> I have no "real" application. My job consists in porting OpenSource packages to AIX. Many packages. Erlang, Go, these days. I just want to make the PostgreSQL RPMs as good as possible... within the limited amount of time I can give to this package, before moving to another one.
> About the zombie issue, I've discussed with my colleagues. It looks like the process stays zombie till the father looks at its status. However, though I did that several times, I do not remember the details well. And that should not be specific to AIX. I'll discuss with another colleague tomorrow, who should understand this better than me.

1. The process is not in zombie state (according to ps). It is in state... It is something AIX specific; I have not seen processes in this state on Linux.
2. I have implemented a simple test - a forkbomb. It creates 1000 children and then waits for them. It is about ten times slower than on Intel/Linux, but still much faster than 100 seconds. So there is some difference between a Postgres backend and a dummy process doing nothing - just immediately terminating after return from fork().

> Patch for Large Files: When building PostgreSQL, I found it necessary to use the following patch so that PostgreSQL works with large files. I do not remember the details. Do you agree with such a patch? The 1st version (new-...) shows the exact places where #define _LARGE_FILES 1 is required. The 2nd version (new2-...) is simpler.
> I'm now experimenting with your patch for the deadlock. However, that should be invisible with the "check-world" tests I guess.
> Regards,
> Tony
>
> On 01/02/2017 at 16:59, Konstantin Knizhnik wrote:
>> Hi Tony,
>> On 01.02.2017 18:42, REIX, Tony wrote:
>>> Hi Konstantin
>>> XLC. I'm on AIX 7.1 for now.
>>> I'm using this version of XLC v13:
>>> # xlc -qversion
>>> IBM XL C/C++ for AIX, V13.1.3 (5725-C72, 5765-J07)
>>> Version: 13.01.0003.0003
>>> With this version, I have (at least, since I tested with "check" and not "check-world" at that time) 2 failing tests: create_aggregate, aggregates.
>>> With the following XLC v12 version, I have NO test failure:
>>> # /usr/vac/bin/xlc -qversion
>>> IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)
>>> Version: 12.01..0016
>>> So maybe you are not using XLC v13.1.3.3, but rather another sub-version. Unless you are using more options for the configure?
>>> Configure. What are the options that you give to the configure?
>>
>> export CC="/opt/IBM/xlc/13.1.3/bin/xlc"
>> export CFLAGS="-qarch=pwr8 -qtune=pwr8 -O2 -qalign=natural -q64 "
>> export LDFLAGS="-Wl,-bbigtoc,-b64"
>> export AR="/usr/bin/ar -X64"
>> export LD="/usr/bin/ld -b64 "
>> export NM="/usr/bin/nm -X64"
>> ./configure --prefix="/opt/postgresql/xlc-debug/9.6"
>>
>>> Hard load & 64 cores? OK. That clearly explains why I do not see this issue.
>>> pgbench? I wanted to run it. However, I'm still looking where to get it, plus a guide for using it for testing.
>>
>> pgbench is part of the Postgres distribution (src/bin/pgbench)
>>
>>> I would add such tests when building my PostgreSQL RPMs on AIX. So any
Re: [HACKERS] logical decoding of two-phase transactions
On 02/01/2017 10:32 PM, Tom Lane wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
>> doesn't seem inordinately expensive to me.
> I'm confused ... isn't it there already? If not, how do we handle
> reconstructing 2PC state from WAL at all?
>
> regards, tom lane

Right now logical decoding ignores PREPARE and takes into account only COMMIT PREPARED:

/*
 * Currently decoding ignores PREPARE TRANSACTION and will just
 * decode the transaction when the COMMIT PREPARED is sent or
 * throw away the transaction's contents when a ROLLBACK PREPARED
 * is received. In the future we could add code to expose prepared
 * transactions in the changestream allowing for a kind of
 * distributed 2PC.
 */

For some scenarios this works well, but if we really need the prepared state on the replica (as in the case of multimaster), then it is not enough.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
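The current behavior is easy to observe with the contrib test_decoding plugin (a sketch; the table t and the slot name are arbitrary, and it requires wal_level = logical and max_prepared_transactions > 0):

SELECT pg_create_logical_replication_slot('s1', 'test_decoding');
BEGIN;
INSERT INTO t VALUES (1);
PREPARE TRANSACTION 'gid1';
-- nothing has been decoded for the prepared transaction yet:
SELECT data FROM pg_logical_slot_get_changes('s1', NULL, NULL);
COMMIT PREPARED 'gid1';
-- now the whole transaction appears in the change stream at once:
SELECT data FROM pg_logical_slot_get_changes('s1', NULL, NULL);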
Re: [HACKERS] Deadlock in XLogInsert at AIX
On 01.02.2017 15:39, Heikki Linnakangas wrote:
In summary, I came up with the attached. It's essentially your patch, with tweaks for the above-mentioned things. I don't have a powerpc system to test on, so there are probably some silly typos there.

Attached please find the fixed version of your patch. I verified that it applies cleanly and builds, and that Postgres works normally with it.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/include/port/atomics/arch-ppc.h b/src/include/port/atomics/arch-ppc.h
index ed1cd9d1b9..7cf8c8ef97 100644
--- a/src/include/port/atomics/arch-ppc.h
+++ b/src/include/port/atomics/arch-ppc.h
@@ -23,4 +23,11 @@
 #define pg_memory_barrier_impl()	__asm__ __volatile__ ("sync" : : : "memory")
 #define pg_read_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
 #define pg_write_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
+
+#elif defined(__IBMC__) || defined(__IBMCPP__)
+
+#define pg_memory_barrier_impl()	__sync()
+#define pg_read_barrier_impl()		__lwsync()
+#define pg_write_barrier_impl()		__lwsync()
+
 #endif

diff --git a/src/include/port/atomics/generic-xlc.h b/src/include/port/atomics/generic-xlc.h
index f854612d39..e1dd3310a5 100644
--- a/src/include/port/atomics/generic-xlc.h
+++ b/src/include/port/atomics/generic-xlc.h
@@ -48,7 +48,7 @@ pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
 	 * consistency only, do not use it here. GCC atomics observe the same
 	 * restriction; see its rs6000_pre_atomic_barrier().
 	 */
-	__asm__ __volatile__ (" sync \n" ::: "memory");
+	__sync();
 
 	/*
 	 * XXX: __compare_and_swap is defined to take signed parameters, but that
@@ -73,11 +73,19 @@ pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
 static inline uint32
 pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 {
+	uint32 ret;
+
 	/*
-	 * __fetch_and_add() emits a leading "sync" and trailing "isync", thereby
-	 * providing sequential consistency. This is undocumented.
+	 * Use __sync() before and __isync() after, like in compare-exchange
+	 * above.
	 */
-	return __fetch_and_add((volatile int *)&ptr->value, add_);
+	__sync();
+
+	ret = __fetch_and_add((volatile int *)&ptr->value, add_);
+
+	__isync();
+
+	return ret;
 }
 
 #ifdef PG_HAVE_ATOMIC_U64_SUPPORT
@@ -89,7 +97,7 @@ pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
 {
 	bool		ret;
 
-	__asm__ __volatile__ (" sync \n" ::: "memory");
+	__sync();
 
 	ret = __compare_and_swaplp((volatile long*)&ptr->value,
 							   (long *)expected, (long)newval);
@@ -103,7 +111,15 @@ pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
 static inline uint64
 pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
 {
-	return __fetch_and_addlp((volatile long *)&ptr->value, add_);
+	uint64 ret;
+
+	__sync();
+
+	ret = __fetch_and_addlp((volatile long *)&ptr->value, add_);
+
+	__isync();
+
+	return ret;
 }
 
 #endif /* PG_HAVE_ATOMIC_U64_SUPPORT */

diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 7aad2de..c6ef114 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -832,9 +831,8 @@ typedef unsigned int slock_t;
 
 #include <sys/atomic_op.h>
 
 typedef int slock_t;
-
-#define TAS(lock)			_check_lock((slock_t *) (lock), 0, 1)
-#define S_UNLOCK(lock)		_clear_lock((slock_t *) (lock), 0)
+#define TAS(lock)			__check_lock_mp((slock_t *) (lock), 0, 1)
+#define S_UNLOCK(lock)		__clear_lock_mp((slock_t *) (lock), 0)
 
 #endif	 /* _AIX */
Re: [HACKERS] Deadlock in XLogInsert at AIX
Hi Tony,

On 01.02.2017 18:42, REIX, Tony wrote:
> Hi Konstantin
>
> XLC. I'm on AIX 7.1 for now.
> I'm using this version of XLC v13:
> # xlc -qversion
> IBM XL C/C++ for AIX, V13.1.3 (5725-C72, 5765-J07)
> Version: 13.01.0003.0003
> With this version, I have (at least, since I tested with "check" and not "check-world" at that time) 2 failing tests: create_aggregate, aggregates.
> With the following XLC v12 version, I have NO test failure:
> # /usr/vac/bin/xlc -qversion
> IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)
> Version: 12.01..0016
> So maybe you are not using XLC v13.1.3.3, but rather another sub-version. Unless you are using more options for the configure?
>
> Configure. What are the options that you give to the configure?

export CC="/opt/IBM/xlc/13.1.3/bin/xlc"
export CFLAGS="-qarch=pwr8 -qtune=pwr8 -O2 -qalign=natural -q64 "
export LDFLAGS="-Wl,-bbigtoc,-b64"
export AR="/usr/bin/ar -X64"
export LD="/usr/bin/ld -b64 "
export NM="/usr/bin/nm -X64"
./configure --prefix="/opt/postgresql/xlc-debug/9.6"

> Hard load & 64 cores? OK. That clearly explains why I do not see this issue.
>
> pgbench? I wanted to run it. However, I'm still looking where to get it, plus a guide for using it for testing.

pgbench is part of the Postgres distribution (src/bin/pgbench).

> I would add such tests when building my PostgreSQL RPMs on AIX. So any help is welcome!
>
> Performance. Also, I'd like to compare PostgreSQL performance on AIX vs Linux/PPC64. Any idea how I should proceed? Any PostgreSQL performance benchmark that I could find and use? pgbench?

pgbench is the most widely used tool for simulating an OLTP workload. Certainly it is quite primitive and its results are rather artificial. TPC-C seems to be a better choice. But the best option is to implement your own benchmark simulating the actual workload of your real application.

> I'm interested in any information for improving the performance & quality of my PostgreSQL RPMs on AIX. (As I already said, BullFreeware RPMs for AIX are free and can be used by anyone, like Perzl RPMs are. My company (ATOS/Bull) has sold IBM Power machines under the Escala brand for ages - 25 years this year.)
>
> How to help? How could I help to improve the quality and performance of PostgreSQL on AIX?

We still have one open issue on AIX: see https://www.mail-archive.com/pgsql-hackers@postgresql.org/msg303094.html
It would be great if you could somehow help to fix this problem.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Deadlock in XLogInsert at AIX
Hi,

We are using version 13.1.3 of XLC. All tests pass. Please notice that this is a synchronization bug which can be reproduced only under hard load. Our server has 64 cores and it is necessary to run pgbench with 100 connections for several minutes to reproduce the problem. So maybe you just didn't notice it ;)

On 01.02.2017 16:29, REIX, Tony wrote:
Hi,
I'm now working on the port of PostgreSQL to AIX. (RPMs can be found, as free OpenSource work, at http://bullfreeware.com/ . http://bullfreeware.com/search.php?package=postgresql)
I was not aware of any issue with XLC v12 on AIX for atomic operations. (XLC v13 generates at least 2 test failures.) For now, with version 9.6.1, all "check-world" tests, plus numeric_bigtest, are OK, in both 32 and 64 bit versions. Am I missing something?
I configure the build of PostgreSQL with (in 64 bits):
./configure --prefix=/opt/freeware --libdir=/opt/freeware/lib64 --mandir=/opt/freeware/man --with-perl --with-tcl --with-tclconfig=/opt/freeware/lib --with-python --with-ldap --with-openssl --with-libxml --with-libxslt --enable-nls --enable-thread-safety --sysconfdir=/etc/sysconfig/postgresql
Am I missing some option for more optimization on AIX?
Thanks
Regards,
Tony

On 01/02/2017 at 12:07, Konstantin Knizhnik wrote:
Attached please find my patch for XLC/AIX. The most critical fix is adding __sync to pg_atomic_fetch_add_u32_impl. The comment in this file says that:

* __fetch_and_add() emits a leading "sync" and trailing "isync", thereby
* providing sequential consistency. This is undocumented.

But this is not true any more (I checked the generated assembler code in the debugger). This is why I have added __sync() to this function. Now pgbench works normally.
Also there is a mysterious disappearance of the assembler section with the sync instruction from pg_atomic_compare_exchange_u32_impl. I have fixed it by using the __sync() built-in function instead.
Thanks to everybody who helped me to locate and fix this problem.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Deadlock in XLogInsert at AIX
On 01.02.2017 15:39, Heikki Linnakangas wrote:
On 02/01/2017 01:07 PM, Konstantin Knizhnik wrote:
Attached please find my patch for XLC/AIX. The most critical fix is adding __sync to pg_atomic_fetch_add_u32_impl. The comment in this file says that:

* __fetch_and_add() emits a leading "sync" and trailing "isync", thereby
* providing sequential consistency. This is undocumented.

But this is not true any more (I checked the generated assembler code in the debugger). This is why I have added __sync() to this function. Now pgbench works normally.

Seems like it was not so much undocumented, but an implementation detail that was not guaranteed after all. Does __fetch_and_add emit a trailing isync there either? Seems odd if __compare_and_swap requires it, but __fetch_and_add does not. Unless we can find conclusive documentation on that, I think we should assume that an __isync() is required, too. There was a long thread on these things the last time this was changed: https://www.postgresql.org/message-id/20160425185204.jrvlghn3jxulsb7i%40alap3.anarazel.de. I couldn't find an explanation there of why we thought that fetch_and_add implicitly performs sync and isync.

Also there is a mysterious disappearance of the assembler section with the sync instruction from pg_atomic_compare_exchange_u32_impl. I have fixed it by using the __sync() built-in function instead.

__sync() seems more appropriate there, anyway. We're using intrinsics for all the other things in generic-xlc.h. But it sure is scary that the "asm" sections just disappeared. In arch-ppc.h, shouldn't we have #ifdef __IBMC__ guards for the __sync() and __lwsync() intrinsics? Those are an xlc compiler-specific thing, right? Or if they are expected to work on any ppc compiler, then we should probably use them always, instead of the asm sections. In summary, I came up with the attached. It's essentially your patch, with tweaks for the above-mentioned things. I don't have a powerpc system to test on, so there are probably some silly typos there.

- Heikki

Why do you prefer to use _check_lock instead of __check_lock_mp? The first one is not even mentioned in the XLC compiler manual:
http://www-01.ibm.com/support/docview.wss?uid=swg27046906=7
or
http://scv.bu.edu/computation/bluegene/IBMdocs/compiler/xlc-8.0/html/compiler/ref/bif_sync.htm

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Deadlock in XLogInsert at AIX
Attached please find my patch for XLC/AIX. The most critical fix is adding __sync to pg_atomic_fetch_add_u32_impl. The comment in this file says that:

* __fetch_and_add() emits a leading "sync" and trailing "isync", thereby
* providing sequential consistency. This is undocumented.

But this is not true any more (I checked the generated assembler code in the debugger). This is why I have added __sync() to this function. Now pgbench works normally. Also there is a mysterious disappearance of the assembler section with the sync instruction from pg_atomic_compare_exchange_u32_impl. I have fixed it by using the __sync() built-in function instead. Thanks to everybody who helped me to locate and fix this problem.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/include/port/atomics/arch-ppc.h b/src/include/port/atomics/arch-ppc.h
index 2b54c42..5828f7e 100644
--- a/src/include/port/atomics/arch-ppc.h
+++ b/src/include/port/atomics/arch-ppc.h
@@ -23,4 +23,10 @@
 #define pg_memory_barrier_impl()	__asm__ __volatile__ ("sync" : : : "memory")
 #define pg_read_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
 #define pg_write_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
+
+#else
+#define pg_memory_barrier_impl()	__sync()
+#define pg_read_barrier_impl()		__lwsync()
+#define pg_write_barrier_impl()	__lwsync()
+
 #endif

diff --git a/src/include/port/atomics/generic-xlc.h b/src/include/port/atomics/generic-xlc.h
index f4fd2f3..531d17c 100644
--- a/src/include/port/atomics/generic-xlc.h
+++ b/src/include/port/atomics/generic-xlc.h
@@ -36,7 +36,8 @@ typedef struct pg_atomic_uint64
 #endif /* __64BIT__ */
 
 #define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
 static inline bool
 pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
 									uint32 *expected, uint32 newval)
 {
@@ -48,14 +49,14 @@ pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
 	 * consistency only, do not use it here. GCC atomics observe the same
 	 * restriction; see its rs6000_pre_atomic_barrier().
 	 */
-	__asm__ __volatile__ (" sync \n" ::: "memory");
+	__sync();
 
 	/*
 	 * XXX: __compare_and_swap is defined to take signed parameters, but that
 	 * shouldn't matter since we don't perform any arithmetic operations.
 	 */
 	ret = __compare_and_swap((volatile int*)&ptr->value,
@@ -77,6 +78,7 @@ pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 	 * __fetch_and_add() emits a leading "sync" and trailing "isync", thereby
 	 * providing sequential consistency. This is undocumented.
	 */
+	__sync();
 	return __fetch_and_add((volatile int *)&ptr->value, add_);
 }
 
@@ -89,10 +91,10 @@ pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
 {
 	bool		ret;
 
-	__asm__ __volatile__ (" sync \n" ::: "memory");
+	__sync();
 
 	ret = __compare_and_swaplp((volatile long*)&ptr->value,
 							   (long *)expected, (long)newval);
 	__isync();
@@ -103,7 +105,8 @@ pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
 static inline uint64
 pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
 {
-	return __fetch_and_addlp((volatile long *)&ptr->value, add_);
+	__sync();
+	return __fetch_and_addlp((volatile long *)&ptr->value, add_);
 }
 
 #endif /* PG_HAVE_ATOMIC_U64_SUPPORT */

diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 7aad2de..c6ef114 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -832,9 +831,8 @@ typedef unsigned int slock_t;
 
 #include <sys/atomic_op.h>
 
 typedef int slock_t;
-
-#define TAS(lock)			_check_lock((slock_t *) (lock), 0, 1)
-#define S_UNLOCK(lock)		_clear_lock((slock_t *) (lock), 0)
+#define TAS(lock)			__check_lock_mp((slock_t *) (lock), 0, 1)
+#define S_UNLOCK(lock)		__clear_lock_mp((slock_t *) (lock), 0)
 
 #endif	 /* _AIX */

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Deadlock in XLogInsert at AIX
ction begins immediately after __compare_and_swap(). */
	__isync();
	return ret;
}

and if I compile this function standalone, I get the following assembler code:

.pg_atomic_compare_exchange_u32_impl:   # 0x (H.4.NO_SYMBOL)
	stdu SP,-128(SP)
	std r3,176(SP)
	std r4,184(SP)
	std r5,192(SP)
	ld r0,192(SP)
	stw r0,192(SP)
	sync
	ld r4,176(SP)
	ld r3,184(SP)
	lwz r0,192(SP)
	extsw r0,r0
	lwa r5,0(r3)
__L30:   # 0x0030 (H.4.NO_SYMBOL+0x030)
	lwarx r6,r0,r4
	cmpl 0,0,r5,r6
	bc BO_IF_NOT,CR0_EQ,__L44
	stwcx. r0,r0,r4
	.machine "any"
	bc BO_IF_NOT_3,CR0_EQ,__L30
__L44:   # 0x0044 (H.4.NO_SYMBOL+0x044)
	ori r0,r6,0x
	stw r0,0(r3)
	mfcr r0
	rlwinm r0,r0,3,31,31
	rldicl r0,r0,0,56
	stb r0,112(SP)
	isync
	lbz r3,112(SP)
	addi SP,SP,128
	bclr BO_ALWAYS,CR0_LT

sync is here!

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Deadlock in XLogInsert at AIX
On 30.01.2017 19:21, Heikki Linnakangas wrote:
On 01/24/2017 04:47 PM, Konstantin Knizhnik wrote:

Interesting. What should happen here is that, for the backend's own insertion slot, the "insertingat" value should be greater than the requested flush point (the 'upto' variable). That's because before GetXLogBuffer() calls AdvanceXLInsertBuffer(), it updates the backend's insertingat value to the position that it wants to insert to. And AdvanceXLInsertBuffer() only calls WaitXLogInsertionsToFinish() with a value smaller than what was passed as the 'upto' argument.

The comment to WaitXLogInsertionsToFinish says:
* Note: When you are about to write out WAL, you must call this function
* *before* acquiring WALWriteLock, to avoid deadlocks. This function might
* need to wait for an insertion to finish (or at least advance to next
* uninitialized page), and the inserter might need to evict an old WAL buffer
* to make room for a new one, which in turn requires WALWriteLock.
Which contradicts the observed stack trace.

Not AFAICS. In the stack trace you showed, the backend is not holding WALWriteLock. It would only acquire it after the WaitXLogInsertionsToFinish() call finished.

Hmmm, maybe I missed something. I am not talking about the WALBufMappingLock which is acquired after returning from WaitXLogInsertionsToFinish, but about the lock obtained by WALInsertLockAcquire at line 946 in XLogInsertRecord. It will be released at line 1021 by WALInsertLockRelease(). But CopyXLogRecordToWAL is invoked with this lock granted.

This line in the stack trace is suspicious:
WaitXLogInsertionsToFinish(upto = 102067874328), line 1583 in "xlog.c"
AdvanceXLInsertBuffer() should only ever call WaitXLogInsertionsToFinish() with an xlog position that points to a page boundary, but that upto value points to the middle of a page. Perhaps the value stored in the stack trace is not what the caller passed, but it was updated because it was past the 'reserveUpto' value? That would explain the "request to flush past end of generated WAL" notices you saw in the log. Now, why would that happen, I have no idea. If you can and want to provide me access to the system, I could have a look myself. I'd also like to see if the attached additional Assertions will fire.

I really do get this assertion failure:

ExceptionalCondition(conditionName = "!(OldPageRqstPtr <= upto || opportunistic)", errorType = "FailedAssertion", fileName = "xlog.c", lineNumber = 1917), line 54 in "assert.c"
(dbx) up
unnamed block in AdvanceXLInsertBuffer(upto = 147439056632, opportunistic = '\0'), line 1917 in "xlog.c"
(dbx) p OldPageRqstPtr
147439058944
(dbx) p upto
147439056632
(dbx) p opportunistic
'\0'

Also, in another run, I encountered yet another assertion failure:

ExceptionalCondition(conditionName = "!(((NewPageBeginPtr / 8192) % (XLogCtl->XLogCacheBlck + 1)) == nextidx)", errorType = "FailedAssertion", fileName = "xlog.c", lineNumber = 1950), line 54 in "assert.c"

nextidx equals 1456, while the expected value is 1457.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] logical decoding of two-phase transactions
On 31.01.2017 09:29, Michael Paquier wrote:

On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <cr...@2ndquadrant.com> wrote:

Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when wal_level >= logical I don't think that's the end of the world. But since we already have almost everything we need in memory, why not just stash the gid on ReorderBufferTXN?

I have been through this thread... And to be honest, I have a hard time understanding for which purpose the information of a 2PC transaction is useful in the case of logical decoding. The prepare and commit prepared have been received by a node which is at the root of the cluster tree, a node of the cluster at an upper level, or a client, being in charge of issuing all the prepare queries, and then issue the commit prepared to finish the transaction across a cluster. In short, even if you do logical decoding from the root node, or the one at a higher level, you would care just about the fact that it has been committed.

Sorry, maybe I do not completely understand your arguments. Actually, our multimaster is now completely based on logical replication and 2PC (more precisely, we are using 3PC now:). The state of a transaction (prepared, precommitted, committed) should be persisted in WAL to make recovery possible. Recovery can involve transactions in any state. So there are three records in the WAL: PREPARE, PRECOMMIT, and COMMIT_PREPARED, and recovery can involve either all of them, just PRECOMMIT+COMMIT_PREPARED, or just COMMIT_PREPARED.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
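For illustration, the persistent transaction states described above form a simple progression; a hypothetical sketch (the names are assumed, not taken from the actual multimaster code):

typedef enum MtmTxPersistentState
{
    MTM_TX_PREPARED,        /* PREPARE record written to WAL */
    MTM_TX_PRECOMMITTED,    /* PRECOMMIT record written (extra 3PC phase) */
    MTM_TX_COMMITTED        /* COMMIT_PREPARED record written */
} MtmTxPersistentState;

Recovery then replays whatever suffix of this progression is present in WAL for each transaction.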
Re: [HACKERS] Deadlock in XLogInsert at AIX
More information about the problem - the Postgres log contains several records:

2017-01-24 19:15:20.272 MSK [19270462] LOG: request to flush past end of generated WAL; request 6/AAEBE000, currpos 6/AAEBC2B0

and they correspond to the time when the deadlock happened. There is the following comment in xlog.c concerning this message:

	/*
	 * No-one should request to flush a piece of WAL that hasn't even been
	 * reserved yet. However, it can happen if there is a block with a bogus
	 * LSN on disk, for example. XLogFlush checks for that situation and
	 * complains, but only after the flush. Here we just assume that to mean
	 * that all WAL that has been reserved needs to be finished. In this
	 * corner-case, the return value can be smaller than 'upto' argument.
	 */

So it looks like this should not happen. The first thing to suspect is the spinlock implementation, which is different for GCC and XLC. But... if I rebuild Postgres without spinlocks, the problem is still reproduced.

On 24.01.2017 17:47, Konstantin Knizhnik wrote:

Hi Hackers,

We are running Postgres at AIX and encountered two strange problems: active zombie processes and a deadlock in XLogInsert. The first problem I will explain in a separate mail; now I am mostly concerned about the deadlock. It is irregularly reproduced with standard pgbench launched with 100 connections. It sometimes happens with the 9.6 stable version of Postgres, but only when it is compiled with the xlc compiler. We failed to reproduce the problem with GCC. So it looks like a bug in the compiler or the xlc-specific atomics implementation... But there are a few things which contradict this hypothesis:

1. The problem is reproduced with Postgres built without optimization. Usually compiler bugs affect only optimized code.
2. Disabling atomics doesn't help.
3. Without optimization and with LOCK_DEBUG defined, the time needed to reproduce the problem significantly increases. With optimized code it is almost always reproduced in a few minutes. With the debug version it usually takes much more time.

But the most confusing thing is the stack trace:

(dbx) where
semop(??, ??, ??)
at 0x91f5790
PGSemaphoreLock(sema = 0x0a0044b95928), line 387 in "pg_sema.c"
unnamed block in LWLockWaitForVar(lock = 0x0a00d980, valptr = 0x0a00d9a8, oldval = 102067874256, newval = 0x0fff9c10), line 1666 in "lwlock.c"
LWLockWaitForVar(lock = 0x0a00d980, valptr = 0x0a00d9a8, oldval = 102067874256, newval = 0x0fff9c10), line 1666 in "lwlock.c"
unnamed block in WaitXLogInsertionsToFinish(upto = 102067874328), line 1583 in "xlog.c"
WaitXLogInsertionsToFinish(upto = 102067874328), line 1583 in "xlog.c"
AdvanceXLInsertBuffer(upto = 102067874256, opportunistic = '\0'), line 1916 in "xlog.c"
unnamed block in GetXLogBuffer(ptr = 102067874256), line 1697 in "xlog.c"
GetXLogBuffer(ptr = 102067874256), line 1697 in "xlog.c"
CopyXLogRecordToWAL(write_len = 70, isLogSwitch = '\0', rdata = 0x00011007ce10, StartPos = 102067874256, EndPos = 102067874328), line 1279 in "xlog.c"
XLogInsertRecord(rdata = 0x00011007ce10, fpw_lsn = 102067718328), line 1011 in "xlog.c"
unnamed block in XLogInsert(rmid = '\n', info = '@'), line 453 in "xloginsert.c"
XLogInsert(rmid = '\n', info = '@'), line 453 in "xloginsert.c"
log_heap_update(reln = 0x000110273540, oldbuf = 40544, newbuf = 40544, oldtup = 0x0fffa2a0, newtup = 0x0001102bb958, old_key_tuple = (nil), all_visible_cleared = '\0', new_all_visible_cleared = '\0'), line 7708 in "heapam.c"
unnamed block in heap_update(relation = 0x000110273540, otid = 0x0fffa6f8, newtup = 0x0001102bb958, cid = 1, crosscheck = (nil), wait = '^A', hufd = 0x0fffa5b0, lockmode = 0x0fffa5c8), line 4212 in "heapam.c"
heap_update(relation = 0x000110273540, otid = 0x0fffa6f8, newtup = 0x0001102bb958, cid = 1, crosscheck = (nil), wait = '^A', hufd = 0x0fffa5b0, lockmode = 0x0fffa5c8), line 4212 in "heapam.c"
unnamed block in ExecUpdate(tupleid = 0x0fffa6f8, oldtuple = (nil), slot = 0x0001102bb308, planSlot = 0x0001102b4630, epqstate = 0x0001102b2cd8, estate = 0x0001102b29e0, canSetTag = '^A'), line 937 in "nodeModifyTable.c"
ExecUpdate(tupleid = 0x0fffa6f8, oldtuple = (nil), slot = 0x0001102bb308, planSlot = 0x0001102b4630, epqstate = 0x0001102b2cd8, estate = 0x0001102b29e0, canSetTag = '^A'), line 937 in "nodeModifyTable.c"
ExecModifyTable(node = 0x0001102b2c30), line 1516 in "nodeModifyTable.c"
ExecProcNode(node = 0x0001102b2c30), line 396 in "execProcnode.c"
ExecutePlan(estate = 0x0001102b29e0, planstate = 0x0001102b2c30, use_parallel_mode = '\0', operation = CMD_UPDATE,
Re: [HACKERS] Active zombies at AIX
On 24.01.2017 18:26, Tom Lane wrote:

Konstantin Knizhnik <k.knizh...@postgrespro.ru> writes:

But ps shows that the status of the process is
[14:46:02]root@postgres:~ # ps -elk | grep 25691556
* A - 25691556 - - - - -

As far as I could find by googling, this means that the process is not actually a zombie yet, so it's not the postmaster's fault. Apparently it's possible in some versions of AIX for an exiting process to get stuck while releasing its reference to a socket, though I couldn't find much detail about that. I wonder how old your AIX is ...

It is AIX 7.1 (I believe it is the most recent version of AIX).

regards, tom lane

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
[HACKERS] Active zombies at AIX
Hi hackers,

Yet another story about AIX. For some reason AIX cleans up zombie processes very slowly. If we launch pgbench with the -C parameter, the limit on the maximal number of connections is very soon exhausted. If the maximal number of connections is set to 1000, then after ten seconds of pgbench activity we get about 900 zombie processes, and it takes about 100 seconds (!) before all of them are terminated.

proctree shows a lot of defunct processes:

[14:44:41]root@postgres:~ # proctree 26084446
26084446 /opt/postgresql/xlc/9.6/bin/postgres -D /postg_fs/postgresql/xlc
4784362
4980786
11403448
11468930
11993176
12189710
12517390
13238374
13565974
13893826 postgres: wal writer process
14024716
15401000
...
25691556

But ps shows that the status of the process is

[14:46:02]root@postgres:~ # ps -elk | grep 25691556
* A - 25691556 - - - - -

A breakpoint set in the reaper() function in the postmaster shows that each invocation of this function (called by the SIGCHLD handler) processes 5-10 PIDs.

So there are two hypotheses: either AIX delivers SIGCHLD to the parent very slowly, or process exit takes too much time. The fact that the backends are in exiting state makes the second hypothesis more plausible.

We have tried different Postgres configurations with local and TCP sockets, with different amounts of shared buffers, built both with gcc and xlc. In all cases the behavior is similar: zombies do not want to die. Since it is not possible to attach a debugger to a defunct process, it is not clear how to understand what's going on.

I wonder if somebody has encountered similar problems at AIX and maybe can suggest some solution. Thanks in advance

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
[HACKERS] Deadlock in XLogInsert at AIX
;), line 1303 in "pquery.c"
PortalRunMulti(portal = 0x000110115e20, isTopLevel = '^A', setHoldSnapshot = '\0', dest = 0x0001102b7520, altdest = 0x0001102b7520, completionTag = ""), line 1303 in "pquery.c"
unnamed block in PortalRun(portal = 0x000110115e20, count = 9223372036854775807, isTopLevel = '^A', dest = 0x0001102b7520, altdest = 0x0001102b7520, completionTag = ""), line 815 in "pquery.c"
PortalRun(portal = 0x000110115e20, count = 9223372036854775807, isTopLevel = '^A', dest = 0x0001102b7520, altdest = 0x0001102b7520, completionTag = ""), line 815 in "pquery.c"
unnamed block in exec_simple_query(query_string = "UPDATE pgbench_tellers SET tbalance = tbalance + 4019 WHERE tid = 6409;"), line 1094 in "postgres.c"
exec_simple_query(query_string = "UPDATE pgbench_tellers SET tbalance = tbalance + 4019 WHERE tid = 6409;"), line 1094 in "postgres.c"
unnamed block in PostgresMain(argc = 1, argv = 0x000110119f68, dbname = "postgres", username = "postgres"), line 4076 in "postgres.c"
PostgresMain(argc = 1, argv = 0x000110119f68, dbname = "postgres", username = "postgres"), line 4076 in "postgres.c"
BackendRun(port = 0x000110114290), line 4279 in "postmaster.c"
BackendStartup(port = 0x000110114290), line 3953 in "postmaster.c"
unnamed block in ServerLoop(), line 1701 in "postmaster.c"
unnamed block in ServerLoop(), line 1701 in "postmaster.c"
unnamed block in ServerLoop(), line 1701 in "postmaster.c"
ServerLoop(), line 1701 in "postmaster.c"
PostmasterMain(argc = 3, argv = 0x0001100c6190), line 1309 in "postmaster.c"
main(argc = 3, argv = 0x0001100c6190), line 228 in "main.c"

As I already mentioned, we built Postgres with LOCK_DEBUG, so we can inspect the lock owner. The backend is waiting for itself! Now please look at the two frames in this stack trace marked in red.

XLogInsertRecord sets the WAL insert locks at the beginning of the function:

	if (isLogSwitch)
		WALInsertLockAcquireExclusive();
	else
		WALInsertLockAcquire();

WALInsertLockAcquire just selects a random item from the WALInsertLocks array and locks it exclusively:

	if (lockToTry == -1)
		lockToTry = MyProc->pgprocno % NUM_XLOGINSERT_LOCKS;
	MyLockNo = lockToTry;
	immed = LWLockAcquire(&WALInsertLocks[MyLockNo].l.lock, LW_EXCLUSIVE);

Then, following the stack trace, AdvanceXLInsertBuffer calls WaitXLogInsertionsToFinish:

		/*
		 * Now that we have an up-to-date LogwrtResult value, see if we
		 * still need to write it or if someone else already did.
		 */
		if (LogwrtResult.Write < OldPageRqstPtr)
		{
			/*
			 * Must acquire write lock. Release WALBufMappingLock first,
			 * to make sure that all insertions that we need to wait for
			 * can finish (up to this same position). Otherwise we risk
			 * deadlock.
			 */
			LWLockRelease(WALBufMappingLock);

			WaitXLogInsertionsToFinish(OldPageRqstPtr);

			LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);

It releases WALBufMappingLock, but not the WAL insert locks! Finally, WaitXLogInsertionsToFinish tries to wait on all of the insert locks:

	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
	{
		XLogRecPtr	insertingat = InvalidXLogRecPtr;

		do
		{
			/*
			 * See if this insertion is in progress. LWLockWait will wait for
			 * the lock to be released, or for the 'value' to be set by a
			 * LWLockUpdateVar call. When a lock is initially acquired, its
			 * value is 0 (InvalidXLogRecPtr), which means that we don't know
			 * where it's inserting yet. We will have to wait for it. If
			 * it's a small insertion, the record will most likely fit on the
			 * same page and the inserter will release the lock without ever
			 * calling LWLockUpdateVar.
			 * But if it has to sleep, it will
			 * advertise the insertion point with LWLockUpdateVar before
			 * sleeping.
			 */
			if (LWLockWaitForVar(&WALInsertLocks[i].l.lock,
								 &WALInsertLocks[i].l.insertingAt,
								 insertingat, &insertingat))

And here we get stuck! The comment to WaitXLogInsertionsToFinish says:

 * Note: When you are about to write out WAL, you must call this function
 * *before* acquiring WALWriteLock, to avoid deadlocks. This function might
 * need to wait for an insertion to finish (or at least advance to next
 * uninitialized page), and the inserter might need to evict an old WAL buffer
 * to make room for a new one, which in turn requires WALWriteLock.

Which contradicts the observed stack trace. I wonder whether this is really a synchronization bug in xlog.c, or whether there is something wrong with this stack trace and it cannot happen during normal operation?

Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
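Condensed, the cycle reported above looks like this (a summary of the stack trace, not actual code):

/*
 * XLogInsertRecord()
 *   WALInsertLockAcquire()              -- acquires WALInsertLocks[MyLockNo]
 *   CopyXLogRecordToWAL()
 *     GetXLogBuffer()
 *       AdvanceXLInsertBuffer()
 *         WaitXLogInsertionsToFinish()  -- waits on every WAL insert lock,
 *                                          including the one this backend
 *                                          still holds => self-deadlock
 */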
Re: [HACKERS] Unlogged tables cleanup
On Nov 10, 2016, at 10:17 AM, Michael Paquier wrote:
>
> Hm.. I cannot reproduce what you see on Linux or macos. Perhaps you
> have locally a standby pointing as well to this tablespace?

No, it is the latest sources from the Postgres repository. Please note that you should create a new database and tablespace to reproduce this issue. So the whole sequence is

mkdir fs
initdb -D pgsql
pg_ctl -D pgsql -l logfile start
psql postgres
# create tablespace fs location '/home/knizhnik/dtm-data/fs';
# set default_tablespace=fs;
# create unlogged table foo(x integer);
# insert into foo values(generate_series(1,10));
# ^D
pkill -9 postgres
pg_ctl -D pgsql -l logfile start
# select * from foo;

> --
> Michael
[HACKERS] Unlogged tables cleanup
Hi, hackers

I wonder if such behavior can be considered a bug:

knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# create tablespace fs location '/home/knizhnik/dtm-data/fs';
CREATE TABLESPACE
postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo(x integer);
CREATE TABLE
postgres=# insert into foo values(generate_series(1,10));
INSERT 0 10

Now simulate a server crash using "pkill -9 postgres".

knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l logfile start
pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo;
ERROR: could not open file "pg_tblspc/16384/PG_10_201611041/12289/16385": No such file or directory

knizhnik@knizhnik:~/dtm-data$ ls fs
PG_10_201611041
knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/

So the whole relation directory is removed! It happens only for the first table created in the tablespace. If you create a table in the Postgres data directory, everything is ok: the first segment of the relation is truncated but not deleted. Also, if you create one more unlogged table in the tablespace, it is truncated correctly:

postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo1(x integer);
CREATE TABLE
postgres=# insert into foo1 values(generate_series(1,10));
INSERT 0 10
postgres=# \q
knizhnik@knizhnik:~/dtm-data$ pkill -9 postgres
knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l logfile start
pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo1;
 x
---
(0 rows)

knizhnik@knizhnik:~/dtm-data$ ls -l fs/PG_10_201611041/12289/*
-rw------- 1 knizhnik knizhnik 0 Nov 9 19:52 fs/PG_10_201611041/12289/32768
-rw------- 1 knizhnik knizhnik 0 Nov 9 19:52 fs/PG_10_201611041/12289/32768_init

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] On conflict update & hint bits
On 24.10.2016 00:49, Peter Geoghegan wrote:

On Sun, Oct 23, 2016 at 2:46 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

What's bothering me is (a) it's less than 24 hours to release wrap and (b) this patch changes SSI-relevant behavior and hasn't been approved by Kevin. I'm not familiar enough with that logic to commit a change in it on my own authority, especially with so little time for problems to be uncovered.

I should point out that I knew that the next set of point releases had been moved forward much later than you did. I had to work on this fix during the week, which was pretty far from ideal for me for my own reasons.

Just for information: I know that you are working on this issue, but since we need to proceed further with our testing of multimaster, I have made the following obvious changes, and they fix the problem (at least this assertion failure does not happen any more):

src/backend/executor/nodeModifyTable.c
@@ -1087,6 +1087,13 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	test = heap_lock_tuple(relation, &tuple, estate->es_output_cid,
 						   lockmode, LockWaitBlock, false, &buffer,
 						   &hufd);
+	/*
+	 * We must hold share lock on the buffer content while examining tuple
+	 * visibility. Afterwards, however, the tuples we have found to be
+	 * visible are guaranteed good as long as we hold the buffer pin.
+	 */
+	LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
 	switch (test)
 	{
 		case HeapTupleMayBeUpdated:
@@ -1142,6 +1149,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 		 * loop here, as the new version of the row might not conflict
 		 * anymore, or the conflicting tuple has actually been deleted.
 		 */
+		LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 		ReleaseBuffer(buffer);
 		return false;

@@ -1175,6 +1183,8 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	/* Store target's existing tuple in the state's dedicated slot */
 	ExecStoreTuple(&tuple, mtstate->mt_existing, buffer, false);

+	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
 	 * ExecProject. The EXCLUDED tuple is installed in ecxt_innertuple, while

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] On conflict update & hint bits
On 30.09.2016 19:37, Peter Geoghegan wrote:

On Fri, Sep 30, 2016 at 5:33 PM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:

Later we try to check tuple visibility:

	ExecCheckHeapTupleVisible(estate, &tuple, buffer);

and inside HeapTupleSatisfiesMVCC try to set a hint bit.

So, you're using repeatable read or serializable isolation level?

Repeatable read.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
[HACKERS] On conflict update & hint bits
Hi,

I am faced with a rarely reproduced problem in our multimaster (and never in vanilla Postgres). We are using our own customized transaction manager, so it may well be a problem in our multimaster. But the stack trace looks suspicious, and this is why I want to consult with people familiar with this code on whether it is a bug in ExecOnConflictUpdate or not.

Briefly: ExecOnConflictUpdate tries to set a hint bit without holding a lock on the buffer and so gets an assertion failure in MarkBufferDirtyHint.

Now the stack trace:

#0 0x7fe3b940acc9 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:56
#1 0x7fe3b940e0d8 in __GI_abort () at abort.c:89
#2 0x0097b996 in ExceptionalCondition (conditionName=0xb4d970 "!(LWLockHeldByMe(((LWLock*) (&(bufHdr)->content_lock", errorType=0xb4d2e9 "FailedAssertion", fileName=0xb4d2e0 "bufmgr.c", lineNumber=3380) at assert.c:54
#3 0x007e365b in MarkBufferDirtyHint (buffer=946, buffer_std=1 '\001') at bufmgr.c:3380
#4 0x009c3660 in SetHintBits (tuple=0x7fe396a9d858, buffer=946, infomask=256, xid=1398) at tqual.c:136
#5 0x009c5194 in HeapTupleSatisfiesMVCC (htup=0x7ffc00169030, snapshot=0x2e79778, buffer=946) at tqual.c:1065
#6 0x006ace83 in ExecCheckHeapTupleVisible (estate=0x2e81ae8, tuple=0x7ffc00169030, buffer=946) at nodeModifyTable.c:197
#7 0x006ae343 in ExecOnConflictUpdate (mtstate=0x2e81d50, resultRelInfo=0x2e81c38, conflictTid=0x7ffc001690c0, planSlot=0x2e82428, excludedSlot=0x2e82428, estate=0x2e81ae8, canSetTag=1 '\001', returning=0x7ffc001690c8) at nodeModifyTable.c:1173
#8 0x006ad256 in ExecInsert (mtstate=0x2e81d50, slot=0x2e82428, planSlot=0x2e82428, arbiterIndexes=0x2e7eeb0, onconflict=ONCONFLICT_UPDATE, estate=0x2e81ae8, canSetTag=1 '\001') at nodeModifyTable.c:395
#9 0x006aebe3 in ExecModifyTable (node=0x2e81d50) at nodeModifyTable.c:1496

In ExecOnConflictUpdate the buffer is pinned but not locked:

	/*
	 * Lock tuple for update. Don't follow updates when tuple cannot be
	 * locked without doing so. A row locking conflict here means our
	 * previous conclusion that the tuple is conclusively committed is not
	 * true anymore.
	 */
	tuple.t_self = *conflictTid;
	test = heap_lock_tuple(relation, &tuple, estate->es_output_cid,
						   lockmode, LockWaitBlock, false, &buffer,
						   &hufd);

heap_lock_tuple pins the buffer but does not lock it:

 * *buffer: set to buffer holding tuple (pinned but not locked at exit)

Later we try to check tuple visibility:

	ExecCheckHeapTupleVisible(estate, &tuple, buffer);

and inside HeapTupleSatisfiesMVCC try to set a hint bit. MarkBufferDirtyHint assumes that the buffer is locked:

 * 2. The caller might have only share-lock instead of exclusive-lock on the
 * buffer's content lock.

and we get an assertion failure at

	/* here, either share or exclusive lock is OK */
	Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));

So the question is whether it is correct that ExecOnConflictUpdate tries to access and update the tuple without holding a lock on the buffer?

Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
[HACKERS] Handling dropped attributes in pglogical_proto
Hi,

pglogical_read_tuple from pglogical_proto.c contains the following code:

	natts = pq_getmsgint(in, 2);
	if (rel->natts != natts)
		elog(ERROR, "tuple natts mismatch, %u vs %u", rel->natts, natts);

But if the table was just altered and some attribute was removed from it, then rel->natts can be greater than natts. The problem can be fixed by the following patch:

--- a/pglogical_proto.c
+++ b/pglogical_proto.c
@@ -263,10 +263,15 @@ pglogical_read_tuple(StringInfo in, PGLogicalRelation *rel,
 	{
 		int			attid = rel->attmap[i];
 		Form_pg_attribute att = desc->attrs[attid];
-		char		kind = pq_getmsgbyte(in);
+		char		kind;
 		const char *data;
 		int			len;

+		if (att->atttypid == InvalidOid) {
+			continue;
+		}
+		kind = pq_getmsgbyte(in);
+
 		switch (kind)
 		{
 			case 'n': /* null */

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
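A possibly more idiomatic form of the same test would use the catalog flag that explicitly marks dropped columns, rather than relying on atttypid being reset to InvalidOid (a sketch, not tested):

		if (att->attisdropped)
		{
			/* locally dropped attribute; the sender did not transmit it */
			continue;
		}
		kind = pq_getmsgbyte(in);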
Re: [HACKERS] UPSERT strange behavior
On 08/25/2016 10:08 PM, Peter Geoghegan wrote:

On Thu, Aug 25, 2016 at 11:49 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

I think the point is that given the way he's set up the test case, there should be no duplicate violation in the plain unique index unless there is one in the arbiter index. So assuming that INSERT tests the arbiter indexes first, there shouldn't be an error. Maybe it doesn't do that, but it seems like it would be a good idea if it did.

Oh, yeah. This is arguably an example of inference failing to infer multiple unique indexes as arbiters. Inference could, in principle, recognize that the second unique index is equivalent to the first, but doesn't. (I don't think that it matters which order anything is tested in, though, because not finding a dup value in the arbiter index does not guarantee that there won't be one in the other index. There is no locking when no conflict is initially found, and so no guarantees here.) Anyway, I don't have a lot of sympathy for this point of view, because the scenario is completely contrived. You have to draw the line somewhere.

I do not think that this scenario is completely contrived: cases where a table has more than one unique key are quite common. For example, a "user" may have a unique e-mail address, phone number, and login. Also, as far as I know, this is not an artificial example but a real case taken from a customer's application... I am not sure whether it is really a bug or a feature, but the user's intention was obvious: locate the record by one of the unique keys and, if such a record already exists, increment the counter (do update instead of insert). But there are also good arguments why upsert may report a conflict in this case...

If such UPSERT behavior is assumed to be correct, what is the best workaround for the problem if we really need two separate indexes and want to enforce the unique constraint for both keys?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Logical decoding restart problems
Thank you for the answers.

> No, you don't need to recreate them. Just advance your replication identifier
> downstream and request a replay position in the future. Let the existing slot
> skip over unwanted data and resume where you want to start replay.
>
> You can advance the replication origins on the peers as you replay forwarded
> xacts from your master.
>
> Have a look at how the BDR code does this during "catchup mode" replay.
>
> So while your problem discussed below seems concerning, you don't have to
> drop and recreate slots like you are currently doing.

The only reason for recreating the slot is that I want to move it to the current "horizon" and skip all pending transactions without explicitly specifying the restart position. If I do not drop the slot and just restart replication specifying position 0/0 (invalid LSN), then replication will be continued from the current slot position in WAL, won't it? So there is no way to say "start replication from the end of WAL", like lseek(0, SEEK_END). Right now I am trying to overcome this limitation by explicitly calculating the position from which we should continue replication. But unfortunately the problem with partly decoded transactions persists. Next week I will try to create an example reproducing the problem without any multimaster stuff, just using the standard logical decoding plugin.

> > To restart logical decoding we first drop the existing slot, then create a new one
> > and then start logical replication from WAL position 0/0 (invalid LSN).
> > In this case recovery should start from the last consistent point.
>
> How do you create the new slot? SQL interface? walsender interface? Direct C
> calls?

The slot is created by the peer node using a standard libpq connection with a database=replication connection string.

> > The problem is that for some reason the consistent point is not so consistent
> > and we get partly decoded transactions.
> > I.e. the transaction body consists of two UPDATEs, but reorder_buffer extracts only
> > one (the last) update and sends this truncated transaction to the destination,
> > causing a consistency violation at the replica. I started investigating the logical
> > decoding code and found several things which I do not understand.
>
> Yeah, that sounds concerning and shouldn't happen.

I looked at the replication code more closely and understood that my first concerns were wrong. Confirming the flush position should not prevent replaying transactions with smaller LSNs. But unfortunately the problem is really present. Maybe it is caused by race conditions (although most logical decoder data is local to the backend). This is why I will try to create a reproducing scenario without multimaster.

> > Assume that we have transactions T1={start_lsn=100, end_lsn=400} and
> > T2={start_lsn=200, end_lsn=300}.
> > Transaction T2 is sent to the replica and the replica confirms that flush_lsn=300.
> > If now we want to restart logical decoding, we can not start with a position
> > less than 300, because CreateDecodingContext doesn't allow it:
>
> Right. You've already confirmed receipt of T2, so you can't receive it again.

So it means that we have no chance to restore T1?

> Wrong. You can, because the slot's restart_lsn will still be some LSN <=
> 100. The slot keeps track of in-progress transactions (using xl_running_xacts
> records) and knows it can't discard WAL past lsn 100 because xact T1 is still
> in-progress, so it must be able to decode from the start of it.
> > When you create a decoding context decoding starts at restart_lsn not at
> > confirmed_flush_lsn. confirmed_flush_lsn is the limit at which commits start
> > resulting in decoded data being sent to you. So in your case, T1 commits at
> > lsn=400, which is >300, so you'll receive the whole xact for T1.

Yes, but unfortunately it does happen. I need to understand why...

> It's all already there. See logical decoding's use of xl_running_xacts.

But how is this information persisted? What happens if the walsender is restarted?

> --
> Craig Ringer http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Training & Services
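The division of labor between the two slot LSNs that Craig describes can be summarized in rough pseudocode (a simplified sketch, not the exact logic of CreateDecodingContext):

/* Simplified sketch of the behavior described above, not actual code. */
XLogRecPtr	read_from = slot->data.restart_lsn;	/* WAL reading restarts here:
												 * early enough to fully decode
												 * any in-progress transaction */

if (start_lsn < slot->data.confirmed_flush)
	start_lsn = slot->data.confirmed_flush;		/* but decoded commits are only
												 * streamed to the client from
												 * this point onwards */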
Re: [HACKERS] Logical decoding restart problems
On 19.08.2016 19:06, Petr Jelinek wrote:

I don't think this will work correctly, there will be a gap between when the new slot starts to decode and the drop of the old one, as the new slot first needs to make a snapshot. Do I understand correctly that you are not using replication origins?

No, we are using replication origins to prevent recursive replication. The gap is not the biggest problem, because in multimaster we commit only those transactions which are acknowledged by all live nodes. So if some transaction falls into the gap... it is unlucky and will be aborted.

The problem is that for some reason the consistent point is not so consistent and we get partly decoded transactions. I.e. the transaction body consists of two UPDATEs, but reorder_buffer extracts only one (the last) update and sends this truncated transaction to the destination, causing a consistency violation at the replica. I started investigating the logical decoding code and found several things which I do not understand.

Never seen this happen. Do you have more details about what exactly is happening?

This is the transaction at the primary node:

root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec -ti node1 bash
postgres@9a04a0c9f246:/pg$ pg_xlogdump data/pg_xlog/00010001 | fgrep "tx: 6899"
rmgr: Heap len (rec/tot): 7/53, tx: 6899, lsn: 0/019EFBD8, prev 0/019EFB80, desc: LOCK off 12: xid 6899 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 60
rmgr: Heap len (rec/tot): 14/74, tx: 6899, lsn: 0/019EFC10, prev 0/019EFBD8, desc: HOT_UPDATE off 12 xmax 6899 ; new off 224 xmax 6899, blkref #0: rel 1663/12407/16421 blk 60
rmgr: Heap len (rec/tot): 7/53, tx: 6899, lsn: 0/019EFC60, prev 0/019EFC10, desc: LOCK off 153: xid 6899 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 49
rmgr: Heap len (rec/tot): 14/82, tx: 6899, lsn: 0/019EFC98, prev 0/019EFC60, desc: UPDATE off 153 xmax 6899 ; new off 55 xmax 6899, blkref #0: rel 1663/12407/16421 blk 62, blkref #1: rel 1663/12407/16421 blk 49
rmgr: Btree len (rec/tot): 2/64, tx: 6899, lsn: 0/019EFCF0, prev 0/019EFC98, desc: INSERT_LEAF off 294, blkref #0: rel 1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot): 236/265, tx: 6899, lsn: 0/019EFD30, prev 0/019EFCF0, desc: PREPARE
pg_xlogdump: FATAL: error in WAL record at 0/1AC0E70: invalid record length at 0/1AC0EA8: wanted 24, got 0

This is the replicated transaction at the other node (it is not clear from the trace, but believe me, it is the same transaction):

root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec -ti node2 bash
postgres@e5b16d82ce06:/pg$ pg_xlogdump data/pg_xlog/00010001 | fgrep "tx: 6882"
rmgr: Heap len (rec/tot): 14/74, tx: 6882, lsn: 0/019F3240, prev 0/019F31F0, desc: HOT_UPDATE off 113 xmax 6882 ; new off 219 xmax 0, blkref #0: rel 1663/12407/16421 blk 53
rmgr: Heap len (rec/tot): 14/82, tx: 6882, lsn: 0/019F5CB8, prev 0/019F5C60, desc: UPDATE off 163 xmax 6882 ; new off 4 xmax 0, blkref #0: rel 1663/12407/16421 blk 62, blkref #1: rel 1663/12407/16421 blk 51
rmgr: Btree len (rec/tot): 2/64, tx: 6882, lsn: 0/019F5D10, prev 0/019F5CB8, desc: INSERT_LEAF off 284, blkref #0: rel 1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot): 248/274, tx: 6882, lsn: 0/019F61F8, prev 0/019F60E8, desc: PREPARE
pg_xlogdump: FATAL: error in WAL record at 0/1AD3AD8: invalid record length at 0/1AD3B10: wanted 24, got 0

And a "shortened" version of the same transaction at the third (recovered) node:

root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec -ti node3 bash
postgres@b4066d4211bc:/pg$ pg_xlogdump
data/pg_xlog/00010001 | fgrep "tx: 6893"
rmgr: Heap len (rec/tot): 7/53, tx: 6893, lsn: 0/01A29828, prev 0/01A297E0, desc: LOCK off 172: xid 6893 LOCK_ONLY EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 50
rmgr: Heap len (rec/tot): 14/82, tx: 6893, lsn: 0/01A29860, prev 0/01A29828, desc: UPDATE+INIT off 172 xmax 6893 ; new off 1 xmax 6893, blkref #0: rel 1663/12407/16421 blk 64, blkref #1: rel 1663/12407/16421 blk 50
rmgr: Btree len (rec/tot): 2/64, tx: 6893, lsn: 0/01A298B8, prev 0/01A29860, desc: INSERT_LEAF off 314, blkref #0: rel 1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot): 236/265, tx: 6893, lsn: 0/01A298F8, prev 0/01A298B8, desc: PREPARE
pg_xlogdump: FATAL: error in WAL record at 0/1ACBBF8: invalid record length at 0/1ACBC30: wanted 24, got 0

You can see one update instead of two. Sorry, I have not saved the trace with the output of the logical decoder. But it really decodes just one update! What I have done:

DROP_REPLICATION_SLOT "mtm_slot_1";
CREATE_
[HACKERS] Logical decoding restart problems
Hi,

We are using logical decoding in multimaster, and we are faced with the problem that inconsistent transactions are sent to the replica. Briefly, multimaster uses logical decoding in this way:

1. Each multimaster node is connected with each other node using a logical decoding channel, so each pair of nodes has its own replication slot.
2. In the normal scenario, each replication channel is used to replicate only those transactions which originated at the source node. We use the origin mechanism to skip "foreign" transactions.
3. When an offline cluster node is brought back into the multimaster, we need to recover this node to the current cluster state. Recovery is performed from one of the cluster's nodes. So we use only one replication channel to receive all (own and foreign) transactions. Only in this case can we guarantee a consistent order of applying transactions at the recovered node. After the end of recovery we need to recreate the replication slots with all other cluster nodes (because we have already replayed the transactions from those nodes).

To restart logical decoding we first drop the existing slot, then create a new one and then start logical replication from WAL position 0/0 (invalid LSN). In this case recovery should start from the last consistent point. The problem is that for some reason the consistent point is not so consistent, and we get partly decoded transactions. I.e. the transaction body consists of two UPDATEs, but reorder_buffer extracts only one (the last) update and sends this truncated transaction to the destination, causing a consistency violation at the replica. I started investigating the logical decoding code and found several things which I do not understand.

Assume that we have transactions T1={start_lsn=100, end_lsn=400} and T2={start_lsn=200, end_lsn=300}. Transaction T2 is sent to the replica and the replica confirms that flush_lsn=300. If now we want to restart logical decoding, we can not start with a position less than 300, because CreateDecodingContext doesn't allow it:

 * start_lsn
 *		The LSN at which to start decoding.  If InvalidXLogRecPtr, restart
 *		from the slot's confirmed_flush; otherwise, start from the specified
 *		location (but move it forwards to confirmed_flush if it's older than
 *		that, see below).

	else if (start_lsn < slot->data.confirmed_flush)
	{
		/*
		 * It might seem like we should error out in this case, but it's
		 * pretty common for a client to acknowledge a LSN it doesn't have to
		 * do anything for, and thus didn't store persistently, because the
		 * xlog records didn't result in anything relevant for logical
		 * decoding. Clients have to be able to do that to support synchronous
		 * replication.
		 */

So it means that we have no chance to restore T1? What is worse, if there are valid T2 transaction records with lsn >= 300, then we can partly decode T1 and send this T1' to the replica. Have I missed something here? Is there an alternative way to "seek" the slot to the proper position without actually fetching data from it or recreating the slot?

Is there any mechanism in xlog which can enforce consistent decoding of a transaction (so that no transaction records are missed)? Maybe I missed something, but I didn't find any "record_number" or anything else which could identify the first record of a transaction.

Thanks in advance,
Konstantin Knizhnik,
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company
Re: [HACKERS] WIP: Barriers
On 15.08.2016 15:42, Thomas Munro wrote:

This implementation is using a spinlock for the arrival counter, and signals (via Robert's condition variables and latches) for waking up peer processes when the counter reaches the target. I realise that using signals for this sort of thing is a bit unusual outside the Postgres universe, but won't a semaphore-based implementation require just as many system calls, context switches and scheduling operations?

Yes, you are right. I never expected that this combination of signal + local socket + select could provide performance comparable with pthread_cond_t. I have implemented a simple test where two background workers emulate a request-response round-trip using latches and pthread primitives. The result (average round-trip time) was 7.49 microseconds for Postgres latches vs. 4.59 microseconds for pthread_cond_timedwait.

#define N_ROUNDTRIPS 100
#define WAIT_LATCH_TIMEOUT 6

static void
PongLatch(Datum arg)
{
	int			i;
	timestamp_t start;
	int			result;

	BackgroundWorkerUnblockSignals();

	Mtm->pong = MyProc->pgprocno;
	ResetLatch(&MyProc->procLatch);
	MtmSleep(100);
	Assert(Mtm->ping);

	for (i = 0; i <= N_ROUNDTRIPS; i++)
	{
		result = WaitLatch(&MyProc->procLatch, WL_LATCH_SET | WL_TIMEOUT, WAIT_LATCH_TIMEOUT);
		Assert(result & WL_LATCH_SET);
		ResetLatch(&MyProc->procLatch);
		SetLatch(&ProcGlobal->allProcs[Mtm->ping].procLatch);
		if (i == 0)
			start = MtmGetSystemTime();
	}
	fprintf(stderr, "Average roundtrip time: %f microseconds\n",
			(double) (MtmGetSystemTime() - start) / N_ROUNDTRIPS);
}

static void
PingLatch(Datum arg)
{
	int			i;
	timestamp_t start;
	int			result;

	BackgroundWorkerUnblockSignals();

	Mtm->ping = MyProc->pgprocno;
	ResetLatch(&MyProc->procLatch);
	MtmSleep(100);
	Assert(Mtm->pong);

	for (i = 0; i <= N_ROUNDTRIPS; i++)
	{
		SetLatch(&ProcGlobal->allProcs[Mtm->pong].procLatch);
		result = WaitLatch(&MyProc->procLatch, WL_LATCH_SET | WL_TIMEOUT, WAIT_LATCH_TIMEOUT);
		Assert(result & WL_LATCH_SET);
		ResetLatch(&MyProc->procLatch);
		if (i == 0)
			start = MtmGetSystemTime();
	}
	fprintf(stderr, "Average roundtrip time: %f microseconds\n",
			(double) (MtmGetSystemTime() - start) / N_ROUNDTRIPS);
}

static BackgroundWorker Pinger = {
	"ping",
	BGWORKER_SHMEM_ACCESS, // | BGWORKER_BACKEND_DATABASE_CONNECTION,
	BgWorkerStart_ConsistentState,
	BGW_NEVER_RESTART,
	PingLatch
};

static BackgroundWorker Ponger = {
	"pong",
	BGWORKER_SHMEM_ACCESS, // | BGWORKER_BACKEND_DATABASE_CONNECTION,
	BgWorkerStart_ConsistentState,
	BGW_NEVER_RESTART,
	PongLatch
};

static void
PingPong()
{
	RegisterBackgroundWorker(&Pinger);
	RegisterBackgroundWorker(&Ponger);
}

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] WIP: Barriers
Hi Thomas,

Barriers are a really simple and convenient mechanism for process synchronization. But a barrier is actually a special case of a semaphore: given a semaphore primitive, it is trivial to implement a barrier. We have semaphores in Postgres, but... they can not be used by extensions: there is a fixed number of semaphores allocated based on the maximal number of connections, and there is no mechanism for requesting additional semaphores.

Robert has recently proposed condition variables, which are also very useful. Right now we have spinlocks, LW-locks and latches. From my point of view that is not enough. While creating various extensions for Postgres I have always felt the lack of such synchronization primitives as events (condition variables) and semaphores. Events and semaphores are similar, and it is possible to implement either of them based on the other. But from the user's point of view they have different semantics and use cases, so it is better to provide both of them.

I wonder if we should provide a system-abstraction-layer (SAL) for Postgres, where we can isolate all system-dependent code and which will be available for core developers as well as for developers of extensions? The obvious candidates for SAL are:

1. Synchronization primitives (locks, events, semaphores, mutexes, spinlocks, latches, barriers)
2. Shared memory
3. File access
4. Network sockets
5. Process control (fork, ...)

Certainly it requires a lot of refactoring, but it will make the Postgres code much more elegant and easier to read and maintain. Also it is not necessary to do all these changes in one step: here we do not need atomic transactions:) We can start, for example, with synchronization primitives, since in any case a lot of changes are proposed here.

Parallel execution is one of the most promising approaches to improving Postgres performance. I do not mean just parallel execution of a single query. Parallel vacuum, parallel index creation, parallel sort, ... And to implement all this stuff we definitely need convenient and efficient synchronization primitives. The set of such primitives can be discussed. IMHO it should include RW-locks (current LW-locks), mutexes (spinlock + some mechanism to wait), events (condition variables), semaphores, and barriers (based on semaphores). Latches can be left for backward compatibility or be replaced with events. I wonder if somebody has measured how much slower latches (signal+socket) are than POSIX semaphores or condition variables?

On Aug 14, 2016, at 2:18 AM, Thomas Munro wrote:

> Hi hackers,
>
> I would like to propose "barriers" for Postgres processes. A barrier
> is a very simple mechanism for coordinating parallel computation, as
> found in many threading libraries.
>
> First, you initialise a Barrier object somewhere in shared memory,
> most likely in the DSM segment used by parallel query, by calling
> BarrierInit(&barrier, nworkers). Then workers can call
> BarrierWait(&barrier) when they want to block until all workers arrive
> at the barrier. When the final worker arrives, BarrierWait returns in
> all workers, releasing them to continue their work. One arbitrary
> worker receives a different return value as a way of "electing" it to
> perform serial phases of computation. For parallel phases of
> computation, the return value can be ignored. For example, there may
> be preparation, merging, or post-processing phases which must be done
> by just one worker, interspersed with phases where all workers do
> something.
> > My use case for this is coordinating the phases of parallel hash > joins, but I strongly suspect there are other cases. Parallel sort > springs to mind, which is why I wanted to post this separately and > earlier than my larger patch series, to get feedback from people > working on other parallel features. > > A problem that I'm still grappling with is how to deal with workers > that fail to launch. What I'm proposing so far is based on static > worker sets, where you can only give the number of workers at > initialisation time, just like pthread_barrier_init. Some other > libraries allow for adjustable worker sets, and I'm wondering if a > parallel leader might need to be able to adjust the barrier when it > hears of a worker not starting. More on that soon. > > Please see the attached WIP patch. I had an earlier version with its > own waitlists and signalling machinery etc, but I've now rebased it to > depend on Robert Haas's proposed condition variables, making this code > much shorter and sweeter. So it depends on his > condition-variable-vX.patch[1], which in turn depends on my > lwlocks-in-dsm-vX.patch[2] (for proclist). > > When Michaël Paquier's work on naming wait points[3] lands, I plan to > include event IDs as an extra argument to BarrierWait which will be > passed though so as to show up in pg_stat_activity. Then you'll be > able to see where workers are waiting for each other! For now I > didn't want to tangle this up with yet
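Based on the description above, usage would presumably look something like this (BarrierInit/BarrierWait names are taken from the text; the phase functions are placeholders, and BarrierInit(&barrier, nworkers) is assumed to have been called once in the DSM segment before the workers start):

static void
parallel_worker_main(Barrier *barrier)
{
	build_local_partition();		/* parallel phase (placeholder) */

	if (BarrierWait(barrier))		/* returns true in exactly one worker */
		merge_partitions();			/* serial phase, done by the elected worker */

	BarrierWait(barrier);			/* everyone waits for the merge to finish */

	probe_merged_result();			/* parallel phase again (placeholder) */
}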
Re: [HACKERS] Restriction of windows functions
On 17.06.2016 17:01, Tom Lane wrote:

Certainly it assumes that the window is ordered by the key and the key type supports subtraction, so "text" can not be used here.

IIRC, the sticking point was defining a reasonably datatype-independent (i.e. extensible) notion of distance.

Why is it not possible just to locate the "-" or "+" operator for this type? I do not see any difference here from locating the comparison operator needed for sorting.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
[HACKERS] Restriction of windows functions
Hi,

One of the popular queries in financial analytic systems is to calculate some moving aggregate within a time interval, i.e. a moving average of the trade price within a 5-minute window. Unfortunately this case is not supported by PostgreSQL:

select symbol,date,avg(price) over (order by date range between '5 minutes' preceding and current row) from Trades;
ERROR: RANGE PRECEDING is only supported with UNBOUNDED

Is there some fundamental problem in implementing such a kind of window? Maybe I missed something, but it seems to me that it should not be very difficult. There is the update_frameheadpos function, which adjusts the frame head position in ROWS mode and reports an error in RANGE mode:

		if (frameOptions & FRAMEOPTION_ROWS)
		{
			/* In ROWS mode, bound is physically n before/after current */
			int64		offset = DatumGetInt64(winstate->startOffsetValue);

			if (frameOptions & FRAMEOPTION_START_VALUE_PRECEDING)
				offset = -offset;

			winstate->frameheadpos = winstate->currentpos + offset;
			/* frame head can't go before first row */
			if (winstate->frameheadpos < 0)
				winstate->frameheadpos = 0;
			else if (winstate->frameheadpos > winstate->currentpos)
			{
				/* make sure frameheadpos is not past end of partition */
				spool_tuples(winstate, winstate->frameheadpos - 1);
				if (winstate->frameheadpos > winstate->spooled_rows)
					winstate->frameheadpos = winstate->spooled_rows;
			}
			winstate->framehead_valid = true;
		}
		else if (frameOptions & FRAMEOPTION_RANGE)
		{
			/* parser should have rejected this */
			elog(ERROR, "window frame with value offset is not implemented");
		}
		else

The straightforward approach to supporting RANGE mode is to advance the head position until the "distance" between the head and the current row is less than or equal to the specified range value. It does not look too complex to implement, does it? Are there some caveats?

Certainly it assumes that the window is ordered by the key and the key type supports subtraction, so "text" can not be used here. Something else?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
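A sketch of what the missing branch might look like under these assumptions (head_row_in_range is a placeholder for evaluating key(current) - key(head) <= offset using the key type's subtraction operator; it is not an existing function):

		else if (frameOptions & FRAMEOPTION_RANGE)
		{
			/*
			 * Sketch only: advance the frame head until the head row's
			 * ORDER BY key is within startOffsetValue of the current row.
			 */
			while (!head_row_in_range(winstate,
									  winstate->frameheadpos,
									  winstate->currentpos,
									  winstate->startOffsetValue))
				winstate->frameheadpos++;
			winstate->framehead_valid = true;
		}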
Re: [HACKERS] array of domain types
On 03.06.2016 02:02, Rod Taylor wrote:

On Thu, Jun 2, 2016 at 10:42 AM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:

On 02.06.2016 17:22, Tom Lane wrote:

konstantin knizhnik <k.knizh...@postgrespro.ru> writes:

Attached please find patch for DefineDomain function.

You didn't attach the patch,

Sorry, but I did attach the patch - I see the attachment in my mail received from the group.

Multidimensional arrays work fine:

knizhnik=# SELECT '{{14},{20}}'::teenager[][];
ERROR: value for domain teenager violates check constraint "teenager_check"
LINE 1: SELECT '{{14},{20}}'::teenager[][];
               ^
knizhnik=# SELECT '{{14},{19}}'::teenager[][];
  teenager
-------------
 {{14},{19}}
(1 row)

knizhnik=# SELECT ('{{14},{19}}'::teenager[][])[1][1];
 teenager
----------
       14
(1 row)

A domain of an array of a domain also works:

I applied the domain.patch from above on HEAD, and all I get is cache lookup failures. The type_sanity regression test fails too.

postgres=# CREATE DOMAIN teenager AS int CHECK (VALUE BETWEEN 13 AND 20);
CREATE DOMAIN
postgres=# CREATE DOMAIN teenager_groups AS teenager[];
CREATE DOMAIN
postgres=# CREATE TABLE x (col teenager_groups);
ERROR: cache lookup failed for type 0

Anyway, if that worked for me, I would have done this, which I expect will succeed when it shouldn't:

INSERT INTO x VALUES (ARRAY[13,14,20]);
ALTER DOMAIN teenager DROP CONSTRAINT teenager_check;
ALTER DOMAIN teenager ADD CHECK (VALUE BETWEEN 13 AND 19);

Sorry, the problem is more difficult than I originally expected:(

The attached patch passes all regression tests and correctly handles conversion of arrays. But constraints are not checked for table columns. I failed to locate the place where this check should be inserted...

Originally I was mostly interested in domains as a kind of typedef: a convenient way to assign a type to some particular kind of columns, for example an object reference used in ORM. There are two main goals of using a domain here:

1. Be able to easily change the representation of object identifiers, for example from integer to bigint.
2. Detect all columns containing references (distinguish them from columns containing just normal integers).

I do not see any other mechanism in PostgreSQL which can address this problem (for example a user-defined type can not help here). I wonder if it is possible to support arrays of domains which do not have constraints? Or is such partial support worse than prohibiting arrays of domains at all?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/commands/typecmds.c b/src/backend/commands/typecmds.c
index 71d4df9..e227fa8 100644
--- a/src/backend/commands/typecmds.c
+++ b/src/backend/commands/typecmds.c
@@ -752,11 +752,13 @@ DefineDomain(CreateDomainStmt *stmt)
 	ListCell   *listptr;
 	Oid			basetypeoid;
 	Oid			old_type_oid;
+	Oid			array_oid;
 	Oid			domaincoll;
 	Form_pg_type baseType;
 	int32		basetypeMod;
 	Oid			baseColl;
 	ObjectAddress address;
+	char	   *array_type;

 	/* Convert list of names to a name and namespace */
 	domainNamespace = QualifiedNameGetCreationNamespace(stmt->domainname,
@@ -1023,6 +1025,14 @@ DefineDomain(CreateDomainStmt *stmt)
 		}
 	}

+	/*
+	 * OK, we're done checking, time to make the type. We must assign the
+	 * array type OID ahead of calling TypeCreate, since the base type and
+	 * array type each refer to the other.
+	 */
+	array_oid = AssignTypeArrayOid();
+
 	/*
 	 * Have TypeCreate do all the real work.
 	 */
@@ -1047,7 +1057,7 @@ DefineDomain(CreateDomainStmt *stmt)
 			   analyzeProcedure,	/* analyze procedure */
 			   InvalidOid,	/* no array element type */
 			   false,	/* this isn't an array */
-			   InvalidOid,	/* no arrays for domains (yet) */
+			   array_oid,	/* array type we are about to create */
 			   basetypeoid,	/* base type ID */
 			   defaultValue,	/* default type value (text) */
 			   defaultValueBin,	/* default type value (binary) */
@@ -1060,6 +1070,48 @@ DefineDomain(CreateDomainStmt *stmt)
 			   domaincoll);	/* type's collation */

 	/*
+	 * Create the array type that goes with it.
+	 */
+	array_type = makeArrayTypeName(domainName, domainNamespace);
+
+	/* alignment must be 'i' or 'd' for arrays */
+	alignment = (alignment == 'd') ? 'd' : 'i';
+
+	TypeCreate(array_oid,	/* force assignment of this type OID */
+			   array_type,	/* type name */
+			   domainNamespace,	/* namespace */
+			   InvalidOid,	/* relation oid (n/a here) */
+			   0,	/* relation kind (ditto) */
+			   GetUserId(),	/* owner's ID */
+			   -1,	/* internal size (always varlena) */
+			   TYPTYPE_BASE,	/* type-type (base type) */
+			   TYPCATEGORY_ARRAY,	/* type-category (array) */
+			   fal
Re: [HACKERS] XTM & parallel search
On 03.06.2016 16:05, Amit Kapila wrote:

On Fri, Jun 3, 2016 at 1:34 AM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:

We have to add three more functions to the eXtensible Transaction Manager API (XTM):

/*
 * Calculate transaction state size. This method is invoked by EstimateTransactionStateSpace to copy transaction
 * state to parallel workers
 */
size_t (*GetTransactionStateSize)(void);

/*
 * Serialize transaction state
 */
void (*SerializeTransactionState)(void *ctx);

/*
 * Deserialize transaction state
 */
void (*DeserializeTransactionState)(void *ctx);

In the above proposal, are you suggesting changing the existing APIs as well, because the parameters of the function pointers don't match the existing APIs? I think it is better to consider this along with the overall XTM API.

Sorry, but right now I have not replaced the existing functions EstimateTransactionStateSpace/SerializeTransactionState/StartParallelWorkerTransaction with XTM indirect calls. That was my original intention, but these functions access the static variable CurrentTransactionState defined in xact.c. So if a user-defined TM wants to override these functions, it will have to invoke the original functions to save/restore CurrentTransactionState. That is not convenient. This is why the three XTM functions above are now called by the existing xact functions to save additional state, for example:

Size
EstimateTransactionStateSpace(void)
{
	TransactionState s;
	Size		nxids = 6;		/* iso level, deferrable, top & current XID,
								 * command counter, XID count */

	for (s = CurrentTransactionState; s != NULL; s = s->parent)
	{
		if (TransactionIdIsValid(s->transactionId))
			nxids = add_size(nxids, 1);
		nxids = add_size(nxids, s->nChildXids);
	}

	nxids = add_size(nxids, nParallelCurrentXids);
	nxids = mul_size(nxids, sizeof(TransactionId));

	return add_size(nxids, TM->GetTransactionStateSize());
}

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
[HACKERS] XTM & parallel search
We have to add three more functions to the eXtensible Transaction Manager API (XTM):

/*
 * Calculate transaction state size. This method is invoked by EstimateTransactionStateSpace to copy transaction
 * state to parallel workers
 */
size_t (*GetTransactionStateSize)(void);

/*
 * Serialize transaction state
 */
void (*SerializeTransactionState)(void *ctx);

/*
 * Deserialize transaction state
 */
void (*DeserializeTransactionState)(void *ctx);

The reason is that we found out that our multimaster does not work correctly when max_parallel_workers > 0, because the multimaster transaction context is not properly shared between workers. Unfortunately, right now serialization/deserialization of transaction state is hardcoded in xact.c and IMHO is done in a quite ugly way:

	XactIsoLevel = (int) tstate[0];
	XactDeferrable = (bool) tstate[1];
	XactTopTransactionId = tstate[2];
	CurrentTransactionState->transactionId = tstate[3];
	currentCommandId = tstate[4];
	nParallelCurrentXids = (int) tstate[5];
	ParallelCurrentXids = &tstate[6];

- there is not even a declared structure for the fixed part of the saved context. I wonder whether DTM is the only component that would be interested in sharing some common state between workers, and whether we should provide some general way of replicating user-defined context to workers? From my point of view, XTM seems to be a good place for it...

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
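For illustration, a custom transaction manager might implement the three callbacks along these lines (a sketch; DtmState and its fields are assumed names, not part of any existing API):

typedef struct DtmState
{
	TransactionId	xid;		/* global transaction id (assumed) */
	int				status;		/* prepared/precommitted/committed (assumed) */
} DtmState;

static DtmState DtmCurrentState;

static size_t
DtmGetTransactionStateSize(void)
{
	return sizeof(DtmState);
}

static void
DtmSerializeTransactionState(void *ctx)
{
	memcpy(ctx, &DtmCurrentState, sizeof(DtmState));
}

static void
DtmDeserializeTransactionState(void *ctx)
{
	memcpy(&DtmCurrentState, ctx, sizeof(DtmState));
}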
Re: [HACKERS] array of domain types
On 02.06.2016 17:22, Tom Lane wrote:

konstantin knizhnik <k.knizh...@postgrespro.ru> writes:
Attached please find patch for DefineDomain function.

You didn't attach the patch,

Sorry, but I did attach the patch - I see the attachment in my mail received from the group. Multidimensional arrays work fine:

knizhnik=# SELECT '{{14},{20}}'::teenager[][];
ERROR:  value for domain teenager violates check constraint "teenager_check"
LINE 1: SELECT '{{14},{20}}'::teenager[][];
               ^
knizhnik=# SELECT '{{14},{19}}'::teenager[][];
  teenager
-------------
 {{14},{19}}
(1 row)

knizhnik=# SELECT ('{{14},{19}}'::teenager[][])[1][1];
 teenager
----------
       14
(1 row)

A domain of an array of a domain also works:

knizhnik=# create domain teenager_groups as teenager[];
CREATE DOMAIN
knizhnik=# SELECT '{{14},{19}}'::teenager_groups;
 teenager_groups
-----------------
 {{14},{19}}
(1 row)

knizhnik=# SELECT '{{14},{20}}'::teenager_groups;
ERROR:  value for domain teenager violates check constraint "teenager_check"
LINE 1: SELECT '{{14},{20}}'::teenager_groups;

but in any case, I would be astonished if there is no work required beyond creating the matching array type. The reverse case (domains over arrays) has half a dozen special cases required to make it work smoothly. Probably the considerations on this side are totally different, but it's hard to believe there aren't any. One case that seems likely to be pretty squishy is an array of a domain over an array type. One would wish to be able to do foo[2][4] to extract an element of the contained array. That won't work as-is because the notation will be taken as a multi-dimensional subscript, but I would expect that (foo[2])[4] should work. Does it? Does ruleutils.c always include the necessary parens when reverse-listing such a construct? Is it possible to assign to such a sub-element, and if so, do the domain constraints get checked properly? Domain over an array that is of a domain type might be another fun case.

regards, tom lane

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
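For reference, a sketch of the squishy case Tom describes, runnable once arrays of domains are accepted (I have not verified how the patched code handles it, so this is a test case, not a claim about the result):

create domain intarr as int[];
create table t(foo intarr[]);   -- array of a domain over an array type
insert into t values (array['{1,2,3}'::intarr, '{4,5,6}'::intarr]);
select (foo[2])[1] from t;      -- should extract an element of the contained array
select foo[2][1] from t;        -- taken as a multi-dimensional subscript instead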
Re: [HACKERS] array of domain types
On Jun 2, 2016, at 12:29 PM, Thom Brown wrote:

On 2 June 2016 at 10:13, konstantin knizhnik <k.knizh...@postgrespro.ru> wrote:

Yes, it doesn't work:

# CREATE DOMAIN teenager AS int CHECK (VALUE BETWEEN 13 AND 19);
CREATE DOMAIN
# SELECT 14::teenager;
 teenager
----------
       14
(1 row)

# SELECT 20::teenager;
ERROR:  value for domain teenager violates check constraint "teenager_check"

# SELECT '{14,20}'::teenager[];
 teenager
----------
 {14,20}
(1 row)

That last one should fail.

Thom

Yes, I see. This approach was wrong. Attached please find patch for DefineDomain function.

domain.patch
Description: Binary data
Re: [HACKERS] array of domain types
On Jun 1, 2016, at 4:37 PM, Thom Brown wrote:

> On 1 June 2016 at 14:20, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:
> I wonder why domain types can not be used for specification of array element:
>
> create domain objref as bigint;
> create table foo(x objref[]);
> ERROR:  type "objref[]" does not exist
> create table foo(x bigint[]);
> CREATE TABLE
>
> Is there some principle problem here or it is just not implemented?
>
> It's not implemented, but patches welcome.
>
> Thom

The patch is trivial: just use typbasetype in get_array_type if typtype is TYPTYPE_DOMAIN:

diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index cb26d79..ecfbb20 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -2486,7 +2486,18 @@ get_array_type(Oid typid)
 	if (HeapTupleIsValid(tp))
 	{
 		result = ((Form_pg_type) GETSTRUCT(tp))->typarray;
-		ReleaseSysCache(tp);
+		if (result == InvalidOid && ((Form_pg_type) GETSTRUCT(tp))->typtype == TYPTYPE_DOMAIN) {
+			typid = ((Form_pg_type) GETSTRUCT(tp))->typbasetype;
+			ReleaseSysCache(tp);
+			tp = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typid));
+			if (HeapTupleIsValid(tp))
+			{
+				result = ((Form_pg_type) GETSTRUCT(tp))->typarray;
+				ReleaseSysCache(tp);
+			}
+		} else {
+			ReleaseSysCache(tp);
+		}
 	}
 	return result;
 }

Any problems with it?
[HACKERS] array of domain types
I wonder why domain types cannot be used for specification of an array element:

create domain objref as bigint;
create table foo(x objref[]);
ERROR:  type "objref[]" does not exist
create table foo(x bigint[]);
CREATE TABLE

Is there some fundamental problem here, or is it just not implemented?

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] Logical replication & oldest XID.
Hi, We are using logical replication in multimaster and have run into an interesting problem with a "frozen" procArray->replication_slot_xmin. This variable is adjusted by ProcArraySetReplicationSlotXmin, which is invoked by ReplicationSlotsComputeRequiredXmin, which in turn is called by LogicalConfirmReceivedLocation. If transactions are executed at all nodes of the multimaster, then everything works fine: replication_slot_xmin is advanced. But if we send transactions only to one multimaster node and broadcast its changes to the other nodes, then no data is sent through the replication slots at those nodes. No data sent - no confirmations: LogicalConfirmReceivedLocation is never called and procArray->replication_slot_xmin preserves its original value 599. As a result the GetOldestXmin function always returns 599, so autovacuum is effectively blocked and our multimaster is not able to perform cleanup of the XID->CSN map, which causes shared memory overflow. This situation happens only when write transactions are sent to just one node, or when there are no write transactions at all. Before implementing some workaround (for example, forcing periodic calls of ReplicationSlotsComputeRequiredXmin), I want to understand whether this is a real problem of logical replication or we are doing something wrong. BDR should face the same problem if all updates are performed from one node... -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
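The frozen horizon is easy to observe from SQL; a minimal diagnostic sketch (column set as of the 9.6-era catalogs - on the idle nodes the slots' xmin/catalog_xmin simply never advance):

select slot_name, active, xmin, catalog_xmin, restart_lsn
from pg_replication_slots;

-- compare with the current XID to see how far the horizon lags behind:
select txid_current();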
Re: [HACKERS] Inheritance
There is one aspect of inheritance support which was not mentioned: polymorphic queries. Actually polymorphism is the fundamental feature of OOP; without it there is no behavioral inheritance, and inheritance can be considered just "syntactic sugar" for sharing some common subset of attributes between tables. The main problem with supporting polymorphic queries is that an SQL query returns a set of tuples, not a set of objects. So there is no nice way to return instances of both base and derived tables (a concrete example follows at the end of this message). There are several alternatives (for example, return the joined set of attributes of all derived tables, leaving the missing ones as NULLs), but none of them is good. There is a more specific problem - polymorphic function calls, see: http://postgresql.nabble.com/Polymorphic-function-calls-td5784772.html which in principle can be solved... But it looks like there is no such intention. I do not know how many people would be interested in normal OO support from a DBMS. Right now PostgreSQL inheritance is used for different purposes: partitioning, FDW-based replication or sharding, ... But IMHO that is mostly because the current inheritance implementation just does not make it possible to use it for OOP. As far as I know, there were many discussions before about whether arrays, json and other types non-traditional for an RDBMS were needed. But right now they are quite popular and frequently used. The same thing can happen with OO support: it will be popular if it provides the necessary functionality. Right now OO DBMS applications mostly use ORMs, and they do not use PostgreSQL inheritance. Mostly because ORMs tend to be more flexible in providing different approaches to implementing inheritance (for example, in most cases placing all derived classes in a single table works better than the approach based on appending results of several queries). Another reason is that an ORM usually supports more than one DBMS, and since most of them have no built-in support for inheritance, the ORM has to implement inheritance itself. But I do not think that the presence of ORMs eliminates the need for internal OO support in a DBMS.

On 24.05.2016 16:51, Jim Nasby wrote:

On 5/23/16 10:30 PM, Craig Ringer wrote:
I find it notable that no popular ORM has bothered adopting PostgreSQL's inheritance features, and instead just use big left joins or repeated SELECTs to implement parent/child relationships, with foreign keys enforcing constraints.

Since when do we consider ORMs to be an example of how to do good database design? In this case, I'm sure no ORM uses the feature because either the ORM's authors have no clue it exists (except maybe for partitioning) or because we're the only mainstream database that has it.

I consider inheritance mostly useless without the ability to have UNIQUE indexes that span a parent relation and all its children. You can use them for partitioning only by sacrificing a bunch of integrity protection or creating messy chains of FKs between individual partitions.

There's ways around this issue, but I agree that it's a very unhappy situation right now.

I'd rather like to quietly deprecate inheritance and eventually remove it once we have real partitioning and some time has passed...

IMO that's a very unfortunate attitude to have for the "Worlds Most Advanced Open Source Database". Now that OO programming isn't treated as a magic bullet it's proven to be an extremely powerful tool, especially when used to encapsulate complex data. The ndarray->Series->DataFrame->Panel inheritance tree is a great example of this.
Feature-wise, we need to be doing *more* of this kind of work, not less. Lack of support for OO paradigms was one of the drivers for NoSQL. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
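To make the polymorphic-query problem mentioned above concrete, a minimal example (plain SQL, assuming nothing beyond the current inheritance behavior):

create table person(name text, age int);
create table employee(salary numeric) inherits (person);
insert into employee values ('Alice', 30, 1000.0);
-- The scan of the parent does include the rows of all children...
select * from person;
-- ...but only the parent's columns are visible: there is no way to get
-- employee.salary here, i.e. no way to treat the result set as polymorphic objects.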
Re: [HACKERS] Just-in-time compiling things
On 05/14/2016 12:10 PM, Andreas Seltenreich wrote:

Konstantin Knizhnik writes:
Latest information from the ISP RAS guys: they have made good progress since February: they have rewritten most of the methods of Scan, Aggregate and Join to the LLVM API.

Is their work available somewhere? I'm experimenting in that area as well, although I'm using libFirm instead of LLVM. I wonder what their motivation to rewrite backend code in LLVM IR was, since I am following the approach of keeping the IR around when compiling the vanilla postgres C code, possibly inlining it during JIT and then doing optimizations on this IR. That way the logic doesn't have to be duplicated.

The work is not yet completed, but finally it will definitely be put into open source. I am going to talk a little bit about this project at PGCon in Ottawa at the lightning talks, although I do not know the details of the project myself. The main difference of their approach compared with Vitesse DB is that they implemented a way of automatic conversion of PostgreSQL operators to LLVM IR. So instead of rewriting ~2000 operators manually (a lot of work and errors), they implemented a converter which transforms the code of these operators into C++ code producing LLVM IR. So they only need to rewrite the plan nodes manually. They have already implemented most of the nodes (SeqScan, Sort, HashJoin, ...), which allows all TPC-H queries to be executed. Results will be published soon. The largest advantage is definitely at Q1 - about 4 times. It is worse than Vitesse DB (8 times) and than manually written operators (7 times). The most probable cause of this performance penalty is overflow checking: in manually written LLVM code it can be done more efficiently, using the corresponding assembler instruction, than in code automatically converted from standard C. But the ISP RAS guys are going to fix this problem and improve the quality of the automatic conversion. I include in CC members of the ISP RAS team - you can ask them questions directly.

regards, Andreas

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Academic help for Postgres
On May 12, 2016, at 6:16 AM, Rajeev rastogi wrote:

> On 11 May 2016 19:50, Bruce Momjian Wrote:
>
>> I am giving a keynote at an IEEE database conference in Helsinki next
>> week (http://icde2016.fi/). (Yes, I am not attending PGCon Ottawa
>> because I accepted the Helsinki conference invitation before the PGCon
>> Ottawa date was changed from June to May).
>>
>> As part of the keynote, I would like to mention areas where academia can
>> help us. The topics I can think of are:
>>
>> Query optimization
>> Optimizer statistics
>> Indexing structures
>> Reducing function call overhead
>> CPU locality
>> Sorting
>> Parallelism
>> Sharding
>>
>> Any others?
>
> How about?
> 1. Considering NUMA aware architecture.
> 2. Optimizer tuning as per new hardware trends.
> 3. More effective version of Join algorithms (e.g. Compare to traditional "build and then probe" mechanism of Hash Join, now there is pipelining Hash join where probe and build both happens together).

Interesting article about optimal joins: http://arxiv.org/pdf/1203.1952v1.pdf

> Thanks and Regards,
> Kumar Rajeev Rastogi
Re: [HACKERS] Academic help for Postgres
On 11.05.2016 17:32, Bruce Momjian wrote:

On Wed, May 11, 2016 at 05:31:10PM +0300, Konstantin Knizhnik wrote:

On 11.05.2016 17:20, Bruce Momjian wrote:
I am giving a keynote at an IEEE database conference in Helsinki next week (http://icde2016.fi/). (Yes, I am not attending PGCon Ottawa because I accepted the Helsinki conference invitation before the PGCon Ottawa date was changed from June to May). As part of the keynote, I would like to mention areas where academia can help us. The topics I can think of are:

Query optimization
Optimizer statistics
Indexing structures
Reducing function call overhead
CPU locality
Sorting
Parallelism
Sharding

Any others?

Incremental materialized views?

I don't know. Is that something academics would research?

I am not sure. There is definitely the question of which views can be incrementally recalculated, and of which inductive extension has to be constructed to make that possible. If you google for "incremental materialized views phd", you will get a large number of references to articles. But I do not know whether all questions in this area are already closed or not...

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Academic help for Postgres
On 11.05.2016 17:20, Bruce Momjian wrote:

I am giving a keynote at an IEEE database conference in Helsinki next week (http://icde2016.fi/). (Yes, I am not attending PGCon Ottawa because I accepted the Helsinki conference invitation before the PGCon Ottawa date was changed from June to May). As part of the keynote, I would like to mention areas where academia can help us. The topics I can think of are:

Query optimization
Optimizer statistics
Indexing structures
Reducing function call overhead
CPU locality
Sorting
Parallelism
Sharding

Any others?

Incremental materialized views?

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] asynchronous and vectorized execution
On 10.05.2016 20:26, Robert Haas wrote:

At this moment (February) they had implemented translation of only a few PostgreSQL operators used by ExecQuals and did not support aggregates. They got about a 2x increase of speed on synthetic queries and 25% on TPC-H Q1 (for Q1 the most critical part is generation of native code for aggregates, because ExecQual itself takes only 6% of the time for this query). Actually these 25% for Q1 were achieved not by using dynamic code generation, but by switching from the PULL to the PUSH model in the executor. It seems to be yet another interesting PostgreSQL executor transformation. As far as I know, they are going to publish the results of their work as open source...

Interesting. You may notice that in "asynchronous mode" my prototype works using a push model of sorts. Maybe that should be taken further.

Latest information from the ISP RAS guys: they have made good progress since February: they have rewritten most of the methods of Scan, Aggregate and Join to the LLVM API. They have also implemented automatic translation of PostgreSQL backend functions to the LLVM API. As a result the time of the TPC-H Q1 query is reduced four times.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] asynchronous and vectorized execution
On 11.05.2016 17:00, Robert Haas wrote:

On Tue, May 10, 2016 at 3:42 PM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:
Doesn't this actually mean that we need a normal job scheduler which is given a queue of jobs and, having some pool of threads, will be able to organize efficient execution of queries? The optimizer can build a pipeline (graph) of tasks, which corresponds to the execution plan nodes, i.e. SeqScan, Sort, ... Each task is split into several jobs which can be concurrently scheduled by a task dispatcher. So you will not have a blocked worker waiting for something and all system resources will be utilized. Such an approach with a dispatcher allows quotas and priorities to be implemented, ... Also the dispatcher can take care of NUMA and cache optimizations, which is especially critical on modern architectures. One more reference: http://db.in.tum.de/~leis/papers/morsels.pdf

I read this as a proposal to redesign the entire optimizer and executor to use some new kind of plan. That's not a project I'm willing to entertain; it is hard to imagine we could do it in a reasonable period of time without introducing bugs and performance regressions. I think there is a great deal of performance benefit that we can get by changing things incrementally.

Yes, I agree with you that a complete rewrite of the optimizer is a huge project with unpredictable influence on the performance of some queries. Changing things incrementally is a good approach, but only if we are moving in the right direction. I am still not sure that the introduction of async operations is a step in the right direction. Async ops significantly complicate code (since you have to maintain state yourself). It will be bad if the implementation of each node has to deal with async state itself in its own manner. My suggestion is to try to provide some generic mechanism for managing state transitions and have some scheduler which controls this process. It should not be the responsibility of a node implementation to organize asynchronous/parallel execution. Instead it should just produce a set of jobs whose execution is controlled by the scheduler. The first implementation of the scheduler can be quite simple. But later it can become more clever: try to bind data to processors and do many other optimizations.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] asynchronous and vectorized execution
On 05/10/2016 08:26 PM, Robert Haas wrote:

On Tue, May 10, 2016 at 3:00 AM, konstantin knizhnik <k.knizh...@postgrespro.ru> wrote:
What's wrong with a worker being blocked? You can just have more workers (more than CPU cores) to let the others continue to do useful work.

Not really. The workers are all running the same plan, so they'll all make the same decision about which node needs to be executed next. If that node can't accommodate multiple processes trying to execute it at the same time, it will have to block all of them but the first one. Adding more processes just increases the number of processes sitting around doing nothing.

Doesn't this actually mean that we need a normal job scheduler which is given a queue of jobs and, having some pool of threads, will be able to organize efficient execution of queries? The optimizer can build a pipeline (graph) of tasks, which corresponds to the execution plan nodes, i.e. SeqScan, Sort, ... Each task is split into several jobs which can be concurrently scheduled by a task dispatcher. So you will not have a blocked worker waiting for something and all system resources will be utilized. Such an approach with a dispatcher allows quotas and priorities to be implemented, ... Also the dispatcher can take care of NUMA and cache optimizations, which is especially critical on modern architectures. One more reference: http://db.in.tum.de/~leis/papers/morsels.pdf

Sorry, maybe I am wrong, but I still think that async ops are "multitasking for the poor" :) Yes, maintaining threads and especially separate processes adds significant overhead. It leads to extra resource consumption, and context switches are quite expensive. And I know from my own experience that replacing several concurrent processes performing some IO (for example with sockets) with just one process using multiplexing allows performance to be increased. But still, async ops are just a way to make the programmer responsible for managing the state machine instead of relying on the OS to make context switches. A manual transmission is still more efficient than an automatic transmission, but still most drivers prefer the latter ;) Seriously, I carefully read your response to Kochei, but am still not convinced that async ops are what we need. Or maybe we just understand different things by this notion. But there are some researches, for example: http://www.vldb.org/pvldb/vol4/p539-neumann.pdf showing that the same or even better effect can be achieved by generating native code for the query execution plan (which is not so difficult now, thanks to LLVM). It eliminates interpretation overhead and increases cache locality. I got similar results with my own experiments on accelerating SparkSQL. Instead of native code generation I used conversion of query plans to C code and experimented with different data representations. A "horizontal model" with loading of columns on demand shows better performance than a columnar store.

Yes, I think this approach should also be considered.

At this moment (February) they had implemented translation of only a few PostgreSQL operators used by ExecQuals and did not support aggregates. They got about a 2x increase of speed on synthetic queries and 25% on TPC-H Q1 (for Q1 the most critical part is generation of native code for aggregates, because ExecQual itself takes only 6% of the time for this query). Actually these 25% for Q1 were achieved not by using dynamic code generation, but by switching from the PULL to the PUSH model in the executor. It seems to be yet another interesting PostgreSQL executor transformation.
As far as I know, they are going to publish the results of their work as open source...

Interesting. You may notice that in "asynchronous mode" my prototype works using a push model of sorts. Maybe that should be taken further.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] asynchronous and vectorized execution
Hi,

> 1. asynchronous execution,

It seems to me that asynchronous execution can be considered as an alternative to the multithreading model (in the case of PostgreSQL the roles of threads are played by workers). Async operations are used to have smaller overhead, but they have scalability problems (because in most implementations of cooperative multitasking there is just one processing thread, so it cannot consume more than one core). So I wonder whether asynchronous execution is trying to achieve the same goal as parallel query execution but using a slightly different mechanism.

You wrote:
> in the meantime, any worker that arrives at that scan node has no choice but to block.

What's wrong with a worker being blocked? You can just have more workers (more than CPU cores) to let the others continue to do useful work. But I agree that

> Whether or not this will be efficient is a research question

> 2. vectorized execution

Vector IO is very important for a columnar store. In the IMCS extension (in-memory columnar store) using vector operations allows speed to be increased 10-100 times depending on the size of the data set and the query. Obviously the best results are for grand aggregates. But there are some researches, for example: http://www.vldb.org/pvldb/vol4/p539-neumann.pdf showing that the same or even better effect can be achieved by generating native code for the query execution plan (which is not so difficult now, thanks to LLVM). It eliminates interpretation overhead and increases cache locality. I got similar results with my own experiments on accelerating SparkSQL. Instead of native code generation I used conversion of query plans to C code and experimented with different data representations. A "horizontal model" with loading of columns on demand shows better performance than a columnar store. As far as I know a native code generator is currently being developed for PostgreSQL by ISP RAS. Sorry, slides in Russian: https://pgconf.ru/media/2016/02/19/6%20Мельник%20Дмитрий%20Михайлович,%2005-02-2016.pdf At this moment (February) they had implemented translation of only a few PostgreSQL operators used by ExecQuals and did not support aggregates. They got about a 2x increase of speed on synthetic queries and 25% on TPC-H Q1 (for Q1 the most critical part is generation of native code for aggregates, because ExecQual itself takes only 6% of the time for this query). Actually these 25% for Q1 were achieved not by using dynamic code generation, but by switching from the PULL to the PUSH model in the executor. It seems to be yet another interesting PostgreSQL executor transformation. As far as I know, they are going to publish the results of their work as open source...

On May 9, 2016, at 8:33 PM, Robert Haas wrote:

> Hi,
>
> I realize that we haven't gotten 9.6beta1 out the door yet, but I think we can't really wait much longer to start having at least some discussion of 9.7 topics, so I'm going to go ahead and put this one out there. I believe there are other people thinking about these topics as well, including Andres Freund, Kyotaro Horiguchi, and probably some folks at 2ndQuadrant (but I don't know exactly who). To make a long story short, I think there are several different areas where we should consider major upgrades to our executor. It's too slow and it doesn't do everything we want it to do. The main things on my mind are:
> 1. asynchronous execution, by which I mean the ability of a node to somehow say that it will generate a tuple eventually, but is not yet ready, so that the executor can go run some other part of the plan tree while it waits. This case most obviously arises for foreign tables, where it makes little sense to block on I/O if some other part of the query tree could benefit from the CPU; consider SELECT * FROM lt WHERE qual UNION SELECT * FROM ft WHERE qual. It is also a problem for parallel query: in a parallel sequential scan, the next worker can begin reading the next block even if the current block hasn't yet been received from the OS. Whether or not this will be efficient is a research question, but it can be done. However, imagine a parallel scan of a btree index: we don't know what page to scan next until we read the previous page and examine the next-pointer. In the meantime, any worker that arrives at that scan node has no choice but to block. It would be better if the scan node could instead say "hey, thanks for coming but I'm really not ready to be on-CPU just at the moment" and potentially allow the worker to go work in some other part of the query tree. For that worker to actually find useful work to do elsewhere, we'll probably need it to be the case either that the table is partitioned or the original query will need to involve UNION ALL, but those are not silly cases to worry about, particularly if we get native partitioning in 9.7.
>
> 2. vectorized execution, by which I mean the ability of a node to return tuples in
Re: [HACKERS] Batch update of indexes
On 04/02/2016 09:57 PM, Tom Lane wrote:

Konstantin Knizhnik <k.knizh...@postgrespro.ru> writes:
Attached please find patch for "ALTER INDEX ... WHERE ..." clause. It is now able to handle all three possible situations:
1. Making an index partial (add a WHERE condition to an ordinary index)
2. Extending the partial index range (less restrictive index predicate)
3. Arbitrary change of the partial index predicate

I've not been following this thread previously, but this proposal scares me quite a lot. I am certain there are places in our code that assume that the properties of an index don't change after it's been created. One area that this almost certainly breaks is HOT updates: adding a previously-unindexed column to an index predicate might break existing HOT chains, and I see nothing in this patch that could deal with that. I seem to recall there are other places that would be broken by changing an index's DDL definition after creation, but can't recall specifics right now. I am also, frankly, not seeing a use-case for this functionality that would justify trying to find and remove those assumptions. There's a lot of things I don't care for about the way the patch is written, in particular its willingness to use SPI (which opens a lot of potential for search-path problems, failure to see uncommitted tuples, etc). But we need not get to that if we don't believe the functionality can work.

Thank you for the review, Tom. I completely agree with all your arguments against this patch. I proposed this patch mostly as a proof of concept. Yes, I have not taken HOT updates into account, and maybe there are other possible issues which I have not considered. The main question is whether the proposed way of batch update of indexes is viable, or whether it is a conceptually wrong approach (because it breaks the assumption that index properties can't be changed, or because it is not convenient to use, ...). I hope everybody agrees that maintenance of indexes is the main limiting factor for insert speed. If a table has no indexes, then insert speed can be as high as disk write speed (100MB/sec, or 1000 for SSD). So if the size of a record is about 10 bytes, then we can get about 10 million TPS. But the presence of indexes dramatically changes this picture: if the database is large enough that even an index cannot fit in memory, and records are inserted in random key order, then each insert into the index will require reading 3-4 pages from random locations on the disk. With an average HDD positioning time of 10 msec, we get 100 reads per second and ... 20-30 TPS. And that is with just one index. If we have 10 indexes, then TPS can be less than the fingers on one hand. Certainly this is a very pessimistic estimate. But it is still true that we cannot provide good insert speed if we have to update indexes immediately. And without indexes we cannot efficiently execute most queries. I do not see any way in Postgres to solve this problem now. The hack with creating materialized views requires a lot of extra time and space; it will not work for really large tables. So we need some way to postpone insertion of new records into the index. Then we can do such insertion in the background or in idle time (at night), and try to use bulk insert if the index implementation supports it (for example, sorting records by key before insertion can significantly increase locality and so improve the speed of insertion into the index). But the principal point here is that such delayed update of an index violates the main RDBMS rule that the results of query execution with and without indexes should be the same.
The trick with partial indexes makes it possible to eliminate this contradiction (see the usage sketch after this message). But it requires more actions from the user. So are users ready to do some extra job just for "ideological" reasons? Because if a user wants delayed update of indexes, then he actually accepts that query results may not include some of the most recent updates. Another aspect is which database objects are allowed to be altered and which are not. Right now with tables we can alter almost everything. With indexes - almost nothing. It is assumed that an index can always be reconstructed. But for a very big table, reconstruction of indexes from scratch will take an unacceptable amount of time. So should we make it possible to alter those index characteristics which do not require the index to be rebuilt from scratch (and this is definitely true for extending a partial index predicate)? Or is the price of supporting it so high that it cannot be compensated by the obtained benefits? So what do you think?
1. Should I continue work in this direction and fix all possible issues with HOT updates, ... to make it possible to alter partial index predicates and support batch inserts in this way?
2. Or is it better to just add an extra option to the index, allowing it to be slightly out of sync? It would allow, for example, the pending list for GIN to be eliminated, which can cause very significant degradation of query speed, while for mos
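To illustrate the intended workflow, a sketch under the proposed (not committed) syntax; the table and the cut-off dates are purely illustrative:

create table events(id bigserial, ts timestamp, payload text);
-- index only the already-loaded part of the table:
create index events_ts_idx on events(ts) where ts < '2016-01-01';
-- new rows fall outside the predicate, so bulk inserts bypass the index
-- and proceed at nearly heap-write speed;
-- later, in background or at idle time, extend the predicate so that the
-- accumulated rows are added to the index in one batch (situation 2 above):
alter index events_ts_idx where ts < '2016-04-01';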
[HACKERS] Very small patch for decode.c
diff --git a/src/backend/replication/logical/decode.c b/src/backend/replication/logical/decode.c
index 2380ea2..a992662 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -488,7 +488,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer *buf,
 {
 	XLogRecPtr	origin_lsn = InvalidXLogRecPtr;
 	TimestampTz	commit_time = parsed->xact_time;
-	XLogRecPtr	origin_id = XLogRecGetOrigin(buf->record);
+	RepOriginId	origin_id = XLogRecGetOrigin(buf->record);
 	int			i;

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] avg,first,last,median in one query
Hi, hackers. I need advice from SQL experts: is there any way in PostgreSQL to calculate the avg, first, last and median aggregates in one query? Assume that we have the following table:

create table Securities ("Symbol" varchar, "Date" date, "Time" time, "Price" real);

We can simulate median using percentile_disc:

select "Symbol", "Date", avg("Price"),
       percentile_disc(0.5) within group (order by "Price")
from Securities
group by "Symbol", "Date";

And all the other aggregates can be calculated using window functions:

select distinct "Symbol", "Date",
       first_value("Price") over (partition by "Symbol","Date" order by "Time"
                                  rows between unbounded preceding and unbounded following),
       last_value("Price") over (partition by "Symbol","Date" order by "Time"
                                 rows between unbounded preceding and unbounded following),
       avg("Price") over (partition by "Symbol","Date"
                          rows between unbounded preceding and unbounded following)
from Securities;

I wonder, is there any simpler/more efficient alternative to the query above? But unfortunately it is not possible to calculate median this way, because percentile_disc is not compatible with OVER:

ERROR:  OVER is not supported for ordered-set aggregate percentile_disc

So is there any chance to calculate all these four aggregates in one query without writing some supplementary functions? An additional question: what is the most efficient way of calculating MEDIAN in PostgreSQL? I found three different approaches:

1. Using a CTE: https://www.periscopedata.com/blog/medians-in-sql.html
2. Using a user-defined aggregate function which uses array_append and so materializes all values in memory: https://wiki.postgresql.org/wiki/Aggregate_Median
3. Using a percentile aggregate: http://blog.jooq.org/2015/01/06/how-to-emulate-the-median-aggregate-function-using-inverse-distribution-functions/

Thanks in advance, -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
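One possible single-query answer (a sketch, not benchmarked; it relies on array_agg with ORDER BY, so, like approach 2 above, each group is materialized in memory):

select "Symbol", "Date",
       avg("Price"),
       (array_agg("Price" order by "Time"))[1]               as first,
       (array_agg("Price" order by "Time" desc))[1]          as last,
       percentile_disc(0.5) within group (order by "Price")  as median
from Securities
group by "Symbol", "Date";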
Re: [HACKERS] Applying logical replication changes by more than one process
On Mar 22, 2016, at 11:14 AM, Petr Jelinek wrote:
>
> And each slot means connection with logical decoding attached to it so you don't really want to have thousands of those anyway. I think you'll hit other problems faster than loop over slots becomes problem if you plan to keep all of them active.

Assume that a cluster has thousands of nodes and we use sharding to scatter data across the cluster nodes. But to provide HA we want to perform sharding with some level of redundancy, for example saving the same record at 3 different nodes. One possible approach (pg_shard) is to execute the same query at three different shards. But there is no guarantee that the result of execution will be the same at all nodes. An alternative approach is to execute a transaction at one node and then replicate it to replicas using logical replication. So we do not perform logical replication to all 1000 nodes - just to 2 of them. But each time it will be a different pair of nodes, so we still need to have 1000 active replication slots. Maybe logical replication cannot be used at all in such a scenario - I have not thought much about it yet. Our first step will be multimaster without sharding.

> -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Applying logical replication changes by more than one process
On Mar 22, 2016, at 10:10 AM, Craig Ringer wrote:

> On 22 March 2016 at 14:32, konstantin knizhnik <k.knizh...@postgrespro.ru> wrote:
>
>> Ah you mean because with wal_log=true the origin advance is in different WAL record than commit? OK yeah you might be one transaction behind then, true.
>
> It actually means that we cannot enforce database consistency. If we do replorigin_advance before commit and then a crash happens, we will lose some changes.
> If we call replorigin_advance after commit but a crash happens before it, then some changes can be applied multiple times. For example, we can insert some record twice (if there are no unique constraints).
> Looks like the only working scenario is to set up a replication session for each commit and use locking to prevent concurrent session setup for the same slot by multiple processes, doesn't it?
>
> Yes.
>
> How would you expect it to work if you attempted to replorigin_advance without a session? From multiple concurrent backends?

It would not work. But I wonder why I would need to call replorigin_advance without a session. Please excuse me, I am not thinking now about the general case of using logical replication, I am just focused on multimaster. What I need is some efficient, durable and atomic mechanism for applying changes. I see only two ways to provide atomicity:
1. Tracking of origins should be done inside the xact as part of a normal commit.
2. Use custom WAL records.
1) is easier now, and it really works if I correctly synchronize access to slots. And surprisingly it doesn't even add substantial overhead.

> Parallel apply is complicated business. You have to make sure you apply xacts in an order that's free from deadlocks and from insert/delete anomalies - though you can at least detect those, ERROR that xact and all subsequent ones, and retry.

Well, this is exactly what our multimaster does. We do not try to enforce the order of applying xacts. But we detect global deadlocks and use 2PC to provide data consistency. So it is not a task of logical replication; it is done by the DTM, which overrides visibility checks and the transaction commit protocol using XTM.

> For progress tracking to be consistent and correct you'd have to make sure you committed strictly in the same order as upstream. Just before each commit you can set the origin LSN and advance the replication origin, which will commit atomically along with the commit it confirms. I don't really see the problem.

Sorry, I do not completely understand you. What do you mean by "will commit atomically along with the commit it confirms"? How is this atomicity enforced?

I have tried it; fortunately it doesn't cause any noticeable performance degradation. But unfortunately I can't consider such an approach elegant. Why is it actually necessary to bind a replication slot to a process? Why is it not possible to have multiple concurrent sessions for the same slot?

> Especially since most slot changes are LWLock- and/or spinlock-protected already.
>
> The client would have to manage replay confirmations appropriately so that it doesn't confirm past the point where some other connection still needs it.
>
> We'd have to expose a "slot" column in pg_stat_replication and remove the "pid" column from pg_replication_slots to handle the 1:n relationship between slot clients and slots, and it'd be a pain to show which normal user backends were using a slot. Not really sure how to handle that.
>
> To actually make this useful would require a lot more though.
> A way to request that replay start from a new LSN without a full disconnect/reconnect each time. Client-side parallel consume/apply. Inter-transaction ordering information so the client can work out a viable xact apply order (possibly using SSI information per the discussion with Kevin?). Etc.
>
> I haven't really looked into this and I suspect there are some hairy areas involved in replaying a slot from more than one client. The reason I'm interested in it personally is for initial replica state setup as Oleksandr prototyped and described earlier. We could attach to the slot's initial snapshot then issue a new replication command that, given a table name or oid, scans the table from the snapshot and passes each tuple to a new callback (like, but not the same as, the insert callback) on the output plugin.
>
> That way clients could parallel-copy the initial state of the DB across the same replication protocol they then consume new changes from, with no need to make normal libpq connections and COPY initial state.
>
> I'm interested in being able to do parallel receive of new changes from the slot
Re: [HACKERS] Applying logical replication changes by more than one process
On Mar 21, 2016, at 4:30 PM, Petr Jelinek wrote:

> On 21/03/16 14:25, Andres Freund wrote:
>> On 2016-03-21 14:18:27 +0100, Petr Jelinek wrote:
>>> On 21/03/16 14:15, Andres Freund wrote:
> Only when the origin is actually setup for the current session. You need to call the replorigin_advance yourself from your apply code.
That's problematic from a durability POV.
>>>
>>> Huh? How come?
>>
>> If you use the session mechanism the replication progress is synced with the apply process, even if there are crashes. Crash recovery updates the progress. There's no such interlock with apply otherwise, and I don't see how you can build one with reasonable effort.
>
> Ah you mean because with wal_log=true the origin advance is in different WAL record than commit? OK yeah you might be one transaction behind then, true.

It actually means that we cannot enforce database consistency. If we do replorigin_advance before commit and then a crash happens, we will lose some changes. If we call replorigin_advance after commit but a crash happens before it, then some changes can be applied multiple times. For example, we can insert some record twice (if there are no unique constraints). Looks like the only working scenario is to set up a replication session for each commit and use locking to prevent concurrent session setup for the same slot by multiple processes, doesn't it? (A sketch of this per-commit discipline, expressed with the SQL-level origin functions, follows below.) I have tried it; fortunately it doesn't cause any noticeable performance degradation. But unfortunately I can't consider such an approach elegant. Why is it actually necessary to bind a replication slot to a process? Why is it not possible to have multiple concurrent sessions for the same slot? Also I am concerned about using sequential search for slot location in replorigin_session_setup and many other functions - there is a loop through all max_replication_slots. It seems to be not a problem when the number of slots is less than 10. For multimaster this assumption is true - even Oracle RAC rarely has a two-digit number of nodes. But if we want to perform sharding and use logical replication for providing redundancy, then the number of nodes and slots can be substantially larger. I have not thought much about such a configuration - maybe it is possible to propose a more efficient mechanism for replication in this case.

> -- Petr Jelinek http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
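For illustration, the per-commit discipline discussed above looks roughly like this at the SQL level, using the built-in replication origin functions (a sketch only - multimaster does the equivalent through the C API, and the origin name and LSN here are made up):

select pg_replication_origin_create('mm_node_2');          -- once per peer node
select pg_replication_origin_session_setup('mm_node_2');   -- binds the origin to this process
begin;
-- ... apply the changes received from the peer ...
select pg_replication_origin_xact_setup('0/AABBCC00', now());  -- remote commit LSN and time
commit;  -- origin progress is recorded atomically with the commit
select pg_replication_origin_session_reset();              -- release the origin for other appliers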
Re: [HACKERS] Applying logical replication changes by more than one process
On 21.03.2016 15:10, Petr Jelinek wrote:

Hi,

On 19/03/16 11:46, Konstantin Knizhnik wrote:
Hi, I am trying to use the logical replication mechanism in the implementation of PostgreSQL multimaster and have run into one conceptual problem. Originally logical replication was intended to support asynchronous replication. In this case applying changes by a single process should not be a bottleneck. But if we are using a distributed transaction manager to provide global consistency, then applying a transaction by one worker leads to very bad performance and, what is worse, causes unintended serialization of transactions, which is not taken into account by the distributed deadlock detection algorithm and so can cause undetected deadlocks. So I have implemented a pool of background workers which can apply transactions concurrently. It works and shows acceptable performance. But now I am thinking about HA and tracking origin LSNs, which are needed to correctly specify the slot position in case of recovery. And there is a problem: as far as I understand, to correctly record the origin LSN in WAL and advance the slot position it is necessary to set up a session using replorigin_session_setup. It is not so convenient in the case of using a pool of background workers, because we have to set up a session for each commit. But the main problem is that each slot session can be associated with only one process:

else if (curstate->acquired_by != 0)
{
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_IN_USE),
             errmsg("replication identifier %d is already active for PID %d",
                    curstate->roident, curstate->acquired_by)));
}

Which once again means that there can be only one process applying changes.

That's not true, all it means is that you can do replorigin_session_setup for same origin only in one process but you don't need to have it setup for session to update it, the replorigin_advance() works just fine.

But RecordTransactionCommit uses replorigin_session_advance, not replorigin_advance. And replorigin_session_advance requires that the session has been set up:

void
replorigin_session_advance(XLogRecPtr remote_commit, XLogRecPtr local_commit)
{
    Assert(session_replication_state != NULL);
}

"session_replication_state" is a private variable which is set by replorigin_session_setup. But an attempt to call replorigin_session_setup from multiple processes causes the above error.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
[HACKERS] Applying logical replication changes by more than one process
Hi, I am trying to use the logical replication mechanism in the implementation of PostgreSQL multimaster and have run into one conceptual problem. Originally logical replication was intended to support asynchronous replication. In this case applying changes by a single process should not be a bottleneck. But if we are using a distributed transaction manager to provide global consistency, then applying a transaction by one worker leads to very bad performance and, what is worse, causes unintended serialization of transactions, which is not taken into account by the distributed deadlock detection algorithm and so can cause undetected deadlocks. So I have implemented a pool of background workers which can apply transactions concurrently. It works and shows acceptable performance. But now I am thinking about HA and tracking origin LSNs, which are needed to correctly specify the slot position in case of recovery. And there is a problem: as far as I understand, to correctly record the origin LSN in WAL and advance the slot position it is necessary to set up a session using replorigin_session_setup. It is not so convenient in the case of using a pool of background workers, because we have to set up a session for each commit. But the main problem is that each slot session can be associated with only one process:

else if (curstate->acquired_by != 0)
{
    ereport(ERROR,
            (errcode(ERRCODE_OBJECT_IN_USE),
             errmsg("replication identifier %d is already active for PID %d",
                    curstate->roident, curstate->acquired_by)));
}

Which once again means that there can be only one process applying changes. To provide a correct state of the replication node it is necessary to enforce that each logical replication record is replayed exactly once: we should not lose some change or try to apply it twice. So the operation of recording the origin LSN position in WAL and adjusting the slot should be atomic. And during recovery we should restore the slot's current position based on the origin values extracted from WAL. I wonder if this can be done using the current logical replication mechanism when the changes of each slot are applied by more than one process? Or is the only alternative to write/read origin LSNs in WAL myself, for example using custom WAL records? Thanks in advance! -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Batch update of indexes
Hi David, The rebased patch is attached.

On 14.03.2016 15:09, David Steele wrote:

Hi Konstantin,

On 2/3/16 11:47 AM, Konstantin Knizhnik wrote:
Attached please find patch for "ALTER INDEX ... WHERE ..." clause. It is now able to handle all three possible situations:
1. Making an index partial (add a WHERE condition to an ordinary index)
2. Extending the partial index range (less restrictive index predicate)
3. Arbitrary change of the partial index predicate
In case 2) new records are added to the index. In the other two cases the index is completely reconstructed. This patch includes the src/bin/insbench utility for testing insert performance. It can easily be excluded from the patch to reduce its size. Also it is better to apply this patch together with the "index-only scans with partial indexes" patch:

This patch no longer applies on master:

$ git apply ../other/alter-index.patch
../other/alter-index.patch:27: trailing whitespace.
static void
../other/alter-index.patch:62: indent with spaces.
    SPIPlanPtr plan;
../other/alter-index.patch:63: indent with spaces.
    Portal portal;
../other/alter-index.patch:90: trailing whitespace.
../other/alter-index.patch:99: trailing whitespace.
error: patch failed: src/test/regress/expected/aggregates.out:831
error: src/test/regress/expected/aggregates.out: patch does not apply

Please provide a new patch for review. Meanwhile I am marking this "waiting on author".

Thanks,

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/doc/src/sgml/ref/alter_index.sgml b/doc/src/sgml/ref/alter_index.sgml
index ee3e3de..ab042ed 100644
--- a/doc/src/sgml/ref/alter_index.sgml
+++ b/doc/src/sgml/ref/alter_index.sgml
@@ -27,6 +27,7 @@ ALTER INDEX [ IF EXISTS ] name SET
 ALTER INDEX [ IF EXISTS ] name RESET ( storage_parameter [, ... ] )
 ALTER INDEX ALL IN TABLESPACE name [ OWNED BY role_name [, ... ] ] SET TABLESPACE new_tablespace [ NOWAIT ]
+ALTER INDEX name WHERE predicate
@@ -109,6 +110,18 @@ ALTER INDEX ALL IN TABLESPACE name
+
+WHERE predicate
+
+
+ Add or change predicate of partial index. Extending partial index predicate allows to implement batch update of index and so
+ increase insert speed. New records (not matching index predicate) can be added to the table at maximal speed without affecting indexes.
+ Later, in background, indexes can be refreshed using ALTER INDEX ... WHERE ... clause.
+ See for more discussion.
+
+
+
+
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 13b04e6..a63de2a 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -34,6 +34,7 @@
 #include "commands/tablespace.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "funcapi.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/planner.h"
@@ -52,6 +53,9 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
+#include "utils/ruleutils.h"
+#include "executor/executor.h"
+#include "executor/spi.h"

 /* non-export function prototypes */
@@ -280,6 +284,160 @@ CheckIndexCompatible(Oid oldId,
 	return ret;
 }

+static void
+UpdateIndex(Oid indexRelationId, Node* whereClause)
+{
+	Datum		values[Natts_pg_index];
+	bool		isnull[Natts_pg_index];
+	HeapTuple	oldTuple;
+	HeapTuple	newTuple;
+	Relation	pg_index;
+
+	pg_index = heap_open(IndexRelationId, RowExclusiveLock);
+	oldTuple = SearchSysCacheCopy1(INDEXRELID, ObjectIdGetDatum(indexRelationId));
+	if (!HeapTupleIsValid(oldTuple))
+		elog(ERROR, "cache lookup failed for index %u", indexRelationId);
+
+	heap_deform_tuple(oldTuple, RelationGetDescr(pg_index), values, isnull);
+	values[Anum_pg_index_indpred - 1] = CStringGetTextDatum(nodeToString(whereClause));
+	isnull[Anum_pg_index_indpred - 1] = false;
+	newTuple = heap_form_tuple(RelationGetDescr(pg_index), values, isnull);
+	simple_heap_update(pg_index, &oldTuple->t_self, newTuple);
+	CatalogUpdateIndexes(pg_index, newTuple);
+	heap_freetuple(newTuple);
+	heap_freetuple(oldTuple);
+	heap_close(pg_index, NoLock);
+}
+
+void
+AlterIndex(Oid indexRelationId, IndexStmt *stmt)
+{
+	char*		select;
+	Oid			heapRelationId;
+	IndexUniqueCheck checkUnique;
+	Datum		values[INDEX_MAX_KEYS];
+	bool		isnull[INDEX_MAX_KEYS];
+	Relation	heapRelation;
+	Relation	indexRelation;
+	SPIPlanPtr	plan;
+	Portal		portal;
+	HeapTuple	tuple;
+	TupleTableSlot *slot;
+	ItemPointer	tupleid;
+	IndexInfo  *indexInfo;
+	EState	   *estate;
+	Oid			namespaceId;
+	List*		deparseCtx;
+	char*		oldIndexPredicate;
+	char*		newIndexPredicate;
+	char*		relationName;
+
+	Assert(stmt->whereClause);
+	CheckPredicate((Expr *) stmt->whereClause);
+
+	/* Open and lock the pare
Re: [HACKERS] eXtensible Transaction Manager API (v2)
ter, so we have to introduce an "in-doubt" state of transaction, when it is not known whether the transaction is committed or aborted, and any other transaction accessing tuples updated by this transaction has to wait while its status is "in-doubt". The main challenge of pg_tsdtm is to make this period as short as possible... But these are details of a particular implementation which IMHO have no relation to the API itself.

3. Uh, how can you hook GetNewTransactionId but not ReadNewTransactionId?

Uh-uh-uh:) ReadNewTransactionId is just reading the value of ShmemVariableCache->nextXid, but unfortunately it is not the only point where nextXid is used - there are about a hundred occurrences of nextXid in the Postgres core. This is why we made the decision that GetNewTransactionId should actually update ShmemVariableCache->nextXid, so that there is no need to rewrite all this code. Sorry, but IMHO it is a problem of Postgres design and not of XTM;) We just want to find some compromise which allows XTM to be flexible enough but minimizes changes in the core code.

4. There seems to be an intention to encapsulate snapshots, but surely wrapping hooks around GetSnapshotData and XidInMVCCSnapshot is not nearly enough for that. Look at all the knowledge snapmgr.c has about snapshot representation, for example. And is a function like GetOldestXmin even meaningful with a different notion of what snapshots are? (For that matter, is TransactionId == uint32 still tenable for any other notion of snapshots?)

The XTM encapsulation of snapshots allows us to implement pg_dtm. It does almost the same as the Postgres-XL GTM, but without a huge amount of #ifdefs. Representation of XIDs is yet another compromise point: we do not want to change the tuple header format, so the XID is still 32 bits and has the same meaning as in PostgreSQL. If a custom implementation of a TM wants to use some other identifiers of transactions, like the CSN in pg_tsdtm, it has to provide a mapping between them and XIDs.

5. BTW, why would you hook at XidInMVCCSnapshot rather than making use of the existing capability to have a custom SnapshotSatisfiesFunc snapshot checker function?

The HeapTupleSatisfies* routines in utils/time/tqual.c implement a lot of the logic of handling different kinds of snapshots, checking/setting hint bits in tuples, caching, ... We do not want to replace or just cut all this code in a DTM implementation. And XidInMVCCSnapshot is the common function finally used by most HeapTupleSatisfies* functions when all other checks have passed. So it is really the most convenient place to plug in custom visibility checking rules. And as far as I remember a similar approach was used in Postgres-XL.

IMO this is not committable as-is, and I don't think that it's something that will become committable during this 'fest. I think we'd be well advised to boot it to the 2016-09 CF and focus our efforts on other stuff that has a better chance of getting finished this month.

regards, tom lane

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Optimizer questions
On Mar 10, 2016, at 1:56 AM, Tom Lane wrote:
> Konstantin Knizhnik <k.knizh...@postgrespro.ru> writes:
>> I think that the best approach is to generate two different paths: the original one, when projection is always done before sort, and another one with postponed projection of non-trivial columns. Then we compare the costs of the two paths and choose the best one. Unfortunately, I do not understand now how to implement it with the existing grouping_planner. Do you think that it is possible?
>
> After fooling with this awhile, I don't think it's actually necessary to do that. See attached proof-of-concept patch.

O, you did all my work:)

But right now the rule for cost estimation makes it impossible to apply this optimization to simple expressions like this:

postgres=# create table a (b integer);
CREATE TABLE
postgres=# insert into a values (generate_series(1, 10));
INSERT 0 10
postgres=# select b+b from a order by b limit 1;
NOTICE: int4pl
NOTICE: int4pl
NOTICE: int4pl
NOTICE: int4pl
NOTICE: int4pl
NOTICE: int4pl
NOTICE: int4pl
NOTICE: int4pl
NOTICE: int4pl
NOTICE: int4pl
 ?column?
----------
        2
(1 row)

postgres=# create or replace function twice(x integer) returns integer as $$
begin
    raise notice 'exec function';
    return x + x;
end $$ language plpgsql;
CREATE FUNCTION
postgres=# select twice(b) from a order by b limit 1;
NOTICE: exec function
NOTICE: int4pl
 twice
-------
     2
(1 row)

I wonder whether there are any advantages to earlier evaluation of such simple expressions if they are not needed for the sort? It seems to be true only for narrowing functions, like md5... But I think that is quite an exotic case, isn't it? Maybe it is reasonable to be more optimistic when estimating the cost of postponed expression evaluation?

Also I do not completely understand your concern about window functions. Is there any example of a query with window or aggregate functions where this optimization (postponing expression evaluation) can be applied? It will also be interesting to me to know if there are some other cases (except SORT+LIMIT) where delaying projection leads to a more efficient plan.

> Although this patch gets through our regression tests, that's only because it's conservative about deciding to postpone evaluation; if it tried to postpone evaluation in a query with window functions, it would fail miserably, because pull_var_clause doesn't know about window functions. I think that that's a clear oversight and we should extend it to offer the same sorts of behaviors as it does for Aggrefs. But that would be slightly invasive, there being a dozen or so callers; so I didn't bother to do it yet pending comments on this patch.
>
> I think it's probably also broken for SRFs in the tlist; we need to work out what semantics we want for those. If we postpone any SRF to after the Sort, we can no longer assume that a query LIMIT enables use of bounded sort (because the SRF might repeatedly return zero rows). I don't have a huge problem with that, but I think now would be a good time to nail down some semantics.
>
> Some other work that would be useful would be to refactor so that the cost_qual_eval operations aren't so redundant. But that's just a small time savings not a question of functionality.
>
> And we'd have to document that this changes the behavior for volatile functions. For the better, I think: this will mean that you get consistent results no matter whether the query is implemented by indexscan or seqscan-and-sort, which has never been true before. But it is a change.
> Do people approve of this sort of change in general, or this patch approach in particular? Want to bikeshed the specific when-to-postpone-eval policies implemented here? Other comments?
>
> regards, tom lane
>
> diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
> index 8937e71..b15fca1 100644
> *** a/src/backend/optimizer/plan/planner.c
> --- b/src/backend/optimizer/plan/planner.c
> *** static RelOptInfo *create_distinct_paths
> *** 126,131
> --- 126,132
>   				RelOptInfo *input_rel);
>   static RelOptInfo *create_ordered_paths(PlannerInfo *root,
>   					RelOptInfo *input_rel,
> + 					PathTarget *target,
>   					double limit_tuples);
>   static PathTarget *make_group_input_target(PlannerInfo *root, List *tlist);
>   static List *postprocess_setop_tlist(List *new_tlist, List *orig_tlist);
> *** static PathTarget *make_window_input_tar
> *** 134,139
> --- 135,142
Re: [HACKERS] Optimizer questions
On 09.03.2016 09:15, Tom Lane wrote:
> I wrote: BTW, there's some additional refactoring I had had in mind to do in grouping_planner to make its handling of the targetlist a bit more organized; in particular, I'd like to see it using PathTarget representation more consistently throughout the post-scan-join steps.
>
> See 51c0f63e4d76a86b44e87876a6addcfffb01ec28 --- I think this gets things to where we could plug in additional levels of targets without too much complication.
>
> regards, tom lane

So, if I correctly understand you, there are two major concerns:

1. Volatile functions. I wonder if it is really required to re-evaluate a volatile function for each record even if a LIMIT clause is present? The documentation says: "A query using a volatile function will re-evaluate the function at every row where its value is needed." So if we are using sort with limit and the value of the function is not used in the sort, then it is correct to say that the value of this function is not needed, so there is no need to re-evaluate it, isn't it?

2. Narrowing functions, like md5. Here I do not have any good idea how to support them. It looks like the cost of SORT should depend on tuple width: only in this case can the optimizer determine whether it is more efficient to evaluate a function earlier or to postpone its execution.

I think that the best approach is to generate two different paths: the original one, where projection is always done before the sort, and another one with postponed projection of non-trivial columns. Then we compare the costs of the two paths and choose the best one. Unfortunately, I do not understand now how to implement it with the existing grouping_planner. Do you think that it is possible?

An alternative approach is to do something like in my proposed patch, but take into account the cost of function execution and check for the presence of volatile/narrowing functions. This approach provides better flexibility, because we can choose the subset of columns not used in the sort whose evaluation should be postponed. But here we once again make a local decision while constructing the path, instead of comparing the costs of full paths.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
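To make concern (1) concrete, here is a small illustration of my own (not from the thread) where the early-vs-postponed choice is user-visible: with projection evaluated before the sort, a volatile function fires once per input row; postponed past sort+limit, it would fire only for the returned row.

create table t (k integer);
insert into t select generate_series(1, 1000);
create sequence s;

-- With projection before the sort, nextval() fires 1000 times and the
-- sequence advances by 1000 even though only one row is returned.
select nextval('s') as n, k from t order by k limit 1;

-- If evaluation were postponed past sort+limit, the sequence would have
-- advanced by just 1; this is exactly the semantic change being debated.
select last_value from s;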
Re: [HACKERS] Optimizer questions
On 03/08/2016 07:01 AM, Tom Lane wrote:
> Konstantin Knizhnik <k.knizh...@postgrespro.ru> writes: Attached please find improved version of the optimizer patch for LIMIT clause.
>
> This patch isn't anywhere close to working after 3fc6e2d7f5b652b4. (TBH, the reason I was negative about this upthread is that I had that one in the oven and knew it would conflict spectacularly.) I encourage you to think about how an optimization of this sort could be made to work in a non-kluge fashion in the new code structure.
>
> I've not spent a lot of time on this, but I think maybe what would make sense is to consider both the case where function calculations are postponed to after ORDER BY and the case where they aren't, and generate Paths for both. Neither approach is a slam-dunk win. For example, suppose that one of the tlist columns is md5(wide_column) --- it will likely not be preferable to pass the wide column data through the sort step rather than reducing it to a hash first. This would require some work in grouping_planner to track two possible pathtargets, and work in create_ordered_paths to generate paths for both possibilities. A possible objection is that this would add planning work even when no real benefit is possible; so maybe we should only consider the new way if the tlist has significant eval cost? Not sure about that. There is also something to be said for the idea that we should try to guarantee consistent semantics when the tlist contains volatile functions.
>
> For now, I've set this commitfest entry to Waiting on Author. There's still time to consider a rewrite in this 'fest, if you can get it done in a week or two.
>
> regards, tom lane

Attached please find the rebased patch. Unfortunately 3fc6e2d7f5b652b4 still doesn't fix the problem with "lazy" evaluation of the target list; this is why my patch is still useful. But frankly speaking I am not sure that it is the best way of fixing this problem, because it takes into account only one case: sort+limit. Maybe the same optimization can be useful for other queries.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5fc8e5b..709d1ad 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -126,8 +126,9 @@ static RelOptInfo *create_ordered_paths(PlannerInfo *root,
 					RelOptInfo *input_rel,
 					double limit_tuples);
 static PathTarget *make_scanjoin_target(PlannerInfo *root, List *tlist,
-					AttrNumber **groupColIdx);
+					AttrNumber **groupColIdx, bool *splitted_projection);
 static int get_grouping_column_index(Query *parse, TargetEntry *tle);
+static int get_sort_column_index(Query *parse, TargetEntry *tle);
 static List *postprocess_setop_tlist(List *new_tlist, List *orig_tlist);
 static List *select_active_windows(PlannerInfo *root, WindowFuncLists *wflists);
 static List *make_windowInputTargetList(PlannerInfo *root,
@@ -1381,6 +1382,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 	RelOptInfo *current_rel;
 	RelOptInfo *final_rel;
 	ListCell   *lc;
+	bool		splitted_projection = false;

 	/* Tweak caller-supplied tuple_fraction if have LIMIT/OFFSET */
 	if (parse->limitCount || parse->limitOffset)
@@ -1657,7 +1659,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 	 * that were obtained within query_planner().
 	 */
 	sub_target = make_scanjoin_target(root, tlist,
-									  &groupColIdx);
+									  &groupColIdx, &splitted_projection);

 	/*
 	 * Forcibly apply that tlist to all the Paths for the scan/join rel.
@@ -1801,6 +1803,13 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 	{
 		Path	   *path = (Path *) lfirst(lc);

+		if (splitted_projection)
+		{
+			path = apply_projection_to_path(root, current_rel,
+											path, create_pathtarget(root, tlist));
+		}
+
 		/*
 		 * If there is a FOR [KEY] UPDATE/SHARE clause, add the LockRows node.
 		 * (Note: we intentionally test parse->rowMarks not root->rowMarks
@@ -3775,15 +3784,17 @@ create_ordered_paths(PlannerInfo *root,
 static PathTarget *
 make_scanjoin_target(PlannerInfo *root, List *tlist,
-					 AttrNumber **groupColIdx)
+					 AttrNumber **groupColIdx,
+					 bool *splitted_projection)
 {
 	Query	   *parse = root->parse;
-	List	   *sub_tlist;
-	List	   *non_group_cols;
+	List	   *sub_tlist = NIL;
+	List	   *non_group_cols = NIL;
 	List	   *non_group_vars;
 	int			numCols;

 	*groupColIdx = NULL;
+	*splitted_projection = false;

 	/*
 	 * If we're not grouping or aggregating or windowing, there's nothing to
@@ -3791,14 +3802,66 @@ make_scanjoin_target(PlannerInfo *root,
 	 */
 	if (!parse->hasAggs && !parse->groupClause && !parse->groupingSets &&
 		!root->hasHavingQual && !parse->hasWindowFuncs)
+	{
+		if (parse->sortClause && l
Re: [HACKERS] The plan for FDW-based sharding
is especially true for DTM, because the requirements of various cluster solutions are very different. And the most convenient way of doing it is to ship DTM as an extension, not as some fork of Postgres. That will greatly simplify using it.

Now, on connection pooling, I am similarly not opposed to having some well-designed hooks, but I also think in the long run it would be better for some improvements in this area to be part of core. None of that means I would support any particular hook proposal, of course.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] The plan for FDW-based sharding
On 01.03.2016 22:02, Bruce Momjian wrote:
On Tue, Mar 1, 2016 at 07:56:58PM +0100, Petr Jelinek wrote: Note that I am not saying that other discussed approaches are any better, I am saying that we should know approximately what we actually want and not just beat FDWs with a hammer and hope sharding will eventually emerge and call that the plan.

I will say it again --- FDWs are the only sharding method I can think of that has a chance of being accepted into Postgres core. It is a plan, and if it fails, it fails. If it succeeds, that's good. What more do you want me to say? I know of no other way to answer the questions you asked above.

I do not understand why it would fail. The FDW approach may be not flexible enough for building optimal distributed query execution plans for complex OLAP queries, but for simple queries it should work fine. Simple queries correspond to OLTP and simple OLAP. For OLTP we definitely need a transaction manager to provide global consistency, and we actually have a prototype of integration of postgres_fdw with our pg_dtm and pg_tsdtm transaction managers. The results are IMHO quite promising (see attached diagram).

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

[Attachment: DTM-pgconf.pdf]
Re: [HACKERS] PROPOSAL: Fast temporary tables
As far as I know we are trying to kill two birds with one stone:
1. Reduce the overhead of accessing temporary tables.
2. Make it possible to create temporary tables on a replica.

Replicas with hot standby are widely used for running read-only OLAP queries. But such queries usually store intermediate results in temporary tables, and unfortunately creating a temporary table at a read-only replica is impossible now. So some customers do the following trick: they create a pool of file FDWs at the master and then use them at the replicas. But IMHO it is an ugly and inefficient hack.

Ideally we should be able to create temporary tables at a replica without affecting the system catalog. But there are a lot of problems: where they should be stored, how to assign XIDs to the tuples inserted into a temporary table, ... Unfortunately, it looks like there is no simple solution to the problem. The 100% solution is multimaster (which we are currently developing), but that is a completely different story...

On 03/01/2016 10:17 PM, Jim Nasby wrote:
On 3/1/16 10:05 AM, Atri Sharma wrote: Fair point, that means inventing a whole new OID generation structure..

Generation is just the tip of the iceberg. You still need the equivalent to foreign keys (ie: pg_depend). While you would never have a permanent object depend on a temp object, the reverse certainly needs to be supported.

If I were attempting to solve this at a SQL level, I'd be thinking about using table inheritance such that the permanent objects are stored in a permanent parent. New backends would create UNLOGGED children off of that parent. There would be a pid column that was always NULL in the parent, but populated in children. That means children could use their own local form of an OID. When a backend terminates you'd just truncate all its tables. Actually translating that into relcache and everything else would be a serious amount of work.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
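A rough SQL sketch of the inheritance trick Jim describes (my own illustration of his idea, with hypothetical names; the hard part - relcache and pg_depend integration - is exactly what this sketch glosses over):

-- Permanent parent: holds the schema, never holds rows itself.
create table scratch_parent (
    pid     integer,          -- owning backend's pid, NULL in the parent
    payload jsonb
);

-- Each backend creates its own UNLOGGED child keyed by its pid.
create unlogged table scratch_12345 (
    check (pid = 12345)
) inherits (scratch_parent);

insert into scratch_12345 values (12345, '{"step": 1}');

-- On backend exit, its child table is simply truncated (or dropped).
truncate scratch_12345;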
Re: [HACKERS] The plan for FDW-based sharding
On 03/01/2016 09:19 PM, Petr Jelinek wrote: Since this thread heavily discusses the XTM, I have a question about the XTM as proposed, because one thing is very unclear to me - what happens when the user changes the XTM plugin on the server? I didn't see any xid handover API, which makes me wonder if changes of a plugin (or for example failure to load a previously used plugin due to admin error) will send the server to a situation similar to xid wraparound.

The transaction manager is a very "intimate" part of a DBMS, and certainly bugs and problems in a custom TM implementation can break the server. So if you are providing a custom TM implementation, you should take full responsibility for system integrity.

The XTM API itself doesn't enforce any XID handling policy. As far as we do not want to change the tuple header format, the XID is still a 32-bit integer. In the case of pg_dtm, global transactions at all nodes are assigned the same XID by the arbiter, and the arbiter handles XID wraparound. In pg_tsdtm each node maintains its own XIDs; pg_tsdtm doesn't actually change the way XIDs are assigned by Postgres, so wraparound in this case is handled in the standard way. Instead of assigning its own global XIDs, pg_tsdtm provides a mapping between local XIDs and global CSNs. The visibility checking rules look at CSNs, not at XIDs.

In both cases, if the system is for some reason restarted and the DTM plugin fails to load, you can still access the database locally. No data can be lost.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] The plan for FDW-based sharding
On 01.03.2016 19:03, Robert Haas wrote:
On Tue, Mar 1, 2016 at 10:37 AM, Bruce Momjian <br...@momjian.us> wrote:
On Tue, Mar 1, 2016 at 10:19:45AM -0500, Robert Haas wrote: Two reasons: 1. There is no ideal implementation of DTM which will fit all possible needs and be efficient for all clusters. Hmm, what is the reasoning behind that statement? I mean, it is certainly true that there are some places where we have decided that one-size-fits-all is not the right approach. Indexing, for example.

Uh, is that even true of indexing? While the plug-in nature of indexing allows for easier development and testing, does anyone create plug-in indexing that isn't shipped by us? I thought WAL support was something that prevented external indexing solutions from working.

True. There is an API, though, and having pluggable WAL support seems desirable too. At the same time, I don't think we know of anyone maintaining a non-core index AM ... and there are probably good reasons for that. We end up revising the index AM API pretty regularly every time somebody wants to do something new, so it's not really a stable API that extensions can just tap into. I suspect that a transaction manager API would end up similarly situated.

IMHO a non-stable API is better than no API at all, just because it makes it possible to implement features in a modular way. And refactoring an API is not such a difficult thing...

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] The plan for FDW-based sharding
Thank you very much for your comments.

On 01.03.2016 18:19, Robert Haas wrote:
On Sat, Feb 27, 2016 at 2:29 AM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote: How do you prevent clock skew from causing serialization anomalies? If a node receives a message from the "future", it just needs to wait until this future arrives. Practically we just "adjust" the system time in this case, moving it forward (certainly the system time is not actually changed, we just set a correction value which needs to be added to the system time). This approach was discussed in the article: http://research.microsoft.com/en-us/people/samehe/clocksi.srds2013.pdf I hope the algorithm is explained in this article much better than I can do here.

Hmm, the approach in that article is very interesting, but it sounds different than what you are describing - they do not, AFAICT, have anything like a "correction value"

In the article they use the notion of "wait": if T.SnapshotTime > GetClockTime() then wait until T.SnapshotTime < GetClockTime(). Originally we really did sleep here, but then we realized that instead of sleeping we can just adjust the local time. Sorry, I do not have a formal proof that it is equivalent, but... at least we have not encountered any inconsistencies after this fix, and performance is improved.

There are well-known limitations of pg_tsdtm which we will try to address in the future.

How well known are those limitations? Are they documented somewhere? Or are they only well-known to you?

Sorry, well known to us. But they are described at the DTM wiki page. Right now pg_tsdtm does not support correct distributed deadlock detection (it does not build a global lock graph) and detects distributed deadlocks just based on timeouts. It doesn't support explicit locks, but "select for update" will work correctly.

What we want is to include the XTM API in PostgreSQL to be able to continue our experiments with different transaction managers and implement multimaster on top of it (our first practical goal) without affecting the PostgreSQL core. If the XTM patch is included in 9.6, then we can propose our multimaster as a PostgreSQL extension and everybody can use it. Otherwise we have to propose our own fork of Postgres, which significantly complicates using and maintaining it.

Well I still think what I said before is valid. If the code is good, let it be a core submission. If it's not ready yet, submit it to core when it is. If it can't be made good, forget it.

I have nothing against committing DTM code into core. But still the best way of integrating it is an OO-like approach, so we still need an API. Inserting if-s or switches into existing code is IMHO an ugly idea. Also it is not enough for the DTM code to be just "good". It should provide the expected functionality. But which functionality is expected? From my experience of developing different cluster solutions I can say that different customers have very different requirements. It is very hard, if ever possible, to satisfy them all. Right now I do not feel that I can predict all possible requirements for a DTM. This is why we want to provide some API, propose some implementations of this API, receive feedback and get a better understanding of which functionality is actually needed by customers.

This seems rather defeatist. If the code is good and reliable, why should it not be committed to core?

Two reasons: 1. There is no ideal implementation of DTM which will fit all possible needs and be efficient for all clusters.

Hmm, what is the reasoning behind that statement?
I mean, it is certainly true that there are some places where we have decided that one-size-fits-all is not the right approach. Indexing, for example. But there are many other places where we have not chosen to make things pluggable, and I don't think it should be taken for granted that pluggability is always an advantage. I fear that building a DTM that is fully reliable and also well-performing is going to be really hard, and I think it would be far better to have one such DTM that is 100% reliable than two or more implementations each of which are 99% reliable.

The question is not about its reliability, but mostly about its functionality and flexibility.

2. Even if such an implementation exists, still the right way to integrate it is for Postgres to use some kind of TM API.

Sure, APIs are generally good, but that doesn't mean *this* API is good.

Well, I do not want to say "better than nothing", but I find this API to be a reasonable compromise between flexibility and minimization of changes in the PostgreSQL core. If you have some suggestions how to improve it, I will be glad to receive them. I hope that everybody will agree that doing it in this way:

#ifdef PGXC
    /* In Postgres-XC, stop timestamp has to follow the timeline of GTM */
    xlrec.xact_time = xactStopTimestamp + GTMdeltaTimestamp;
#else
    xlrec.xact_time = xactStopTimestamp;
#endif

is a very, very bad idea.
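The clock-correction trick described in this exchange can be sketched in a few lines of C (my own illustration with hypothetical names, not the actual pg_tsdtm code): instead of sleeping until the wall clock catches up with a timestamp received from the future, the node keeps a non-negative correction offset and bumps it forward.

#include <stdint.h>
#include <time.h>

/* correction added to the real clock; it only ever grows */
static int64_t clock_correction_us = 0;

static int64_t raw_clock_us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);
    return (int64_t) ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

/* logical time used for CSNs: real clock plus correction */
static int64_t logical_clock_us(void)
{
    return raw_clock_us() + clock_correction_us;
}

/*
 * On receiving a timestamp from the "future", move the logical clock
 * forward instead of sleeping until the wall clock catches up.
 */
static void observe_remote_timestamp(int64_t remote_us)
{
    int64_t now = logical_clock_us();

    if (remote_us > now)
        clock_correction_us += remote_us - now;
}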
Re: [HACKERS] The plan for FDW-based sharding
On 02/27/2016 11:38 PM, Kevin Grittner wrote: Is this an implementation of some particular formal technique? If so, do you have a reference to a paper on it? I get the sense that there has been a lot written about distributed transactions, and that it would be a mistake to ignore it, but I have not (yet) reviewed the literature for it.

The reference to the article is on our wiki page explaining our DTM:
https://wiki.postgresql.org/wiki/DTM
http://research.microsoft.com/en-us/people/samehe/clocksi.srds2013.pdf

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] The plan for FDW-based sharding
Neither pg_dtm nor pg_tsdtm supports the serializable isolation level. We implemented distributed snapshot isolation - the repeatable-read isolation level. We also do not support the read-committed isolation level now.

We do not try to preserve transaction commit order at all nodes. But in principle it can be implemented using the XTM API: it allows redefining the function which actually sets the transaction status. pg_dtm performs 2PC here, and in principle it is possible to enforce commits in any particular order.

Concerning CSNs, maybe you are right and it is not correct to use this notion in this case. Actually there are many "CSNs" involved in a transaction commit. First of all, each transaction is assigned a local CSN (timestamp) when it is ready to commit. Then the CSNs of all nodes are exchanged and the maximal CSN is chosen. This maximum is written as the final transaction CSN and is used in visibility checks.

On 02/27/2016 01:48 AM, Kevin Grittner wrote:
On Fri, Feb 26, 2016 at 2:19 PM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote: pg_tsdtm is based on another approach: it is using system time as CSN

Which brings up an interesting point, if we want logical replication to be free of serialization anomalies for those using serializable transactions, we need to support applying transactions in an order which may not be the same as commit order -- CSN (as such) would be the wrong thing. If serializable transaction 1 (T1) modifies a row and concurrent serializable transaction 2 (T2) reads the old version of the row, and modifies something based on that, T2 must be applied to a logical replica first even if T1 commits before it; otherwise the logical replica could see a state not consistent with business rules and which could not have been seen (due to SSI) on the source database. Any DTM API which does not support some mechanism to rearrange the order of transactions from commit order to some other order (based on, for example, read-write dependencies) is not complete. If it does support that, it gives us a way forward for presenting consistent data on logical replicas.

To avoid confusion, it might be best to reserve CSN for actual commit sequence numbers, or at least values which increase monotonically with each commit. The term of art for what I described above is "apparent order of execution", so maybe we want to use AOE or AOoE for the order we choose to use in a particular implementation. It doesn't seem to me to be outright inaccurate for cases where the system time on the various systems is used.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
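A minimal sketch of the commit-CSN agreement just described (my own illustration with hypothetical names and stub node calls, not actual pg_dtm code): each participant proposes its local ready-to-commit timestamp, and the transaction's final CSN is the maximum of all proposals, which every node then records.

#include <stdint.h>

typedef uint64_t CSN;

/* stand-in for a per-node call that prepares the transaction and returns
 * the node's local ready-to-commit timestamp (phase 1 of 2PC) */
static CSN node_prepare_and_propose(int node)
{
    return 1000 + (CSN) node;   /* placeholder for a real local clock */
}

/* stand-in for durably recording the agreed commit CSN on a node (phase 2) */
static void node_commit_with_csn(int node, CSN csn)
{
    (void) node;
    (void) csn;
}

static CSN commit_distributed_xact(int nnodes)
{
    CSN max_csn = 0;

    /* phase 1: collect proposals; the final CSN must not be smaller than
     * any participant's local "ready" time */
    for (int node = 0; node < nnodes; node++)
    {
        CSN proposed = node_prepare_and_propose(node);

        if (proposed > max_csn)
            max_csn = proposed;
    }

    /* phase 2: everyone records the same maximal CSN as the commit CSN,
     * which is then used in visibility checks */
    for (int node = 0; node < nnodes; node++)
        node_commit_with_csn(node, max_csn);

    return max_csn;
}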
Re: [HACKERS] The plan for FDW-based sharding
On 02/27/2016 06:54 AM, Robert Haas wrote:
On Fri, Feb 26, 2016 at 10:56 PM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote: We do not have a formal proof that the proposed XTM is "general enough" to handle all possible transaction manager implementations. But there are two general ways of dealing with isolation: snapshot based and CSN based.

I don't believe that for a minute. For example, consider this article:

Well, I have to agree: saying that there are just two ways of providing distributed isolation, I was not right. There is at least one more method: conservative locking. But it would cause a huge number of extra network messages to be exchanged. Also I mostly considered solutions compatible with the PostgreSQL MVCC model, and definitely there are other approaches, like preserving transaction commit order (as it is done in Galera). Some of them can be implemented with XTM (preserving commit order), some - not (2PL). I have already noticed that XTM does not allow implementing ANY transaction manager. But we have considered several approaches to distributed transaction management explained in articles related to really working systems. Some of them are real production systems, like SAP HANA, some are just prototypes, but working prototypes for which the authors have performed some benchmarking and comparison with other approaches. The references you have mentioned are mostly theoretical descriptions of the problem. It is nice to know them, but it is hard to build some concrete implementation based on these articles.

Briefly answering your other questions:

For example, consider a table with a million rows spread across any number of servers.

It is a sharding scenario; pg_tsdtm will work well in this case, not requiring a lot of extra messages.

Now consider another workload where each transaction reads a row on one server, reads a row on another server,

It can be solved both with pg_dtm (central arbiter) and pg_tsdtm (no arbiter). But actually your scenarios just once again prove that there can not be just one ideal distributed TM.

So maybe the goal for the GTM isn't to provide true serializability across the cluster but some lesser degree of transaction isolation. But then exactly which serialization anomalies are we trying to prevent, and why is it OK to prevent those and not others?

Absolutely agree. There are some theoretical discussions regarding CAP and different distributed levels of isolation. But in practice people want to solve their tasks. Most PostgreSQL users are using the default isolation level, read committed, although there are a lot of "wonderful" anomalies with it. Serializable transactions in Oracle actually violate the fundamental serializability rule, and still Oracle is one of the most popular databases in the world... There was an isolation bug in Postgres-XL which didn't prevent commercial customers from using it... So I do not say that discussing all these theoretical questions is not needed, as well as formally proven correctness of a distributed algorithm. But I do not understand why that should prevent providing an extensible TM API. Yes, we cannot do everything with it. But still we can implement many different approaches. I think that somehow proves that it is "general enough".

https://en.wikipedia.org/wiki/Global_serializability

I think the neutrality of that article is *very* debatable, but it certainly contradicts the idea that snapshots and CSNs are the only methods of achieving global serializability.
Or consider this lecture:

http://hssl.cs.jhu.edu/~randal/416/lectures.old/ln5.2.pdf

That's a great introduction to the problem we're trying to solve here, but again, snapshots are not mentioned, and CSNs certainly aren't mentioned. This write-up goes further, explaining three different methods for ensuring global serializability, none of which mention snapshots or CSNs:

http://heaven.eee.metu.edu.tr/~vision/LectureNotes/EE442/Ee442ch7.html

Actually, I think the second approach is basically a snapshot/CSN-type approach, but it doesn't use that terminology and the connection to what you are proposing is very unclear.

I think you're approaching this problem from a viewpoint that is entirely too focused on the code that exists in PostgreSQL today. Lots of people have done lots of academic research on how to solve this problem, and you can't possibly say that CSNs and snapshots are the only solution to this problem unless you haven't read any of those papers. The articles above aren't exceptional in mentioning neither of the approaches that you are advocating - they are typical of the literature in this area. How can it be that the only solutions to this problem are ones that are totally different from the approaches that university professors who spend time doing research on concurrency have spent time exploring? I think we need to back up here and examine our underlyin
Re: [HACKERS] Relation cache invalidation on replica
On 02/27/2016 04:16 AM, Simon Riggs wrote:
On 27 February 2016 at 00:33, Simon Riggs <si...@2ndquadrant.com> wrote:
On 27 February 2016 at 00:29, Andres Freund <and...@anarazel.de> wrote:
On 2016-02-26 18:05:55 +0300, Konstantin Knizhnik wrote:
> The reason of the problem is that invalidation messages are not delivered to replica after the end of concurrent create index. Invalidation messages are included in xlog as part of transaction commit record. Concurrent index create is split into three transactions, last of which is just performing inplace update of index tuple, marking it as valid and invalidating cache. But as far as this transaction is not assigned XID, no transaction record is created in WAL and sent to replicas. As a result, replica doesn't receive these invalidation messages.

Ugh, that's a fairly ugly bug. Looking now.

If the above is true, then the proposed fix wouldn't work either. No point in sending a cache invalidation message on the standby if you haven't also written WAL, since the catalog re-read would just see the old row. heap_inplace_update() does write WAL, which blows away the starting premise. So I'm not seeing this as an extant bug in an open source version of PostgreSQL, in my current understanding.

The inplace update really does create a record in the WAL, and this is why the index is marked as valid at the replica. But invalidation messages are sent only with the transaction commit record, and such a record is not created in this case, because there is no assigned XID. This is a real bug which was originally observed by one of our customers with different versions of Postgres (the last one they tried was 9.5.1). Then we reproduced it locally and determined the reason for the problem.

The repro scenario is very simple: you just need to create a large enough table (pgbench with scale factor 100 works well in my case) so that "create index concurrently" takes a substantial amount of time. If, while this statement is in progress, you execute some query at the replica which uses this index, it will cache the state of the relation without the index. And afterwards, even when the index is actually constructed, it will never be used in this backend (but other backends at the replica will use it).

I am not sure about the best way of fixing the problem. I have not tested whether Andres's proposal:

if (nrels != 0 || nmsgs != 0 || RelcacheInitFileInval)

actually fixes the problem. Assigning an XID in heap_inplace_update definitely should work. Is it better than forcing XID assignment in DefineIndex? I am not sure, because this problem seems to be related only to concurrent updates (but maybe I am wrong). At least not all inplace updates cause catalog invalidation and thus require XID assignment.

--
Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] The plan for FDW-based sharding
On 02/27/2016 06:57 AM, Robert Haas wrote:
On Sat, Feb 27, 2016 at 1:49 AM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote: pg_tsdtm is based on another approach: it is using system time as CSN and doesn't require an arbiter. In theory there is no limit for scalability. But differences in system time and the necessity to use more rounds of communication have a negative impact on performance.

How do you prevent clock skew from causing serialization anomalies?

If a node receives a message from the "future", it just needs to wait until this future arrives. Practically we just "adjust" the system time in this case, moving it forward (certainly the system time is not actually changed, we just set a correction value which needs to be added to the system time). This approach was discussed in the article: http://research.microsoft.com/en-us/people/samehe/clocksi.srds2013.pdf I hope the algorithm is explained in this article much better than I can do here.

A few notes:
1. I can not prove that our pg_tsdtm absolutely correctly implements the approach described in this article.
2. I didn't try to formally prove that our implementation can not cause some serialization anomalies.
3. We just ran various synchronization tests (including the simplest debit-credit test which breaks an old version of Postgres-XL) during several days and we didn't get any inconsistencies.
4. We have tested pg_tsdtm on a single node, a blade cluster and geographically distributed nodes (distance more than a thousand kilometers: one server was in Vladivostok, another in Kaliningrad). Ping between these two servers takes about 100 msec. Performance of our benchmark dropped about 100 times, but there were no inconsistencies.

Also I once again want to notice that the primary idea of the proposed patch was not pg_tsdtm. There are well-known limitations of pg_tsdtm which we will try to address in the future. What we want is to include the XTM API in PostgreSQL to be able to continue our experiments with different transaction managers and implement multimaster on top of it (our first practical goal) without affecting the PostgreSQL core. If the XTM patch is included in 9.6, then we can propose our multimaster as a PostgreSQL extension and everybody can use it. Otherwise we have to propose our own fork of Postgres, which significantly complicates using and maintaining it.

So there is no ideal solution which can work well for all clusters. This is why it is not possible to develop just one GTM, propose it as a patch for review and then (hopefully) commit it into the Postgres core. IMHO it will never happen. And I do not think that it is actually needed. What we need is a way to create our own transaction managers as Postgres extensions without affecting its core.

This seems rather defeatist. If the code is good and reliable, why should it not be committed to core?

Two reasons:
1. There is no ideal implementation of DTM which will fit all possible needs and be efficient for all clusters.
2. Even if such an implementation exists, still the right way to integrate it is for Postgres to use some kind of TM API.

I hope that everybody will agree that doing it in this way:

#ifdef PGXC
    /* In Postgres-XC, stop timestamp has to follow the timeline of GTM */
    xlrec.xact_time = xactStopTimestamp + GTMdeltaTimestamp;
#else
    xlrec.xact_time = xactStopTimestamp;
#endif

or in this way:

    xlrec.xact_time = xactUseGTM ? xactStopTimestamp + GTMdeltaTimestamp : xactStopTimestamp;

is a very, very bad idea.
In OO programming we should have an abstract TM interface and several implementations of this interface, for example MVCC_TM, 2PL_TM, Distributed_TM... This is actually what can be done with our XTM API. As far as Postgres is implemented in C, not in C++, we have to emulate interfaces using structures with function pointers. And please notice that there is completely no need to include a DTM implementation in core, as far as it is not needed for everybody: it can easily be distributed as an extension. I hope that quite soon we can propose a multimaster extension which should provide functionality similar to MySQL Galera. But even right now we have integrated pg_dtm and pg_tsdtm with pg_shard and postgres_fdw, allowing distributed consistency to be provided for them.

All arguments against XTM can be applied to any other extension API in Postgres, for example FDW. Is it general enough? There are many useful operations which currently are not handled by this API, for example performing aggregation and grouping at the foreign server side. But still it is a very useful and flexible mechanism, allowing many wonderful things to be implemented.

That is true. And everybody is entitled to an opinion on each new proposed hook, as to whether that hook is general or not. We have both accepted and rejected proposed hooks in the past.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
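As a minimal illustration of the "structure with function pointers" emulation just described (my own sketch with hypothetical names, not the actual XTM declarations): the core calls through a dispatch table, the built-in behavior fills it with defaults, and an extension can install its own implementation at load time.

#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

/* hypothetical pluggable transaction manager interface,
 * emulating an abstract class in plain C */
typedef struct TransactionManager
{
    TransactionId (*GetNewTransactionId) (void);
    bool          (*XidIsVisible) (TransactionId xid);
    void          (*SetTransactionStatus) (TransactionId xid, bool committed);
} TransactionManager;

/* trivial stand-ins for the default (local MVCC) behavior */
static TransactionId local_next_xid = 100;

static TransactionId local_get_xid(void)     { return local_next_xid++; }
static bool local_visible(TransactionId xid) { return xid < local_next_xid; }
static void local_set_status(TransactionId xid, bool committed)
{
    (void) xid;
    (void) committed;
}

static TransactionManager DefaultTM = { local_get_xid, local_visible, local_set_status };
static TransactionManager *CurrentTM = &DefaultTM;

/* what an extension like pg_dtm would call from its _PG_init();
 * a real plugin would typically wrap, not discard, the default table */
void RegisterTransactionManager(TransactionManager *tm)
{
    CurrentTM = tm;
}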
Re: [HACKERS] The plan for FDW-based sharding
On 02/26/2016 09:30 PM, Alvaro Herrera wrote:
Konstantin Knizhnik wrote: Yes, it is certainly possible to develop cluster by cloning PostgreSQL. But it cause big problems both for developers, which have to permanently synchronize their branch with master, and, what is more important, for customers, which can not use standard version of PostgreSQL. It may cause problems with system certification, with running Postgres in cloud,... Actually the history of Postgres-XL/XC and Greenplum IMHO shows that it is wrong direction.

That's not the point, though. I don't think a Postgres clone with a GTM solves any particular problem that's not already solved by the existing forks. However, if you have a clone at home and you make a GTM work on it, then you take the GTM as a patch and post it for discussion. There's no need for hooks for that. Just make sure your GTM solves the problem that it is supposed to solve. Excuse me if I've missed the discussion elsewhere -- why does PostgresPro have *two* GTMs instead of a single one?

There are many different clusters which require different approaches to managing distributed transactions. Some clusters do not need distributed transactions at all: if you are executing OLAP queries on a read-only database, a GTM will just add extra overhead.

pg_dtm uses a centralized arbiter. It is similar to the Postgres-XL DTM. The presence of a single arbiter significantly simplifies all distributed algorithms: failure detection, global deadlock elimination, ... But at the same time the arbiter is a SPOF and the main factor limiting cluster scalability.

pg_tsdtm is based on another approach: it uses system time as the CSN and doesn't require an arbiter. In theory there is no limit to its scalability. But differences in system time and the necessity to use more rounds of communication have a negative impact on performance.

So there is no ideal solution which can work well for all clusters. This is why it is not possible to develop just one GTM, propose it as a patch for review and then (hopefully) commit it into the Postgres core. IMHO it will never happen. And I do not think that it is actually needed. What we need is a way to create our own transaction managers as Postgres extensions without affecting its core.

All arguments against XTM can be applied to any other extension API in Postgres, for example FDW. Is it general enough? There are many useful operations which currently are not handled by this API, for example performing aggregation and grouping at the foreign server side. But still it is a very useful and flexible mechanism, allowing many wonderful things to be implemented.

From my point of view a good system should be as open and customizable as possible, if it doesn't hurt performance. Replacing direct function calls with indirect function calls in almost all cases does not hurt performance, nor does adding hooks. So without any extra price we get better flexibility. What's wrong with that?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] The plan for FDW-based sharding
We do not have a formal proof that the proposed XTM is "general enough" to handle all possible transaction manager implementations. But there are two general ways of dealing with isolation: snapshot based and CSN based. pg_dtm and pg_tsdtm prove that both of them can be implemented using XTM. If you know some other approach to distributed transaction manager implementation, please let us know. Otherwise your statement "is not general enough" is not concrete enough. The Postgres-XL GTM can in principle be implemented as an extension based on XTM.

This API is based on existing PostgreSQL TM functions: we do not introduce some new abstractions. Is it possible that some other TM function will have to be encapsulated? Yes, it is. But I do not see much problem with adding such a function to XTM in the future if it is actually needed. It happens with most APIs. It is awful when API functions are changed, breaking applications based on this API. But as far as the functions encapsulated in XTM are in any case present in the PostgreSQL core, I do not think that they will be changed in the future unless there are some plans to completely rewrite the Postgres transaction manager...

Yes, it is certainly possible to develop a cluster by cloning PostgreSQL. But it causes big problems both for developers, who have to permanently synchronize their branch with master, and, what is more important, for customers, who can not use the standard version of PostgreSQL. It may cause problems with system certification, with running Postgres in the cloud, ... Actually the history of Postgres-XL/XC and Greenplum IMHO shows that it is the wrong direction.

On 26.02.2016 19:06, Robert Haas wrote:
On Fri, Feb 26, 2016 at 7:21 PM, Oleg Bartunov <obartu...@gmail.com> wrote: Right now the tm is hardcoded and it doesn't matter "if other people might need" at all. We at least provide developers ("other people") the ability to work on their implementations, and the patch is safe and doesn't sacrifice anything in core.

I don't believe that. When we install APIs into core, we're committing to keep those APIs around. And I think that we're far too early in the development of transaction managers for PostgreSQL to think that we know what APIs we want to commit to over the long term. And what makes us think we really need multiple transaction managers, anyway?

If you are brave enough to say that one tm fits all, and you are able to teach the existing tm to play well in various clustering environments during the development period, which is short, then probably we don't need multiple tms. But it's too perfect to believe, and the practical solution is to let multiple groups work on their solutions.

Nobody's preventing multiple groups from working on their solutions. That's not the question. The question is why we should install hooks in core at this early stage without waiting to see which implementations prove to be best and whether those hooks are actually general enough to cater to everything people want to do. There is talk of integrating XC/XL work into PostgreSQL; it has a GTM. Postgres Pro has several GTMs. Maybe there will be others. Frankly, I'd like to see a GTM in core at some point because I'd like everybody who uses PostgreSQL to have access to a GTM. What I don't want is for every PostgreSQL company to develop its own GTM and distribute it separately from everybody else's. IIUC, MySQL kinda did that with storage engines and it resulted in the fragmentation of the community. We've had the same thing happen with replication tools - every PostgreSQL company develops their own set.
It would have been better to have ONE set that was distributed by the core project so that we didn't all do the same work over again. I don't understand the argument that without these hooks in core, people can't continue to work on this. It isn't hard to work on GTM without any core changes at all. You just patch your copy of PostgreSQL. We do this all the time, for every patch. We don't add hooks for every patch.

dtms. It's time to start working on dtm, I believe. The fact that you don't think about distributed transaction support doesn't mean there are no "other people" who have different ideas on Postgres's future. That's why we propose this patch, let's play the game!

I don't like to play games with the architecture of PostgreSQL.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] Relation cache invalidation on replica
The reason for the problem is that invalidation messages are not delivered to the replica after the end of a concurrent create index. Invalidation messages are included in the xlog as part of the transaction commit record. Concurrent index creation is split into three transactions, the last of which just performs an inplace update of the index tuple, marking it as valid and invalidating the cache. But as far as this transaction is not assigned an XID, no transaction record is created in the WAL and sent to the replicas. As a result, the replica doesn't receive these invalidation messages.

To fix the problem it is enough just to assign an XID to the transaction. It can be done by adding a GetCurrentTransactionId() call to the end of the DefineIndex function:

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 13b04e6..1024603 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -881,6 +881,12 @@ DefineIndex(Oid relationId,
 	CacheInvalidateRelcacheByRelid(heaprelid.relId);

 	/*
+	 * Force WAL commit record to ensure that replica receives invalidation
+	 * messages.
+	 */
+	GetCurrentTransactionId();
+
+	/*
 	 * Last thing to do is release the session-level lock on the parent table.
 	 */
 	UnlockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);

On 26.02.2016 15:41, Васильев Дмитрий wrote: A session opened on the replica doesn't see indexes concurrently created on the master at this time.

We have a master and a replica:
1. master: pgbench -i -s 10
2. replica: explain (analyze,verbose) select * from pgbench_accounts where abalance = 1;
3. master:
   ALTER INDEX pgbench_accounts_abalance_idx RENAME TO pgbench_accounts_abalance_idx_delme;
   CREATE INDEX CONCURRENTLY pgbench_accounts_abalance_idx ON pgbench_accounts USING btree (abalance);
   DROP INDEX pgbench_accounts_abalance_idx_delme;
4. at this time on the replica: explain (analyze,verbose) select * from pgbench_accounts where abalance = 1;

pgbench=# explain (analyze,verbose) select * from pgbench_accounts where abalance = 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using pgbench_accounts_abalance_idx on public.pgbench_accounts (cost=0.42..4.44 rows=1 width=97) (actual time=655.781..655.781 rows=0 loops=1)
   Output: aid, bid, abalance, filler
   Index Cond: (pgbench_accounts.abalance = 1)
 Planning time: 9388.259 ms
 Execution time: 655.900 ms
(5 rows)

pgbench=# explain (analyze,verbose) select * from pgbench_accounts where abalance = 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Index Scan using pgbench_accounts_abalance_idx_delme on public.pgbench_accounts (cost=0.42..4.44 rows=1 width=97) (actual time=0.014..0.014 rows=0 loops=1)
   Output: aid, bid, abalance, filler
   Index Cond: (pgbench_accounts.abalance = 1)
 Planning time: 0.321 ms
 Execution time: 0.049 ms
(5 rows)

pgbench=# explain (analyze,verbose) select * from pgbench_accounts where abalance = 1;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Seq Scan on public.pgbench_accounts (cost=0.00..28894.00 rows=1 width=97) (actual time=3060.451..3060.451 rows=0 loops=1)
   Output: aid, bid, abalance, filler
   Filter: (pgbench_accounts.abalance = 1)
   Rows Removed by Filter: 100
 Planning time: 0.087 ms
 Execution time: 3060.484 ms
(6 rows)

pgbench=# \d+ pgbench_accounts
                       Table "public.pgbench_accounts"
  Column  |     Type      | Modifiers | Storage  | Stats target | Description
----------+---------------+-----------+----------+--------------+-------------
 aid      | integer       | not null  | plain    |              |
 bid      | integer       |           | plain    |              |
 abalance | integer       |           | plain    |              |
 filler   | character(84) |           | extended |              |
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
    "pgbench_accounts_abalance_idx" btree (abalance)
Options: fillfactor=100

A newly opened session successfully uses this index. Tested with PostgreSQL 9.5.1.
--- Dmitry Vasilyev Postgres Professional: http://www.postgrespro.com Russian Postgres Company -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] The plan for FDW-based sharding
n see what workloads it covers and decide if we are willing to copy the volume of code necessary to implement all supported Postgres XC or XL workloads. (The Postgres XL license now matches the Postgres license, http://www.postgres-xl.org/2015/07/license-change-and-9-5-merge/. Postgres XC has always used the Postgres license.) If we are not willing to add code for the missing Postgres XC/XL features, Postgres XC/XL will probably remain a separate fork of Postgres. I don't think anyone knows the answer to this question, and I don't know how to find the answer except to keep going with our current FDW sharding approach.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Re: [HACKERS] pglogical - logical replication contrib module
Ok, what about the following plan:

1. Support custom WAL records (as far as I know 2ndQuadrant has such a patch).
2. Add one more function to logical decoding, allowing it to deal with custom records.

So the idea is that we somehow record DDL in the WAL (for example using an executor hook), and then it is processed using logical decoding, calling a special logical decoding plugin function to handle these records. For example, we can store DDL in the WAL just as SQL statements and so easily replay them. In this case DDL will be replicated using the same mechanism and through the same channel as DML.

On 17.02.2016 12:16, Craig Ringer wrote:
On 17 February 2016 at 16:24, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote: Thanks for your explanation. I have to agree with your arguments that in the general case replication of DDL statements using logical decoding seems to be problematic. But we are mostly considering logical decoding in a quite limited context: replication between two identical Postgres database nodes (multimaster).

Yep, much like BDR. Where all this infrastructure came from and is/was aimed at.

Do you think that in this case replication of DDL can be done as a sequence of low-level operations with system catalog tables, including manipulation with locks?

No. For one thing logical decoding doesn't see catalog tuple changes right now. Though I imagine that could be changed easily enough. More importantly - oids. You add a column to a table:

ALTER TABLE mytable ADD COLUMN mycolumn some_type UNIQUE NOT NULL DEFAULT some_function()

This writes to catalogs including:

pg_attribute
pg_constraint
pg_index
pg_class (for the index relation)

... probably more. It also refers to pg_class (for the definition of mytable), pg_type (definition of some_type), pg_proc (definition of some_function), the b-tree operator class for some_type in pg_opclass, the b-tree indexam in pg_am, ... more.

Everything is linked by oids, and the oids are all node local. You can't just blindly re-use them. If "some_type" is hstore, the oid of hstore in pg_type might be different on the upstream and downstream. The only exception is the oids of built-in types, and even then that's not guaranteed across major versions. So if you blindly replicate catalog row changes you'll get a horrible mess. That's before considering a table's relfilenode, which is initially the same as its oid, but subject to change if truncated or rewritten.

To even begin to do this half-sanely you'd have to maintain a mapping of upstream object oids->names on the downstream, with invalidations replicated from the upstream. That's only the beginning. There's handling of extensions and lots more fun.

So in your example with the ALTER TABLE statement, can we correctly replicate it to other nodes as a request to take an exclusive lock plus some manipulations with catalog tables and the data table itself?

Nope. No hope, not unless "some manipulations with catalog tables and data table itself" is a lot more comprehensive than I think you mean.

1. Add an option whether to include operations on system catalog tables in logical replication or not.

I would like to have this anyway.

2. Make it possible to replicate lock requests (can be useful not only for DDL)

I have no idea how you'd even begin to do that.

I looked at how DDL was implemented in BDR and did it in a similar way in our multimaster. But it is awful: we need to have two different channels for propagating changes.

Yeah, it's not beautiful, but maybe you misunderstood something?
The DDL is written to a table, and that table's changes are replayed along with everything else. It's consistent and keeps DDL changes as part of the xact that performed them. Maybe you misunderstood how it works in BDR and missed the indirection via a table?

Additionally, in multimaster we want to enforce cluster-wide ACID. It certainly includes operations with metadata. It will be very difficult to implement if replication of DML and DDL is done in two different ways...

That's pretty much why BDR does it this way, warts and all. Though it doesn't offer cluster-wide ACID, it does need atomic commit of xacts that may contain DML, DDL, or some mix of the two.

Let me ask one more question concerning logical replication: how difficult will it be, from your point of view, to support two-phase commit in logical replication? Are there any fundamental problems?

I haven't looked closely yet. Andres will know more. I very, very badly want to be able to decode 2PC prepared xacts myself. The main issue I'm aware of is locking - specifically the inability to impersonate another backend and treat locks held by that backend (which might be a fake backend for a pg_prepared_xacts entry) as held by ourselves for the purpose of being able to
"don't create FKs pointing from non-replicated tables to tables replicated from somewhere else" - a concept we currently have no way to express or enforce, the way we do for persistent-to-UNLOGGED FKs.

Then there's global objects. Something as simple as:

CREATE ROLE fred;
CREATE TABLE blah(...) OWNER fred;

will break replication because we only see the CREATE TABLE, not the CREATE ROLE. If we instead replayed the CREATE ROLE and there were multiple connections between different DBs on an upstream and downstream, apply would fail on all but one. But we can't anyway, since there's no way to capture that CREATE ROLE from any DB except the one it was executed in, which might not even be one of the ones doing replication.

I strongly suspect we'll need logical decoding to be made aware of such global DDL and decode it from the WAL writes to the system catalogs. Which will be fun - but at least modifications to the shared catalogs are a lot simpler than the sort of gymnastics done by ALTER TABLE, etc.

-- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
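To make the table indirection Craig describes concrete, here is a minimal sketch of capturing DDL into an ordinary replicated table via an event trigger, so that the DDL travels through the same logical decoding channel as the DML of the same transaction. The queue table and function names are hypothetical, and a real implementation would deparse the command tree rather than rely on current_query():

-- Hypothetical DDL queue table; its rows are decoded and replayed in
-- commit order together with the DML of the same transaction.
CREATE TABLE ddl_queue (
    id  bigserial PRIMARY KEY,
    ddl text NOT NULL
);

CREATE FUNCTION capture_ddl() RETURNS event_trigger
LANGUAGE plpgsql AS $$
BEGIN
    -- current_query() is a rough approximation: it stores the raw
    -- statement text instead of a properly deparsed command tree.
    INSERT INTO ddl_queue(ddl) VALUES (current_query());
END;
$$;

CREATE EVENT TRIGGER capture_ddl_on_end
    ON ddl_command_end EXECUTE PROCEDURE capture_ddl();

On the downstream, the apply worker would execute each ddl_queue.ddl row when it replays that transaction's change stream.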
Re: [HACKERS] Clock with Adaptive Replacement
Thank you very much for the response. I am not sure that CART can significantly improve PostgreSQL performance - I just want to know the community's opinion about CAR/CART and other possible alternatives to the GCLOCK algorithm. It looks like CAR really provides a better cache hit ratio and so should increase Postgres performance on some workloads. But nowadays servers have enough memory to keep most typical databases entirely in cache. So the time to locate a buffer in the cache is more critical than the time for buffer eviction. And here CART doesn't provide any benefits compared with the GCLOCK algorithm.

One of the problems with the GCLOCK algorithm, from my point of view, is that for large caches containing a large number of pages, locating a victim page can take a substantial amount of time, because we may have to perform several full sweeps before some usage count reaches zero. In theory CART can address this problem because there are no counters - just a single reference bit per page.

On 12.02.2016 18:55, Robert Haas wrote:

On Thu, Feb 11, 2016 at 4:02 PM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:

What do you think about improving the cache replacement clock-sweep algorithm in PostgreSQL with the adaptive version proposed in this article: http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf Are there some well-known drawbacks of this approach, or would it be interesting to adapt this algorithm to PostgreSQL and measure its impact on performance under different workloads? I found this ten-year-old thread: http://www.postgresql.org/message-id/flat/d2jkde$6bg$1...@sea.gmane.org#d2jkde$6bg$1...@sea.gmane.org but it mostly discusses possible patent issues with another algorithm, ARC (CAR is inspired by ARC, but it is a different algorithm). As far as I know there are several problems with the current clock-sweep algorithm in PostgreSQL, especially for very large caches. Maybe CAR can address some of them?

Maybe, but the proof of the pudding is in the eating. Just because an algorithm is smarter, newer, and better in general than our current algorithm - and really, it wouldn't be hard - doesn't mean that it will actually solve the problems we care about. A few of my EnterpriseDB colleagues spent a lot of time benchmarking various tweaks to our current algorithm last year and were unable to construct a test case where it sped anything up. If they tried the same tweaks against the 9.4 source base, they could get a speedup. But 9.5 had locking improvements around buffer eviction, and with those improvements committed there was no longer any measurable benefit to improving the quality of buffer eviction decisions. That's a surprising result, to me anyway, and somebody else might well find a test case where a benefit can be shown - but our research was not successful.

I think it's important to spend time and energy figuring out exactly what the problems with our current algorithm are. We know in general terms that usage counts tend to converge to either 5 or 0 and therefore sometimes evict buffers both at great cost and almost randomly. But what's a lot less clear is how much that actually hurts us given that we are relying on the OS cache anyway. It may be that we need to fix some other things before or after improving the buffer eviction algorithm before we actually get a performance benefit.
I suspect, for example, that a lot of the problems with large shared_buffers settings have to do with the bgwriter and checkpointer behavior rather than with the buffer eviction algorithm; and that others have to do with cache duplication between PostgreSQL and the operating system. So, I would suggest (although of course it's up to you) that you might want to focus on experiments that will help you understand where the problems are before you plunge into writing code to fix them. -- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
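As a practical aside, Robert's observation about usage counts converging to 5 or 0 can be checked on a running system with the contrib pg_buffercache extension - one way to gather the kind of evidence he suggests collecting before writing code:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Distribution of usage counts across shared buffers: under many
-- workloads most buffers cluster at usagecount 0 or 5.
SELECT usagecount, count(*) AS buffers
FROM pg_buffercache
GROUP BY usagecount
ORDER BY usagecount;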
[HACKERS] Clock with Adaptive Replacement
Hi hackers,

What do you think about improving the cache replacement clock-sweep algorithm in PostgreSQL with the adaptive version proposed in this article: http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf

Are there some well-known drawbacks of this approach, or would it be interesting to adapt this algorithm to PostgreSQL and measure its impact on performance under different workloads?

I found this ten-year-old thread: http://www.postgresql.org/message-id/flat/d2jkde$6bg$1...@sea.gmane.org#d2jkde$6bg$1...@sea.gmane.org but it mostly discusses possible patent issues with another algorithm, ARC (CAR is inspired by ARC, but it is a different algorithm).

As far as I know there are several problems with the current clock-sweep algorithm in PostgreSQL, especially for very large caches. Maybe CAR can address some of them?

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] eXtensible Transaction Manager API (v2)
Hi,

The Postgres Professional cluster team wants to propose a new version of the eXtensible Transaction Manager API. The previous discussion concerning this patch can be found here: http://www.postgresql.org/message-id/f2766b97-555d-424f-b29f-e0ca0f6d1...@postgrespro.ru

The API patch itself is small enough, but we think that it would be strange to provide just the API without examples of its usage. We have implemented various distributed transaction managers based on this API: pg_dtm (based on snapshot sharing) and pg_tsdtm (CSN-based, using local system time). Based on these two DTM implementations we have developed various "cluster" implementations: multimaster+pg_dtm, multimaster+pg_tsdtm, pg_shard+pg_dtm, pg_shard+pg_tsdtm, postgres_fdw+pg_dtm, postgres_fdw+pg_tsdtm, ... Multimaster is based on logical replication; it is something like BDR but synchronous: it provides consistency across the cluster.

But we want to make this patch as small as possible. So we decided to include in it only pg_tsdtm and a patch for postgres_fdw allowing it to be used with pg_tsdtm. pg_tsdtm is simpler than pg_dtm because the latter includes an arbiter with a RAFT protocol (a centralized service) and sockhub for efficient multiplexing of backend connections. Also, in theory, pg_tsdtm provides better scalability, because it is decentralized. The architecture of DTM and tsDTM as well as benchmark results are available at the WiKi page: https://wiki.postgresql.org/wiki/DTM

Please note that pg_tsdtm is just a reference implementation of a DTM using this XTM API. The primary idea of this patch is to add the XTM API to the PostgreSQL code, allowing one to implement custom transaction managers as Postgres extensions. So please review first of all the XTM API itself, and not pg_tsdtm, which is just an example of its usage. The complete PostgreSQL branch with all our changes can be found here: https://github.com/postgrespro/postgres_cluster

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/contrib/pg_tsdtm/Makefile b/contrib/pg_tsdtm/Makefile
new file mode 100644
index 000..e70dffc
--- /dev/null
+++ b/contrib/pg_tsdtm/Makefile
@@ -0,0 +1,20 @@
+MODULE_big = pg_tsdtm
+OBJS = pg_tsdtm.o
+
+EXTENSION = pg_tsdtm
+DATA = pg_tsdtm--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_tsdtm
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+check:
+	env DESTDIR='$(abs_top_builddir)'/tmp_install make install
+	$(prove_check)
diff --git a/contrib/pg_tsdtm/dtm_recovery/dtm_recovery.cpp b/contrib/pg_tsdtm/dtm_recovery/dtm_recovery.cpp
new file mode 100644
index 000..38285be
--- /dev/null
+++ b/contrib/pg_tsdtm/dtm_recovery/dtm_recovery.cpp
@@ -0,0 +1,129 @@
+/* NB: the archive stripped everything between angle brackets; the header
+ * names and template parameters below are reconstructed from the code. */
+#include <stdio.h>
+#include <string>
+#include <vector>
+#include <set>
+
+#include <iostream>
+#include <pqxx/pqxx>
+
+using namespace std;
+using namespace pqxx;
+
+int main (int argc, char* argv[])
+{
+    if (argc == 1) {
+        printf("Use -h to show usage options\n");
+        return 1;
+    }
+    vector<string> connections;
+    set<string> prepared_xacts;
+    set<string> committed_xacts;
+    bool verbose = false;
+    for (int i = 1; i < argc; i++) {
+        if (argv[i][0] == '-') {
+            switch (argv[i][1]) {
+              case 'C':
+              case 'c':
+                connections.push_back(string(argv[++i]));
+                continue;
+              case 'v':
+                verbose = true;
+                continue;
+            }
+        }
+        printf("Perform recovery of pg_tsdtm cluster.\n"
+               "Usage: dtm_recovery {options}\n"
+               "Options:\n"
+               "\t-c STR\tdatabase connection string\n"
+               "\t-v\tverbose mode: print extra information while processing\n");
+        return 1;
+    }
+    if (verbose) {
+        cout << "Collecting information about prepared transactions...\n";
+    }
+    for (vector<string>::iterator ic = connections.begin(); ic != connections.end(); ++ic)
+    {
+        if (verbose) {
+            cout << "Connecting to " << *ic << "...\n";
+        }
+        connection con(*ic);
+        work txn(con);
+        result r = txn.exec("select gid from pg_prepared_xacts");
+        for (result::const_iterator it = r.begin(); it != r.end(); ++it)
+        {
+            string gid = it.at("gid").as(string());
+            prepared_xacts.insert(gid);
+        }
+        txn.commit();
+    }
+    if (verbose) {
+        cout << "Prepared transactions: ";
+        for (set<string>::iterator it = prepared_xacts.begin(); it != prepared_xacts.end(); ++it)
+        {
+            cout << *it << ", ";
+        }
+        cout << "\nChecking which of them a
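The core of the (truncated) dtm_recovery utility above can be expressed in plain SQL: it collects in-doubt transactions from each node's pg_prepared_xacts view and then resolves each gid. A sketch of the same recovery pass, where the gid value is just an illustration:

-- Run on every node: list in-doubt prepared transactions.
SELECT gid, prepared, owner, database FROM pg_prepared_xacts;

-- Resolve a transaction that was prepared on all nodes:
COMMIT PREPARED 'dtm-xact-42';
-- ... or abort one that some node never prepared:
ROLLBACK PREPARED 'dtm-xact-42';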
Re: [HACKERS] Batch update of indexes
Attached please find a patch for the "ALTER INDEX ... WHERE ..." clause. It is now able to handle all three possible situations:

1. Making an index partial (adding a WHERE condition to an ordinary index)
2. Extending the range of a partial index (a less restrictive index predicate)
3. An arbitrary change of the partial index predicate

In case 2, new records are added to the index. In the other two cases the index is completely reconstructed.

This patch includes the src/bin/insbench utility for testing insert performance. It can easily be excluded from the patch to reduce its size. Also, it is better to apply this patch together with the "index-only scans with partial indexes" patch: http://www.postgresql.org/message-id/560c7213.3010...@2ndquadrant.com - only in this case will the regression test produce the expected output.

On 27.01.2016 23:15, Robert Haas wrote:

On Wed, Jan 20, 2016 at 4:28 AM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:

Please notice that such an alter table statement, changing the condition for a partial index, is not supported now. But I do not see any principal problems with supporting such a construction. We should just include in the index all records which match the new condition and do not match the old condition: ts < '21/01/2016' and not (ts < '20/01/2016')

You'd also need to remove any rows from the index that match the old condition but not the new one. In your example, that's impossible, but in general, it's definitely possible.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b450bcf..b6ffb19 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -32,6 +32,7 @@
 #include "commands/tablespace.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "funcapi.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/planner.h"
@@ -50,6 +51,9 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
+#include "utils/ruleutils.h"
+#include "executor/executor.h"
+#include "executor/spi.h"
 
 /* non-export function prototypes */
@@ -275,6 +279,160 @@ CheckIndexCompatible(Oid oldId,
 	return ret;
 }
 
+static void
+UpdateIndex(Oid indexRelationId, Node* whereClause)
+{
+	Datum		values[Natts_pg_index];
+	bool		isnull[Natts_pg_index];
+	HeapTuple	oldTuple;
+	HeapTuple	newTuple;
+	Relation	pg_index;
+
+	pg_index = heap_open(IndexRelationId, RowExclusiveLock);
+	oldTuple = SearchSysCacheCopy1(INDEXRELID, ObjectIdGetDatum(indexRelationId));
+	if (!HeapTupleIsValid(oldTuple))
+		elog(ERROR, "cache lookup failed for index %u", indexRelationId);
+
+	heap_deform_tuple(oldTuple, RelationGetDescr(pg_index), values, isnull);
+	values[Anum_pg_index_indpred - 1] = CStringGetTextDatum(nodeToString(whereClause));
+	isnull[Anum_pg_index_indpred - 1] = false;
+	newTuple = heap_form_tuple(RelationGetDescr(pg_index), values, isnull);
+	simple_heap_update(pg_index, &oldTuple->t_self, newTuple);
+	CatalogUpdateIndexes(pg_index, newTuple);
+	heap_freetuple(newTuple);
+	heap_freetuple(oldTuple);
+	heap_close(pg_index, NoLock);
+}
+
+void
+AlterIndex(Oid indexRelationId, IndexStmt *stmt)
+{
+	char	   *select;
+	Oid			heapRelationId;
+	IndexUniqueCheck checkUnique;
+	Datum		values[INDEX_MAX_KEYS];
+	bool		isnull[INDEX_MAX_KEYS];
+	Relation	heapRelation;
+	Relation	indexRelation;
+	SPIPlanPtr	plan;
+	Portal		portal;
+	HeapTuple	tuple;
+	TupleTableSlot *slot;
+	ItemPointer	tupleid;
+	IndexInfo  *indexInfo;
+	EState	   *estate;
+	Oid			namespaceId;
+	List	   *deparseCtx;
+	char	   *
+	oldIndexPredicate;
+	char	   *newIndexPredicate;
+	char	   *relationName;
+
+	Assert(stmt->whereClause);
+	CheckPredicate((Expr *) stmt->whereClause);
+
+	/* Open and lock the parent heap relation */
+	heapRelationId = IndexGetRelation(indexRelationId, false);
+	heapRelation = heap_open(heapRelationId, AccessShareLock);
+
+	/* Open the target index relation */
+	/* indexRelation = index_open(indexRelationId, RowExclusiveLock); */
+	indexRelation = index_open(indexRelationId, ShareUpdateExclusiveLock);
+	/* indexRelation = index_open(indexRelationId, AccessShareLock); */
+	namespaceId = RelationGetNamespace(indexRelation);
+
+	indexInfo = BuildIndexInfo(indexRelation);
+	Assert(!indexInfo->ii_ExclusionOps);
+
+	/*
+	 * Generate the constraint and default execution states
+	 */
+	estate = CreateExecutorState();
+
+	checkUnique = indexRelation->rd_index->indisunique ? UNIQUE_CHECK_YES : UNIQUE_CHECK_NO;
+
+	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation));
+
+	deparseCtx = deparse_context_for(RelationGetRelationName(heapRelation), heapRelationId);
+	relationName = quote_qualified_identifier(get_namespace_name(namespaceId),
+											  get_
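A usage sketch may help to see the three cases the patch handles. This is the patch's proposed non-standard syntax, with a toy table assumed for illustration:

create table t(ts timestamp primary key, c1 real);
create index idx1 on t(c1);                      -- ordinary index

-- Case 1: make an ordinary index partial.
alter index idx1 where ts < '20/01/2016';

-- Case 2: extend the range (less restrictive predicate);
-- only the newly matching records are inserted into the index.
alter index idx1 where ts < '21/01/2016';

-- Case 3: arbitrary predicate change; the index is rebuilt.
alter index idx1 where c1 > 0;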
Re: [HACKERS] Batch update of indexes
On Feb 4, 2016, at 2:00 AM, Jim Nasby wrote:
>
> My suspicion is that it would be useful to pre-order the new data before
> trying to apply it to the indexes.

Sorry, but ALTER INDEX is expected to work for all indexes, not only B-Tree, and for other index types sorting may not be possible... But for B-Tree, presorting the inserted data should certainly increase performance. I will think about it.

> --
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
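For the B-Tree case, the idea could be as simple as sorting the batch by the indexed key before applying it, so that index insertions descend the tree in key order and touch each leaf page once. A sketch, assuming new rows are accumulated in a hypothetical staging table:

-- 'staging' and column c1 are hypothetical; the ORDER BY makes the
-- subsequent B-Tree insertions sequential rather than random.
insert into t select * from staging order by c1;
truncate staging;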
Re: [HACKERS] PATCH: index-only scans with partial indexes
I have applied this patch to our working branch, and over several weeks we ran various tests and benchmarks. We have not noticed any problems or performance degradation. And on some queries this patch causes a very significant increase in performance - up to ten times:

With this patch:

postgres=# explain analyze select count(*) from t where k1<100 and pk < 1454434742881892;
                                                      QUERY PLAN
 Aggregate  (cost=29.65..29.66 rows=1 width=0) (actual time=0.108..0.108 rows=1 loops=1)
   ->  Index Only Scan using idx1 on t  (cost=0.43..27.49 rows=861 width=0) (actual time=0.012..0.071 rows=963 loops=1)
         Index Cond: (k1 < 100)
         Heap Fetches: 0
 Planning time: 0.100 ms
 Execution time: 0.121 ms
(6 rows)

Without the patch:

postgres=# explain analyze select count(*) from t where k1<100 and pk < 1454434742881892;
                                                      QUERY PLAN
 Aggregate  (cost=2951.55..2951.56 rows=1 width=0) (actual time=1.070..1.070 rows=1 loops=1)
   ->  Bitmap Heap Scan on t  (cost=19.10..2949.40 rows=861 width=0) (actual time=0.161..0.997 rows=963 loops=1)
         Recheck Cond: ((k1 < 100) AND (pk < '1454434742881892'::bigint))
         Heap Blocks: exact=954
         ->  Bitmap Index Scan on idx1  (cost=0.00..18.88 rows=861 width=0) (actual time=0.083..0.083 rows=963 loops=1)
               Index Cond: (k1 < 100)
 Planning time: 0.099 ms
 Execution time: 1.089 ms
(8 rows)

On 01.02.2016 01:11, Alvaro Herrera wrote:

Konstantin Knizhnik wrote:

I am very interested in this patch because it allows partial indexes to be used to ... speed up inserts. I have implemented an "ALTER INDEX ... WHERE ..." construction which allows changing the predicate of a partial index without the necessity of fully rebuilding it. So it is not necessary to insert new records into the index immediately (if the new records do not match the partial index conditions). It can be done later in the background (or at night). My experiments show that it allows insert speed to be increased five times (for eight partial indexes). At the same time we do not lose the RDBMS requirement that the result of a query should not depend on the presence of indexes. And it is applicable to all indexes: B-Tree, GIN, GiST, ... But such an optimization makes sense only if partial indexes can be used without extra overhead, first of all for index-only scans. And that is impossible without this patch.

That sounds interesting. So please review this patch and let us know whether you like it, or whether you have any better ideas for any particular hunk, or whether you think it should be rewritten from scratch, or just state that it is perfect. If you think it's useful, then it's a good idea to vouch for it to be integrated; and one way of doing that is making sure it meets the quality standards etc. If you don't say anything about the patch, we may never integrate it because we might have doubts about whether it's worthy.

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
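The message does not show the schema, but a plausible (hypothetical) reconstruction of a setup that yields such plans is a partial index whose predicate implies the pk condition, so that with the patch the planner can prove "pk < 1454434742881892" from the index predicate alone and choose an index-only scan:

-- Hypothetical reconstruction of the benchmark setup:
create table t(pk bigint, k1 integer);
create index idx1 on t(k1) where pk < 1454434742881892;
vacuum analyze t;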
Re: [HACKERS] PATCH: index-only scans with partial indexes
I am very interested in this patch because it allows partial indexes to be used to ... speed up inserts.

I have implemented an "ALTER INDEX ... WHERE ..." construction which allows changing the predicate of a partial index without the necessity of fully rebuilding it. So it is not necessary to insert new records into the index immediately (if the new records do not match the partial index conditions). It can be done later in the background (or at night). My experiments show that it allows insert speed to be increased five times (for eight partial indexes). At the same time we do not lose the RDBMS requirement that the result of a query should not depend on the presence of indexes. And it is applicable to all indexes: B-Tree, GIN, GiST, ...

But such an optimization makes sense only if partial indexes can be used without extra overhead, first of all for index-only scans. And that is impossible without this patch.

On 01/31/2016 03:34 PM, Alvaro Herrera wrote:

Tomas Vondra wrote:

On 12/24/2015 04:05 AM, Michael Paquier wrote:

Tomas, are you still working on that? This thread has been stalled for 3 weeks.

I haven't discovered anything interesting during the testing, so I guess the "needs review" state is appropriate. Let's move the patch to the next commitfest.

Not sure what to do here, since this patch got no feedback at all in this CF. The right thing to do, ISTM, is to just move it again to the next CF. But it'd be really useful if someone could have a look at it and verify at least whether it needs a rebase (requiring a further submission) so that other people can play with it. Of course, if Horiguchi-san or anyone has more review comments, that would be even better. Tomas said he'd do more testing, but we never got a report on whether anything turned up. (At this point I'm not sure if either Kyotaro or Tomas should be considered the patch author ... maybe both?)

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Optimizer questions
Unfortunately these two statements are not equivalent: the second one can (in theory, though not for this particular data set) return more than 10 result records. Such an optimization would be correct if t2.i were declared as unique.

But the most efficient plan for this query is generated if there is an index on t1.v. In this case no explicit sort is needed. The limit is still not pushed down, but that is not a problem because a nested loop join is used, which does not materialize its result (it produces records on demand):

# explain analyze select * from t1 left outer join t2 on t1.k=t2.k order by t1.v limit 10;
                                                       QUERY PLAN
 Limit  (cost=0.58..4.38 rows=10 width=16) (actual time=0.069..0.157 rows=10 loops=1)
   ->  Nested Loop Left Join  (cost=0.58..37926.63 rows=11 width=16) (actual time=0.067..0.154 rows=10 loops=1)
         ->  Index Scan using idxv on t1  (cost=0.29..3050.31 rows=11 width=8) (actual time=0.046..0.053 rows=10 loops=1)
         ->  Index Scan using t2_pkey on t2  (cost=0.29..0.34 rows=1 width=8) (actual time=0.007..0.007 rows=1 loops=10)
               Index Cond: (t1.k = k)
 Planning time: 0.537 ms
 Execution time: 0.241 ms
(7 rows)

On 01/30/2016 01:01 AM, Alexander Korotkov wrote:

On Fri, Jan 8, 2016 at 11:58 AM, Konstantin Knizhnik <k.knizh...@postgrespro.ru> wrote:

Attached please find an improved version of the optimizer patch for the LIMIT clause. Now I try to apply this optimization only for non-trivial columns requiring evaluation. Maybe it would be better to take into account the cost of evaluating these columns, but right now I just detect non-variable columns.

We may think about a more general feature: LIMIT pushdown. In Konstantin's patch the planner pushes LIMIT down before tlist calculation. But there are other cases when calculating LIMIT first would be beneficial. For instance, we can do LIMIT before JOINs. That is possible only for a LEFT JOIN which is not used in filter and ordering clauses. See the example below.
# create table t1 as (select i, random() v from generate_series(1,100) i);
SELECT 100
# create table t2 as (select i, random() v from generate_series(1,100) i);
SELECT 100
# explain analyze select * from t1 left join t2 on t1.i = t2.i order by t1.v limit 10;
                                                       QUERY PLAN
 Limit  (cost=87421.64..87421.67 rows=10 width=24) (actual time=1486.276..1486.278 rows=10 loops=1)
   ->  Sort  (cost=87421.64..89921.64 rows=100 width=24) (actual time=1486.275..1486.275 rows=10 loops=1)
         Sort Key: t1.v
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Hash Left Join  (cost=27906.00..65812.00 rows=100 width=24) (actual time=226.180..1366.238 rows=100
               Hash Cond: (t1.i = t2.i)
               ->  Seq Scan on t1  (cost=0.00..15406.00 rows=100 width=12) (actual time=0.010..77.041 rows=100 l
               ->  Hash  (cost=15406.00..15406.00 rows=100 width=12) (actual time=226.066..226.066 rows=100 loop
                     Buckets: 131072  Batches: 1  Memory Usage: 46875kB
                     ->  Seq Scan on t2  (cost=0.00..15406.00 rows=100 width=12) (actual time=0.007..89.002 rows=100
 Planning time: 0.123 ms
 Execution time: 1492.118 ms
(12 rows)

# explain analyze select * from (select * from t1 order by t1.v limit 10) t1 left join t2 on t1.i = t2.i;
                                                       QUERY PLAN
 Hash Right Join  (cost=37015.89..56171.99 rows=10 width=24) (actual time=198.478..301.278 rows=10 loops=1)
   Hash Cond: (t2.i = t1.i)
   ->  Seq Scan on t2  (cost=0.00..15406.00 rows=100 width=12) (actual time=0.005..74.303 rows=100 loops=1)
   ->  Hash  (cost=37015.77..37015.77 rows=10 width=12) (actual time=153.584..153.584 rows=10 loops=1)
         Buckets: 1024  Batches: 1  Memory Usage: 1kB
         ->  Limit  (cost=37015.64..37015.67 rows=10 width=12) (actual time=153.579..153.580 rows=10 loops=1)
               ->  Sort  (cost=37015.64..39515.64 rows=100 width=12) (actual time=153.578..153.578 rows=10 loops=1)
                     Sort Key: t1.v
                     Sort Method: top-N heapsort  Memory: 25kB
                     ->  Seq Scan on t1  (cost=0.00..15406.00 rows=100 width=12) (actual time=0.012..78.828 rows=100
 Planning time: 0.132 ms
 Execution time: 301.308 ms
(12 rows)

In this example LIMIT pushdown makes the query 5 times faster. It would be very nice if the optimizer did this automatically.

-- Alexander Korotkov Postgres Professional: http://www.postgrespro.com The Russian Postgres Comp
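A minimal counterexample for the correctness caveat in Konstantin's reply above: with a duplicated join key, the inner LIMIT no longer bounds the final result, so the rewrite is safe only when t2.i is unique.

create table t1(i int, v float8);
create table t2(i int, v float8);
insert into t1 values (1, 0.5);
insert into t2 values (1, 0.1), (1, 0.2);   -- t2.i is not unique

-- Original form: LIMIT applies after the join => 1 row.
select * from t1 left join t2 on t1.i = t2.i order by t1.v limit 1;

-- Rewritten form: LIMIT applies before the join => 2 rows.
select * from (select * from t1 order by v limit 1) t1
  left join t2 on t1.i = t2.i;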
Re: [HACKERS] Batch update of indexes
Hi hackers,

I have implemented an "ALTER INDEX ... WHERE ..." clause allowing one to change the condition of a partial index. Actually it allows us to append to an index without fully rebuilding it. As I explained in the previous mails, partial indexes can be used to increase insert speed. Right now I get the following results (with one insert stream):

Insert with 1 index (primary key, monotonically ascending): 324275 TPS
Insert with 9 indexes (primary key + 8 indexes with random keys): 52495 TPS
Insert with primary key and 8 concurrently updated partial indexes: 194458 TPS
Insert with primary key and 8 "frozen" partial indexes: 278446 TPS

So, as you can see, insert with indexes is about 6 times slower than insert without indexes, and partial indexes allow this gap to be almost eliminated. When the partial indexes are not affected (assuming that they will be reconstructed "at night"), performance is almost the same as without indexes. And if "ALTER INDEX" is done concurrently with inserts, it certainly decreases insert speed, but it is still 4 times faster than with normal indexes.

Such high TPS values were obtained using "insert from select" to bypass libpq overhead. With libpq (when each insert is sent as an independent statement) the results are less impressive:

Insert with 1 index (primary key, monotonically ascending): 37892 TPS
Insert with 9 indexes (primary key + 8 indexes with random keys): 20231 TPS
Insert with primary key and 8 concurrently updated partial indexes: 26934 TPS
Insert with primary key and 8 "frozen" partial indexes: 28863 TPS

But partial indexes still allow the twofold difference to be almost eliminated...

These results can be reproduced using our public repository: https://github.com/postgrespro/postgres_cluster Most of the code related to the support of "ALTER INDEX .. WHERE" is in the AlterIndex function in postgres_cluster/src/backend/commands/indexcmds.c I have also added the insbench utility for measuring insert performance, with which these results were obtained. It is located in the postgres_cluster/src/bin/insbench directory.

Known issues:

1. I do not handle the case when the new condition for a partial index is more restrictive than the original one. There is no way in Postgres to exclude records from an index (except VACUUM), so in this case the index has to be reconstructed from scratch.
2. Currently I am using SPI to locate the records which should be included in the index.
3. I am not completely sure that there are no synchronization/isolation problems in the AlterIndex function.

If this approach is considered interesting by the community, I will try to address these issues.

On 20.01.2016 12:28, Konstantin Knizhnik wrote:

Hi hackers,

I want to know the community's opinion about possible ways of solving a quite common problem: increasing insert speed while still providing indexes for efficient execution of queries.

Many applications have to deal with a high input stream of data. Most of the time spent inserting a record into the database is taken by index updates. And without indexes we are not able to execute queries efficiently. So in many cases it is desirable to have "batch or concurrent" index updates, and it is acceptable that an index is slightly behind the current state of the table.

One interesting approach to solving this problem is discussed in this article: https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb They are using materialized views to build indexes in the background.
Interesting idea, but copying the content of the whole table just to be able to build an index concurrently seems to be overkill. I thought about more straightforward ways of solving this problem. It would be nice if we could preserve one of the main postulates of Postgres and other RDBMSes: indexes are just an optimization, and the result of a query should not depend on the presence of indexes.

The first idea is to use inheritance. I have investigated different ways of splitting a table into "archival" and "operational" parts, but all of them require physical copying of data from one table to another.

Another idea is to use partial indexes (http://www.postgresql.org/docs/current/static/indexes-partial.html). Assume that we have a stream of input data where each record has an increasing timestamp:

create table t(
    ts timestamp primary key,
    c1 real,
    c2 integer,
    c3 varchar,
    ...
    cN char(5)
);

We want to provide the highest insert speed for "t" but still have indexes on the c1..cN fields. We can declare partial indexes:

create index idx1 on t(c1) where ts < '20/01/2016';
create index idx2 on t(c2) where ts < '20/01/2016';
...
create index idxN on t(cN) where ts < '20/01/2016';

Since these indexes do not cover the current date, they will not be affected
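The maintenance cycle would then advance the predicate boundary once a day with the ALTER INDEX ... WHERE clause discussed in this thread, indexing the accumulated records in one batch:

-- Nightly batch: extend each predicate so yesterday's inserts (which
-- bypassed the partial indexes) are added to the indexes in one pass.
alter index idx1 where ts < '21/01/2016';
alter index idx2 where ts < '21/01/2016';
...
alter index idxN where ts < '21/01/2016';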
Re: [HACKERS] Batch update of indexes
On 21.01.2016 10:14, Simon Riggs wrote:

On 21 January 2016 at 06:41, konstantin knizhnik <k.knizh...@postgrespro.ru> wrote:

Certainly for B-Tree we can organize the insert buffer (or pending list) as a sorted array or as a tree. But in both cases the complexity of a search in this buffer will be O(log(N)), not O(1).

You are right; search within this buffer is O(log(N)). But we can test whether we need to search in the buffer at all in O(1).

Only if records are inserted in key-ascending order... But usually we need to maintain more than one index, and for them it is not true.

-- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company
Re: [HACKERS] Batch update of indexes
On 21.01.2016 19:09, Anastasia Lubennikova wrote:

What I meant is more like a BRIN-like combination of an index scan and a heap scan. Maybe it could be called "deferred inserts" or a "temporary read-only index". Maybe it's similar to the MySQL insert buffer: http://dev.mysql.com/doc/refman/5.7/en/innodb-insert-buffering.html I think it'll be clearer with an example. Please don't care about the syntax.

CREATE TABLE tbl (c1 int);
CREATE INDEX idx on tbl(c1);
SET enable_deferred_insert(idx) = on;

At this moment, we save the last_indexed_item (its TID) somewhere in the index metapage. From that moment on, the data inserted into the table doesn't touch the index. We perform some heavy insert and then go back to the normal index behavior.

SET enable_deferred_insert(idx) = off;

This command takes all the data between the last_indexed_item and the end of the table, and inserts it into the index at one time. Of course there are new problems to deal with, but it's really useful for the use case of balancing an irregular heavy write load, isn't it?

BTW, could you explain what the reason is to copy data into the pending list and then copy it again while flushing the pending list into the index? Why not read this data directly from the table? I feel that I've missed something important here.

No, I do not think that inserted data should be placed in a pending list and then copied to the main table. It should be stored directly in the main table, and the "pending list" is just some fast, transient index. From my point of view there are two possibilities:

1. Preserve strict table-index consistency: query results should not depend on the presence of the index.
2. Support out-of-date or deferred indexes, which can be updated in the background.

The second approach is certainly more efficient, and IMHO it is acceptable for most users. But we would be violating one of the fundamental properties of RDBMSes... So I am not sure which approach to choose. The first case is also harder to implement, because we have to somehow merge two indexes during an index scan and provide proper recovery of the main index in case of failure (assuming that the pending list is maintained in memory and is lost after a fault).

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] Batch update of indexes
On Jan 21, 2016, at 5:14 AM, Simon Riggs wrote:
> On 20 January 2016 at 14:55, Konstantin Knizhnik <k.knizh...@postgrespro.ru>
> wrote:
> Hi,
>
> Hi, I am glad to see that you are interested in this too.
> I think this is a good feature and I think it will be very useful to have.
> I have already mentioned some related problems and possible improvements in
> my presentation.
> http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts
> The last two slides relate to this thread. Briefly, I've suggested thinking
> about an insertion buffer. Something very similar to it is already implemented
> in BRIN. It does not index the latest data from the heap while the number of
> trailing pages is less than pages_per_range.
>
> Do you mean GIN-like usage of an insertion buffer (there it is called the
> "pending list")? So that we have to combine a search in the main tree with a
> search in the insert buffer? Actually this is what I want to avoid (because,
> at least in the case of GIN, the pending list causes significant performance
> degradation, while an up-to-date state of a full-text index is rarely required).
>
> The degradation in performance is because a scan of the pending list is O(N).
>
> If we did the same thing for monotonic inserts into a btree, the performance
> of ruling out any contents in the pending list would be O(1), so it is more
> feasible than you say.

Sorry, I didn't get it. Certainly for B-Tree we can organize the insert buffer (or pending list) as a sorted array or as a tree. But in both cases the complexity of a search in this buffer will be O(log(N)), not O(1). O(1) is possible only if we use a hash table for the pending list and are lucky with the hash function. But in this case it will not be possible to support range queries and proper ordering. In any case, the necessity of performing two searches instead of one and combining the results will significantly complicate the index implementation. But this solution is definitely more convenient and transparent for users...

> --
> Simon Riggs http://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: [HACKERS] Batch update of indexes
Hi,

Hi, I am glad to see that you are interested in this too. I think this is a good feature and I think it will be very useful to have. I have already mentioned some related problems and possible improvements in my presentation: http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts The last two slides relate to this thread. Briefly, I've suggested thinking about an insertion buffer. Something very similar to it is already implemented in BRIN. It does not index the latest data from the heap while the number of trailing pages is less than pages_per_range.

Do you mean GIN-like usage of an insertion buffer (there it is called the "pending list")? So that we have to combine a search in the main tree with a search in the insert buffer? Actually this is what I want to avoid (because, at least in the case of GIN, the pending list causes significant performance degradation, while an up-to-date state of a full-text index is rarely required).

The next point I've thought about is bulk update. The problem is that an update like "UPDATE mytable set a = a+1;" causes N searches from the root of the B-tree. It looks very strange to me, and I'd like to fix it somehow. The obvious solution is to update all tuples on a page at a time, and keep the number of the last updated page. But maybe it's a bit off-thread here.

Bulk update is the second question (but a very important one). First I just want to be able to append to an index concurrently, without blocking inserts.

One interesting approach to solving this problem is discussed in this article: https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb They are using materialized views to build indexes in the background.

Interesting idea, but copying the content of the whole table just to be able to build an index concurrently seems to be overkill.

This approach seems like a tricky crutch to me. And I agree that it requires a lot of extra work.

It will be very interesting to know how people are using materialized views. Delayed building of indexes seems to be one of the popular use cases, although it requires large overhead, first of all storage overhead.

Please notice that such an alter table statement, changing the condition for a partial index, is not supported now.

Don't you think that this feature could be used in a very wrong way? Do not take it as criticism, just a bit of thought.

Everything which can be misused will be misused :) But I do not worry much about it... If it can address real challenges, then it will be a good thing in any case. Ideally we should be able to alter everything. A naive implementation of such an alter clause could just build a new index with a temporary name, then drop the old index and rename the new one.

There was a discussion of the patch for partial indexes: http://postgresql.nabble.com/PATCH-index-only-scans-with-partial-indexes-td5857568.html Since I haven't watched it closely, it seems to be still open. I think it'll be interesting to you.

Such a small patch... Why was it not accepted? I do not see any problems with it...

-- Anastasia Lubennikova Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com The Russian Postgres Company

-- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
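For reference, the materialized-view trick from the linked article looks roughly like the sketch below (names are illustrative): the base table stays index-free for fast inserts, while queries use an indexed materialized copy that is refreshed in the background. Note that REFRESH ... CONCURRENTLY requires a unique index on the view.

-- Base table 'tbl' has no secondary indexes, so inserts stay fast.
create materialized view tbl_indexed as select * from tbl;
create unique index on tbl_indexed(c1);   -- required for CONCURRENTLY
-- Periodically (e.g. at night or from a background job):
refresh materialized view concurrently tbl_indexed;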