Re: [HACKERS] Sum aggregate calculation for single precision real

2017-02-13 Thread Konstantin Knizhnik



On 13.02.2017 19:20, Tom Lane wrote:

Konstantin Knizhnik <k.knizh...@postgrespro.ru> writes:

I wonder why the SUM aggregate is calculated for the real (float4) type using
a floating point accumulator?

If you can't deal with the vagaries of floating-point arithmetic, you
shouldn't be storing your data in float format.  Use numeric.


4-byte floats are widely used, for example in trading applications, just 
because they are two times shorter than double and the range of the stored data is 
relatively small (not many significant digits are needed). At the same 
time the volume of stored data is very large, and switching from float4 to 
float8 would almost double it: two times more storage and an 
almost two-fold increase in query execution time.

So this is not an acceptable answer.
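
For what it is worth, the size difference itself is easy to verify with a generic 
query (just an illustration, not tied to any particular schema):

postgres=# select pg_column_size(1.0::float4) as real_bytes,
                  pg_column_size(1.0::float8) as double_bytes;
 real_bytes | double_bytes
------------+--------------
          4 |            8
(1 row)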




Are there any reasons for using the float4pl function for the SUM aggregate instead of 
float4_accum?

The latter is probably a good two orders of magnitude slower, and it
wouldn't really do much to solve the inherent accuracy problems of
adding float4 values that have a wide dynamic range.


It is not true; please notice the execution times of these two queries:

postgres=# select sum(l_quantity)  from lineitem where l_shipdate <= 
'1998-12-01';

     sum
-------------
 1.52688e+09
(1 row)

Time: 2858.852 ms
postgres=# select sum(l_quantity+0.0)  from lineitem where l_shipdate <= 
'1998-12-01';

    sum
------------
 1529738036
(1 row)

Time: 3174.529 ms


It looks like in Postgres the aggregate calculation itself is no longer a 
bottleneck compared with the tuple-deform cost.





The expectation for SUM(float4) is that you want speed and are
prepared to cope with the consequences.  It's easy enough to cast your
input to float8 if you want a wider accumulator, or to numeric if
you'd like more stable (not necessarily more accurate :-() results.
I do not think it's the database's job to make those choices for you.
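
For example, that suggestion applied to the same lineitem query used above would 
look like this (a sketch: the column stays float4 on disk, only the accumulator 
type changes):

-- accumulate in float8 while keeping float4 storage
select sum(l_quantity::float8) from lineitem where l_shipdate <= '1998-12-01';

-- or, for a more stable (decimal) accumulator
select sum(l_quantity::numeric) from lineitem where l_shipdate <= '1998-12-01';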


From my point of view this is a strange and wrong expectation.
I am choosing the "float4" type for a column just because it is enough to 
represent the range of data I have and I need to minimize the record size.
But when I am calculating a sum, I expect to receive a more or less precise 
result. Certainly I realize that even when using double it is 
possible to lose precision during the calculation, and the result may depend on 
the summation order (if we add very small and very large values). But in real use 
cases (for example in trading data) such a large difference in attribute 
values is very rare. If you have, for example, a stock price, then it is 
very unlikely that one company has the value 0.01 and another 1000.0.
At least in the TPC-H example (which certainly deals with dummy generated 
data), the double type produces an "almost precise" result.
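
To make the precision loss concrete, here is a small self-contained demonstration 
(it uses generate_series instead of TPC-H data, so it only illustrates the effect):

-- 10 million integer-valued rows; the true total is 50000005000000
select sum(x::float4) as sum_float4,  -- accumulated in float4: only ~7 significant digits survive
       sum(x::float8) as sum_float8   -- accumulated in float8: exact here, the total fits in 53 bits
from generate_series(1, 10000000) as g(x);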




regards, tom lane


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] [WIP]Vertical Clustered Index (columnar store extension)

2017-02-13 Thread Konstantin Knizhnik

Hi,

I wonder if it is possible to somehow benchmark your clustered index 
implementation.

I tried to create a VCI index for the lineitem table from TPC-H and run the Q6 query.
After index creation Postgres no longer uses a parallel execution plan, 
but the speed of the sequential plan is unchanged,

and nothing in the query execution plan indicates that the VCI index is used:


postgres=# explain select
sum(l_extendedprice*l_discount) as revenue
from
lineitem_projection
where
l_shipdate between '1996-01-01' and '1997-01-01'
and l_discount between 0.08 and 0.1
and l_quantity < 24;
                                  QUERY PLAN
--------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=608333.85..608333.86 rows=1 width=4)
   ->  Gather  (cost=608333.23..608333.84 rows=6 width=4)
         Workers Planned: 6
         ->  Partial Aggregate  (cost=607333.23..607333.24 rows=1 width=4)
               ->  Parallel Seq Scan on lineitem_projection  (cost=0.00..607024.83 rows=61680 width=8)
                     Filter: ((l_shipdate >= '1996-01-01'::date) AND (l_shipdate <= '1997-01-01'::date) AND (l_discount >= '0.08'::double precision) AND (l_discount <= '0.1'::double precision) AND (l_quantity < '24'::double precision))
(6 rows)

postgres=# select
sum(l_extendedprice*l_discount) as revenue
from
lineitem_projection
where
l_shipdate between '1996-01-01' and '1997-01-01'
and l_discount between 0.08 and 0.1
and l_quantity < 24;
   revenue
-------------
 6.2e+08
(1 row)

Time: 1171.324 ms (00:01.171)

postgres=# create index vci_idx on lineitem_projection using 
vci(l_shipdate,l_quantity,l_extendedprice,l_discount,l_tax,l_returnflag,l_linestatus);

CREATE INDEX
Time: 4.705 ms


postgres=# explain select
* from
lineitem_projection
where
l_shipdate between '1996-01-01' and '1997-01-01'
and l_discount between 0.08 and 0.1
and l_quantity < 24;
                                  QUERY PLAN
--------------------------------------------------------------------------------------
 Seq Scan on lineitem_projection  (cost=0.00..382077.00 rows=1 width=22)
   Filter: ((l_shipdate >= '1996-01-01'::date) AND (l_shipdate <= '1997-01-01'::date) AND (l_discount >= '0.08'::double precision) AND (l_discount <= '0.1'::double precision) AND (l_quantity < '24'::double precision))
(2 rows)

postgres=# select
sum(l_extendedprice*l_discount) as revenue
from
lineitem_projection
where
l_shipdate between '1996-01-01' and '1997-01-01'
and l_discount between 0.08 and 0.1
and l_quantity < 24;
   revenue
-------------
 6.2112e+08
(1 row)

Time: 4304.355 ms (00:04.304)


Is there any query which can demonstrate the advantages of using the 
VCI index?
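
One generic thing worth checking (a standard planner knob, nothing VCI-specific) is 
whether the planner considers the VCI path at all once sequential scans are penalized:

set enable_seqscan = off;
explain select
sum(l_extendedprice*l_discount) as revenue
from
lineitem_projection
where
l_shipdate between '1996-01-01' and '1997-01-01'
and l_discount between 0.08 and 0.1
and l_quantity < 24;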


On 06.02.2017 04:26, Haribabu Kommi wrote:



On Fri, Feb 3, 2017 at 8:28 PM, Konstantin Knizhnik 
<k.knizh...@postgrespro.ru <mailto:k.knizh...@postgrespro.ru>> wrote:


On 30.12.2016 06:55, Haribabu Kommi wrote:


Hi All,

Fujitsu was interested in developing a columnar storage extension
with minimal changes to the server backend.


We at PostgresPro are also very interested in developing a vertical
storage (VS) for Postgres.
And after considering many alternatives, we came to the conclusion
that the approach based on representing a columnar store as an access
method (index) is the most promising one (a rough wiring sketch follows
the list below).

It allows us to:
1. Implement VS as an extension without affecting the Postgres core.
2. Have both ROS and WOS.
3. Create multiple projections (as in Vertica).
4. Optimize insert speed by supporting batch inserts and using a flexible
recovery model for VS.
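
For reference, this is how such a store plugs in through the standard access-method
interface (the handler function name here is hypothetical; an extension would normally
create the access method in its install script, and the index DDL mirrors the VCI
example earlier in this digest):

-- done once by the extension's install script:
-- create access method vci type index handler vci_index_handler;

-- the columnar "projection" is then just an index over the chosen columns:
create index lineitem_vci on lineitem
using vci (l_shipdate, l_quantity, l_extendedprice, l_discount);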

So it is very similar to your approach, but there are a few
differences:

1. Our intention is to completely eliminate changes in the Postgres core.

You wrote:

Yes, it is a mix of both index and table access methods. The
current design
of Vertical clustered index needs both access methods, because of
this reason
we used both access methods.

But I still do not completely understand why it is not possible to
use VS in index-only scans with the standard Postgres executor and
without any changes.
Why is it not possible to rely on the standard rules for applying
indexes in the Postgres optimizer, based on the costs provided by our AM
implementation?


In our storage design, we used a TID-CRID map to map a record in the heap
to the columnar storage. Because of HOT updates, the new data would not be 
inserted
into indexes, but this creates a problem for the columnar storage, so 
we added

a hook to insert index data even if the update is HOT.

And also we added another hook for initializing the 

[HACKERS] VOPS: vectorized executor for Postgres: how to speedup OLAP queries more than 10 times without changing anything in Postgres executor

2017-02-13 Thread Konstantin Knizhnik

Hello hackers,

There have been many discussions concerning possible ways of speeding up 
Postgres. Different approaches were suggested:


- JIT (now we have three different prototype implementations based on LLVM)
- Chunked (vectorized) executor
- Replacing pull with push
- Columnar store (cstore_fdw, IMCS)
- Optimizing and improving the current executor (reducing tuple deform 
overhead, function call overhead, ...)


Obviously the best result can be achieved by combining all of these 
approaches. But actually they are more or less interchangeable: 
vectorized execution does not eliminate interpretation overhead, but that 
overhead is divided by the vector size and becomes less critical.


I decided to write a small prototype to estimate the possible speed 
improvement of a vectorized executor. I created special types representing 
a "tile" and implemented the standard SQL operators for them. So neither the 
Postgres planner, nor the Postgres executor, nor the Postgres heap manager 
is changed. But I was able to reach more than a 10 times speed 
improvement on the TPC-H Q1/Q6 queries!


Please find more information here: 
https://cdn.rawgit.com/postgrespro/vops/ddcbfbe6/vops.html
The sources of the project can be found here: 
https://github.com/postgrespro/vops.git


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





[HACKERS] Sum aggregate calculation for single precision real

2017-02-13 Thread Konstantin Knizhnik

Hi hackers,

I wonder why the SUM aggregate is calculated for the real (float4) type using 
a floating point accumulator?

It causes very confusing and unexpected behavior:

postgres=# select sum(l_quantity)  from lineitem where l_shipdate <= 
'1998-12-01';
     sum
-------------
 1.52688e+09
(1 row)

postgres=# select sum(l_quantity+0.0)  from lineitem where l_shipdate <= 
'1998-12-01';
    sum
------------
 1529738036


Is it specified in any SQL standard how aggregates should be calculated?
At least Oracle and MS-SQL calculate SUM for the single precision type in a 
different (and more natural) way.
Are there any reasons for using the float4pl function for the SUM aggregate instead of 
float4_accum?
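
For reference, the transition function that the built-in aggregate uses can be checked 
directly in the catalog (a generic catalog query, shown here only to illustrate the point):

postgres=# select aggtransfn, aggtranstype::regtype
           from pg_aggregate
           where aggfnoid = 'sum(real)'::regprocedure;
 aggtransfn | aggtranstype
------------+--------------
 float4pl   | real
(1 row)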



Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Active zombies at AIX

2017-02-06 Thread Konstantin Knizhnik
I tried to rebuild Postgres without mmap and the problem disappears 
(pgbench with -C doesn't cause connection limit exhaustion any more).
Unfortunately there is no convenient way to configure PostgreSQL not 
to use mmap.

I had to edit the sysv_shmem.c file, commenting out

#ifndef EXEC_BACKEND
#define USE_ANONYMOUS_SHMEM
#endif

I wonder why we now prohibit configuring Postgres without mmap?


On 06.02.2017 12:47, Konstantin Knizhnik wrote:

Last update on the problem.
Using the kdb tool (thanks to Tony Reix for advice and help) we got the 
following stack trace of a Postgres backend in exiting state:


pvthread+073000 STACK:
[005E1958]slock+000578 (005E1958, 80001032 [??])
[9558].simple_lock+58 ()
[00651DBC]vm_relalias+00019C (??, ??, ??, ??, ??)
[006544AC]vm_map_entry_delete+00074C (??, ??, ??)
[00659C30]vm_map_delete+000150 (??, ??, ??, ??)
[00659D88]vm_map_deallocate+48 (??, ??)
[0011C588]kexitx+001408 (??)
[000BB08C]kexit+8C ()
___ Recovery (FFF9290) ___
WARNING: Eyecatcher/version mismatch in RWA


So there seems to be lock contention while unmapping memory segments.
My assumption was that Postgres detaches all attached segments 
before exit (in the shmem_exit callback or earlier).
I have added logging around the proc_exit_prepare function (which is 
called by the atexit callback) and checked that it completes immediately.
So I thought that this vm_map_deallocate might be related to 
deallocation of normal (malloc'ed) memory, because on Linux the memory 
allocator may use mmap.

But on AIX this is not the case.
Below is a report from Bergamini Demien (once again, many thanks for 
help with investigating the problem):


The memory allocator in AIX libc does not use mmap and vm_relalias() 
is only called for shared memory mappings.


I talked with the AIX VMM expert at IBM and he said that what you hit 
is one of the most common performance bottlenecks in AIX memory 
management.


He also said that SysV Shared Memory (shmget/shmat) performs better on 
AIX than mmap.


Some improvements have been made in AIX 6.1 (see “perf suffers when 
procs sharing the same segs all exit at once”: 
http://www-01.ibm.com/support/docview.wss?uid=isg1IZ83819) but it does 
not help in your case.


In src/backend/port/sysv_shmem.c, it says that PostgreSQL 9.3 switched 
from using SysV Shared Memory to using mmap.


Maybe you could try to switch back to using SysV Shared Memory on AIX 
to see if it helps performance-wise.


Also, the good news is that there are some restricted tunables in AIX 
that can be tweaked to help different workloads which may have 
different demands.


One of them is relalias_percentage which works with force_relalias_lite:

# vmo -h relalias_percentage

Help for tunable relalias_percentage:

Purpose:

If force_relalias_lite is set to 0, then this specifies the factor 
used in the heuristic to decide whether to avoid locking the source 
mmapped segment or not.


Values:

Default: 0
Range: 0 - 32767
Type: Dynamic
Unit:

Tuning:

This is used when tearing down an mmapped region and is a scalability 
statement, where avoiding the lock may help system throughput, but, in 
some cases, at the cost of more compute time used. If the number of 
pages being unmapped is less than this value divided by 100 and 
multiplied by the total number of pages in memory in the source 
mmapped segment, then the source lock will be avoided. A value of 0 
for relalias_percentage, with force_relalias_lite also set to 0, will 
cause the source segment lock to always be taken. Effective values for 
relalias_percentage will vary by workload, however, a suggested value 
is: 200.



You may also try to play with the munmap_npages vmo tunable.
Your vmo settings for lgpg_size, lgpg_regions and v_pinshm already 
seem correct.



On 24.01.2017 18:08, Konstantin Knizhnik wrote:

Hi hackers,

Yet another story about AIX. For some reason AIX cleans up zombie 
processes very slowly.
If we launch pgbench with the -C parameter, then the limit on the 
maximal number of connections is very soon exhausted.
If the maximal number of connections is set to 1000, then after ten 
seconds of pgbench activity we get about 900 zombie processes and it 
takes about 100 seconds (!)

before all of them are terminated.

proctree shows a lot of defunct processes:

[14:44:41]root@postgres:~ # proctree 26084446
26084446 /opt/postgresql/xlc/9.6/bin/postgres -D /postg_fs/postgresql/xlc
4784362 
4980786 
11403448 
11468930 
11993176 
12189710 
12517390 
13238374 
13565974 
13893826 postgres: wal writer process
14024716 
15401000 
...
25691556 

But ps shows that status of process is 

[14:46:02]root@postgres:~ # ps -elk | grep 25691556

  * A - 25691556 - - - - - 


A breakpoint set in the reaper() function in the postmaster shows that each 
invocation of this function (called by the SIGCHLD handler) processes 5-10 
PIDs per invocation.
So there are two hypotheses: either AIX is very slow in delivering 
SIGCHLD to the parent, or exit

Re: [HACKERS] Active zombies at AIX

2017-02-06 Thread Konstantin Knizhnik

Last update on the problem.
Using the kdb tool (thanks to Tony Reix for advice and help) we got the 
following stack trace of a Postgres backend in exiting state:


pvthread+073000 STACK:
[005E1958]slock+000578 (005E1958, 80001032 [??])
[9558].simple_lock+58 ()
[00651DBC]vm_relalias+00019C (??, ??, ??, ??, ??)
[006544AC]vm_map_entry_delete+00074C (??, ??, ??)
[00659C30]vm_map_delete+000150 (??, ??, ??, ??)
[00659D88]vm_map_deallocate+48 (??, ??)
[0011C588]kexitx+001408 (??)
[000BB08C]kexit+8C ()
___ Recovery (FFF9290) ___
WARNING: Eyecatcher/version mismatch in RWA


So there seems to be lock contention while unmapping memory segments.
My assumption was that Postgres detaches all attached segments 
before exit (in the shmem_exit callback or earlier).
I have added logging around the proc_exit_prepare function (which is called 
by the atexit callback) and checked that it completes immediately.
So I thought that this vm_map_deallocate might be related to 
deallocation of normal (malloc'ed) memory, because on Linux the memory 
allocator may use mmap.

But on AIX this is not the case.
Below is a report from Bergamini Demien (once again, many thanks for 
help with investigating the problem):


The memory allocator in AIX libc does not use mmap and vm_relalias() is 
only called for shared memory mappings.


I talked with the AIX VMM expert at IBM and he said that what you hit is 
one of the most common performance bottlenecks in AIX memory management.


He also said that SysV Shared Memory (shmget/shmat) performs better on 
AIX than mmap.


Some improvements have been made in AIX 6.1 (see “perf suffers when 
procs sharing the same segs all exit at once”: 
http://www-01.ibm.com/support/docview.wss?uid=isg1IZ83819) but it does 
not help in your case.


In src/backend/port/sysv_shmem.c, it says that PostgreSQL 9.3 switched 
from using SysV Shared Memory to using mmap.


Maybe you could try to switch back to using SysV Shared Memory on AIX to 
see if it helps performance-wise.


Also, the good news is that there are some restricted tunables in AIX 
that can be tweaked to help different workloads which may have different 
demands.


One of them is relalias_percentage which works with force_relalias_lite:

# vmo -h relalias_percentage

Help for tunable relalias_percentage:

Purpose:

If force_relalias_lite is set to 0, then this specifies the factor used 
in the heuristic to decide whether to avoid locking the source mmapped 
segment or not.


Values:

Default: 0
Range: 0 - 32767
Type: Dynamic
Unit:

Tuning:

This is used when tearing down an mmapped region and is a scalability 
statement, where avoiding the lock may help system throughput, but, in 
some cases, at the cost of more compute time used. If the number of 
pages being unmapped is less than this value divided by 100 and 
multiplied by the total number of pages in memory in the source mmapped 
segment, then the source lock will be avoided. A value of 0 for 
relalias_percentage, with force_relalias_lite also set to 0, will cause 
the source segment lock to always be taken. Effective values for 
relalias_percentage will vary by workload, however, a suggested value 
is: 200.



You may also try to play with the munmap_npages vmo tunable.
Your vmo settings for lgpg_size, lgpg_regions and v_pinshm already seem 
correct.



On 24.01.2017 18:08, Konstantin Knizhnik wrote:

Hi hackers,

Yet another story about AIX. For some reason AIX cleans up zombie 
processes very slowly.
If we launch pgbench with the -C parameter, then the limit on the 
maximal number of connections is very soon exhausted.
If the maximal number of connections is set to 1000, then after ten seconds 
of pgbench activity we get about 900 zombie processes and it takes 
about 100 seconds (!)

before all of them are terminated.

proctree shows a lot of defunct processes:

[14:44:41]root@postgres:~ # proctree 26084446
26084446 /opt/postgresql/xlc/9.6/bin/postgres -D /postg_fs/postgresql/xlc
4784362 
4980786 
11403448 
11468930 
11993176 
12189710 
12517390 
13238374 
13565974 
13893826 postgres: wal writer process
14024716 
15401000 
...
25691556 

But ps shows that status of process is 

[14:46:02]root@postgres:~ # ps -elk | grep 25691556

  * A - 25691556 - - - - - 


A breakpoint set in the reaper() function in the postmaster shows that each 
invocation of this function (called by the SIGCHLD handler) processes 5-10 
PIDs per invocation.
So there are two hypotheses: either AIX is very slow in delivering 
SIGCHLD to the parent, or the exit of a process takes too much time.


The fact that the backends are in exiting state makes the second hypothesis 
more plausible.
We have tried different Postgres configurations with local and TCP 
sockets, with different amounts of shared buffers, and built both with 
gcc and xlc.

In all cases the behavior is similar: zombies do not want to die.
Since it is not possible to attach a debugger to a defunct process, it 
is not clear how to understand what's going on.


I

Re: [HACKERS] logical decoding of two-phase transactions

2017-02-03 Thread Konstantin Knizhnik
On 02/04/2017 03:08 AM, Andres Freund wrote:
> On 2017-02-03 18:47:23 -0500, Robert Haas wrote:
>> On Fri, Feb 3, 2017 at 6:00 PM, Andres Freund <and...@anarazel.de> wrote:
>>> I still haven't seen a credible model for being able to apply a stream
>>> of interleaved transactions that can roll back individually; I think we
>>> really need the ability to have multiple transactions alive in one
>>> backend for that.
>> Hmm, yeah, that's a problem.  That smells like autonomous transactions.
> Unfortunately the last few proposals, like spawning backends, to deal
> with autonomous xacts aren't really suitable for replication, unless you
> only have very large ones.  And it really needs to be an implementation
> where ATs can freely be switched inbetween.  On the other hand, a good
> deal of problems (like locking) shouldn't be an issue, since there's
> obviously a possible execution schedule.
>
> I suspect this'd need some low-level implemention close to xact.c that'd
> allow switching between transactions.

Let me add my two cents here:

1. We are using logical decoding in our multimaster and applying transactions 
concurrently by a pool of workers. Unlike asynchronous replication, in 
multimaster we need to perform voting for each transaction commit, so if 
transactions are applied by a single worker, then performance will be awful and, 
moreover, there is a big chance to get a "deadlock" where none of the workers can 
complete voting because different nodes are performing voting for different transactions.

I cannot say that there are no problems with this approach. There are 
definitely a lot of challenges. First of all we need to use a special DTM 
(distributed transaction manager) to provide consistent applying of transactions 
at different nodes. The second problem is once again related to the kind of 
"deadlock" explained above. Even if we apply transactions concurrently, it is still 
possible to get such a deadlock if we do not have enough workers. This is why we 
allow launching extra workers dynamically (but finally it is limited by the maximal 
number of configured bgworkers).

But in any case, I think that "parallel apply" is a "must have" mode for logical 
replication.

2. We have implemented autonomous transactions in PgPro EE. Unlike the proposal 
currently present at the commitfest, we execute an autonomous transaction within the 
same backend. So we are just storing and restoring the transaction context. 
Unfortunately that is also not a cheap operation. An autonomous transaction should 
not see any changes done by the parent transaction (because the parent can be rolled 
back after the commit of the autonomous transaction). But there are catalog and 
relation caches inside the backend, so we have to clean these caches before switching 
to ATX. It is quite an expensive operation, and so the speed of execution of a 
PL/pgSQL function with an autonomous transaction is several orders of magnitude 
slower than without it. So autonomous transactions can be used for audit (that is the 
primary goal of using ATX in Oracle PL/SQL applications), but this mechanism is not 
efficient for concurrent execution of multiple transactions in one backend.


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] [WIP]Vertical Clustered Index (columnar store extension)

2017-02-03 Thread Konstantin Knizhnik
, during the select operation, the data in WOS is converted into Local ROS 
for the statement to be executed. The conversion cost depends upon the number 
of tuples present in the WOS file. This may add some performance overhead for 
select statements. The life of the Local ROS lasts till the end of the query 
context.

Delete
======

During the delete operation, whenever data is deleted in the heap, the data in 
the WOS file is marked as deleted at the same time, similar to the heap. But if 
the data has already been migrated from WOS to ROS, then we maintain a delete 
vector to store the details of the tuple id, transaction information, etc.
During a data read from the ROS file, it is checked against the delete vector, 
which confirms whether the record is visible or not. All the delete vector 
data is applied to the ROS periodically.

More details of the internal relations and their usage are available in the 
README.
It still needs more updates to explain the full details of the columnar 
index design.


The concept of Vertical clustered index columnar extension is from 
Fujitsu Labs, Japan.


Following is the brief schedule of patches that are required
for a better performing columnar store.

1. Minimal server changes (new relkind "CSTORE" option)
2. Base storage patch
3. Support for moving data from WOS to ROS
4. Local ROS support
5. Custom scan support to read the data from ROS and Local ROS
6. Background worker support for data movement
7. Expression state support in VCI
8. Aggregation support in VCI
9. Pg_dump support for the new type of relations
10. psql \d command support for CSTORE relations
11. Parallelism support
12. Compression support
13. In-memory support with dynamic shared memory

Currently I attached only patches for 1 and 2. These will provide the
basic changes that are required in the PostgreSQL core for the extension
to work.

I have to rebase/rewrite the rest of the patches to the latest master,
and share them with community.

Any Comments on the approach?

Regards,
Hari Babu
Fujitsu Australia




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Deadlock in XLogInsert at AIX

2017-02-02 Thread Konstantin Knizhnik

Hi Tony,

On 02.02.2017 17:10, REIX, Tony wrote:


Hi Konstantin

I've discussed the "zombie/exit" issue with our expert here.

- He does not think that AIX has anything special here

- If the process is marked  in ps, this is because the flag 
SEXIT is set, thus the process is blocked somewhere in the kexitx() 
syscall, waiting for something.


- In order to know what it is waiting for, the best would be to have a 
look with *kdb*.




kdb shows the following stack:

pvthread+073000 STACK:
[005E1958]slock+000578 (005E1958, 80001032 [??])
[9558].simple_lock+58 ()
[00651DBC]vm_relalias+00019C (??, ??, ??, ??, ??)
[006544AC]vm_map_entry_delete+00074C (??, ??, ??)
[00659C30]vm_map_delete+000150 (??, ??, ??, ??)
[00659D88]vm_map_deallocate+48 (??, ??)
[0011C588]kexitx+001408 (??)
[000BB08C]kexit+8C ()
___ Recovery (FFF9290) ___
WARNING: Eyecatcher/version mismatch in RWA


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Deadlock in XLogInsert at AIX

2017-02-02 Thread Konstantin Knizhnik
   staff 0 Feb  2 09:09 3256
-rw---1 pgstbf   staff 0 Feb  2 09:09 3256_vm
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3257
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3258
-rw---1 pgstbf   staff 16384 Feb  2 09:09 3394
-rw---1 pgstbf   staff 24576 Feb  2 09:09 3394_fsm
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3394_vm
-rw---1 pgstbf   staff 16384 Feb  2 09:09 3395
-rw---1 pgstbf   staff 32768 Feb  2 09:14 3455
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3456
-rw---1 pgstbf   staff 24576 Feb  2 09:09 3456_fsm
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3456_vm
-rw---1 pgstbf   staff 0 Feb  2 09:09 3466
-rw---1 pgstbf   staff 0 Feb  2 09:09 3466_vm
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3467
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3468
-rw---1 pgstbf   staff 0 Feb  2 09:09 3501
-rw---1 pgstbf   staff 0 Feb  2 09:09 3501_vm
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3502
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3503
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3534
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3541
-rw---1 pgstbf   staff 24576 Feb  2 09:09 3541_fsm
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3541_vm
-rw---1 pgstbf   staff 16384 Feb  2 09:09 3542
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3574
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3575
-rw---1 pgstbf   staff 0 Feb  2 09:09 3576
-rw---1 pgstbf   staff 0 Feb  2 09:09 3576_vm
-rw---1 pgstbf   staff 0 Feb  2 09:09 3596
-rw---1 pgstbf   staff 0 Feb  2 09:09 3596_vm
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3597
-rw---1 pgstbf   staff 0 Feb  2 09:09 3598
-rw---1 pgstbf   staff 0 Feb  2 09:09 3598_vm
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3599
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3600
-rw---1 pgstbf   staff 24576 Feb  2 09:09 3600_fsm
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3600_vm
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3601
-rw---1 pgstbf   staff 24576 Feb  2 09:09 3601_fsm
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3601_vm
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3602
-rw---1 pgstbf   staff 24576 Feb  2 09:09 3602_fsm
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3602_vm
-rw---1 pgstbf   staff 16384 Feb  2 09:09 3603
-rw---1 pgstbf   staff 24576 Feb  2 09:09 3603_fsm
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3603_vm
-rw---1 pgstbf   staff 16384 Feb  2 09:09 3604
-rw---1 pgstbf   staff 16384 Feb  2 09:09 3605
-rw---1 pgstbf   staff 16384 Feb  2 09:09 3606
-rw---1 pgstbf   staff 16384 Feb  2 09:09 3607
-rw---1 pgstbf   staff 16384 Feb  2 09:09 3608
-rw---1 pgstbf   staff 32768 Feb  2 09:09 3609
-rw---1 pgstbf   staff 16384 Feb  2 09:09 3712
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3764
-rw---1 pgstbf   staff 24576 Feb  2 09:09 3764_fsm
-rw---1 pgstbf   staff  8192 Feb  2 09:09 3764_vm
-rw---1 pgstbf   staff 16384 Feb  2 09:09 3766
-rw---1 pgstbf   staff 16384 Feb  2 09:09 3767
-rw---1 pgstbf   staff  8192 Feb  2 09:09 548
-rw---1 pgstbf   staff  8192 Feb  2 09:09 549
-rw---1 pgstbf   staff 0 Feb  2 09:09 826
-rw---1 pgstbf   staff 0 Feb  2 09:09 826_vm
-rw---1 pgstbf   staff  8192 Feb  2 09:09 827
-rw---1 pgstbf   staff  8192 Feb  2 09:09 828
-rw---1 pgstbf   staff 4 Feb  2 09:09 PG_VERSION
-rw---1 pgstbf   staff   512 Feb  2 09:09 pg_filenode.map
-rw---1 pgstbf   staff112660 Feb  2 09:09 pg_internal.init



Le 01/02/2017 à 21:26, Konstantin Knizhnik a écrit :

On 02/01/2017 08:30 PM, REIX, Tony wrote:


Hi Konstantin,



If you could*share scripts or instructions about the tests you are 
doing with pgbench*, I would reproduce here.




You do not need any script.
Just two simple commands.
One to initialize database:

pgbench -i -s 1000

And another to run benchmark itself:

pgbench -c 100 -j 20 -P 1 -T 10
...


Regards,

Tony


Le 01/02/2017 à 16:59, Konstantin Knizhnik a écrit :

Hi Tony,

On 01.02.2017 18:42, REIX, Tony wrote:


Hi Konstantin

*XLC.*

I'm on AIX 7.1 for now.

I'm using this version of

Re: [HACKERS] Deadlock in XLogInsert at AIX

2017-02-02 Thread Konstantin Knizhnik
IPRA.$BackendRun() at 0x1001301ac
BackendStartup() at 0x10012f784
postmaster.IPRA.$do_start_bgworker.IPRA.$ServerLoop() at 0x10012fec4
PostmasterMain() at 0x100134760
main() at 0x106ac
(dbx) where
send(??, ??, ??, ??) at 0x910cfb4
secure_write() at 0x100141a94
IPRA.$internal_flush() at 0x10013e800
socket_flush() at 0x10013ee7c
ReadyForQuery@AF106_4() at 0x10033241c
PostgresMain() at 0x10006eb7c
IPRA.$BackendRun() at 0x1001301ac
BackendStartup() at 0x10012f784
postmaster.IPRA.$do_start_bgworker.IPRA.$ServerLoop() at 0x10012fec4
PostmasterMain() at 0x100134760
main() at 0x106ac
(dbx) cont

execution completed (exit code 1)
(dbx) quit
^C
(dbx) quit
libdebug assertion "(rc == DB_SUCCESS)" failed at line 162 in file 
../../../../../../../../../../../src/bos/usr/ccs/lib/libdbx/libdebug/modules/procdebug/ptrace/procdb_PtraceSession.C


I have no idea what's going on. It is a release version without debug 
information, assert checks or lwlock info, so it is hard to debug.
Heikki, I would be glad if you have a chance to log in to the system 
and look at it yourself.

Maybe you will have some idea of what's happening...

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Deadlock in XLogInsert at AIX

2017-02-01 Thread Konstantin Knizhnik
On 02/01/2017 08:28 PM, Heikki Linnakangas wrote:
>
> But if there's no pressing reason to change it, let's leave it alone. It's 
> not related to the problem at hand, right?
>

Yes, I agree with you: we should better leave it as it is.


-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Deadlock in XLogInsert at AIX

2017-02-01 Thread Konstantin Knizhnik
On 02/01/2017 08:30 PM, REIX, Tony wrote:
>
> Hi Konstantin,
>
> Please run:*/opt/IBM/xlc/13.1.3/bin/xlc -qversion*  so that I know your exact 
> XLC v13 version.
>
IBM XL C/C++ for AIX, V13.1.3 (5725-C72, 5765-J07)

> I'm building on Power7 and not giving any architecture flag to XLC.
>
> I'm not using *-qalign=natural* . Thus, by default, XLC use -qalign=power, 
> which is close to natural, as explained at:
>  
> https://www.ibm.com/support/knowledgecenter/SSGH2K_13.1.0/com.ibm.xlc131.aix.doc/compiler_ref/opt_align.html
> Why are you using this flag ?
>

Because otherwise the double type is aligned on 4 bytes.

> Thanks for info about *pgbench*. PostgreSQL web-site contains a lot of old 
> information...
>
> If you could*share scripts or instructions about the tests you are doing with 
> pgbench*, I would reproduce here.
>

You do not need any script.
Just two simple commands.
One to initialize database:

pgbench -i -s 1000

And another to run benchmark itself:

pgbench -c 100 -j 20 -P 1 -T 10


> I have no "real" application. My job consists in porting OpenSource packages 
> on AIX. Many packages. Erlang, Go, these days. I just want to make PostgreSQL 
> RPMs as good as possible... within the limited amount of time I can give to 
> this package, before
> moving to another one.
>
> About the *zombie* issue, I've discussed with my colleagues. Looks like the 
> process keeps zombie till the father looks at its status. However, though I 
> did that several times, I  do not remember well the details. And that should 
> be not specific to AIX.
> I'll discuss with another colleague, tomorrow, who should understand this 
> better than me.
>

1. The process is not in zombie state (according to ps). It is in  state... 
It is something AIX-specific; I have not seen processes in this state on Linux.
2. I have implemented a simple test - a forkbomb. It creates 1000 children and then 
waits for them. It is about ten times slower than on Intel/Linux, but still much 
faster than 100 seconds. So there is some difference between a postgres backend 
and a dummy process
doing nothing - just terminating immediately after returning from fork()
>
> *Patch for Large Files*: When building PostgreSQL, I found required to use 
> the following patch so that PostgreSQL works with large files. I do not 
> remember the details. Do you agree with such a patch ? 1rst version (new-...) 
> shows the exact places where
>   define _LARGE_FILES 1  is required.  2nd version (new2-...) is simpler.
>
> I'm now experimenting with your patch for dead lock. However, that should be 
> invisible with the  "check-world" tests I guess.
>
> Regards,
>
> Tony
>
>
> Le 01/02/2017 à 16:59, Konstantin Knizhnik a écrit :
>> Hi Tony,
>>
>> On 01.02.2017 18:42, REIX, Tony wrote:
>>>
>>> Hi Konstantin
>>>
>>> *XLC.*
>>>
>>> I'm on AIX 7.1 for now.
>>>
>>> I'm using this version of *XL**C v13*:
>>>
>>> # xlc -qversion
>>> IBM XL C/C++ for AIX, V13.1.3 (5725-C72, 5765-J07)
>>> Version: 13.01.0003.0003
>>>
>>> With this version, I have (at least, since I tested with "check" and not 
>>> "check-world" at that time) 2 failing tests: create_aggregate , aggregates .
>>>
>>>
>>> With the following *XLC v12* version, I have NO test failure:
>>>
>>> # /usr/vac/bin/xlc -qversion
>>> IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)
>>> Version: 12.01..0016
>>>
>>>
>>> So maybe you are not using XLC v13.1.3.3, rather another sub-version. 
>>> Unless you are using more options for the configure ?
>>>
>>>
>>> *Configure*.
>>>
>>> What are the options that you give to the configure ?
>>>
>>>
>> export CC="/opt/IBM/xlc/13.1.3/bin/xlc"
>> export CFLAGS="-qarch=pwr8 -qtune=pwr8 -O2 -qalign=natural -q64 "
>> export LDFLAGS="-Wl,-bbigtoc,-b64"
>> export AR="/usr/bin/ar -X64"
>> export LD="/usr/bin/ld -b64 "
>> export NM="/usr/bin/nm -X64"
>> ./configure --prefix="/opt/postgresql/xlc-debug/9.6"
>>
>>
>>> *Hard load & 64 cores ?* OK. That clearly explains why I do not see this 
>>> issue.
>>>
>>>
>>> *pgbench ?* I wanted to run it. However, I'm still looking where to get it 
>>> plus a guide for using it for testing.
>>>
>>
>> pgbench is part of Postgres distributive (src/bin/pgbench)
>>
>>
>>> I would add such tests when building my PostgreSQL RPMs on AIX. So any

Re: [HACKERS] logical decoding of two-phase transactions

2017-02-01 Thread Konstantin Knizhnik
On 02/01/2017 10:32 PM, Tom Lane wrote:
> Robert Haas <robertmh...@gmail.com> writes:
>> Also, including the GID in the WAL for each COMMIT/ABORT PREPARED
>> doesn't seem inordinately expensive to me.
> I'm confused ... isn't it there already?  If not, how do we handle
> reconstructing 2PC state from WAL at all?
>
>   regards, tom lane
>
>
Right now logical decoding ignores prepare and takes into account only "commit 
prepared":

/*
 * Currently decoding ignores PREPARE TRANSACTION and will just
 * decode the transaction when the COMMIT PREPARED is sent or
 * throw away the transaction's contents when a ROLLBACK PREPARED
 * is received. In the future we could add code to expose prepared
 * transactions in the changestream allowing for a kind of
 * distributed 2PC.
 */

For some scenarios it works well, but if we really need the prepared state at 
the replica (as in the case of multimaster), then it is not enough.
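
For reference, the 2PC statements in question are plain SQL (table t here is just a 
placeholder); with the current behaviour the changes only reach the logical decoding 
output plugin at COMMIT PREPARED time:

begin;
insert into t values (1);        -- changes accumulate in the reorder buffer
prepare transaction 'tx1';       -- currently not decoded
commit prepared 'tx1';           -- only now is the transaction decoded and sent downstream
-- (rollback prepared 'tx1' would instead discard the accumulated changes)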

-- 
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Deadlock in XLogInsert at AIX

2017-02-01 Thread Konstantin Knizhnik



On 01.02.2017 15:39, Heikki Linnakangas wrote:


In summary, I came up with the attached. It's essentially your patch, 
with tweaks for the above-mentioned things. I don't have a powerpc 
system to test on, so there are probably some silly typos there.




Attached please find a fixed version of your patch.
I verified that it applies correctly, builds, and postgres works 
normally with it.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/include/port/atomics/arch-ppc.h b/src/include/port/atomics/arch-ppc.h
index ed1cd9d1b9..7cf8c8ef97 100644
--- a/src/include/port/atomics/arch-ppc.h
+++ b/src/include/port/atomics/arch-ppc.h
@@ -23,4 +23,11 @@
 #define pg_memory_barrier_impl()	__asm__ __volatile__ ("sync" : : : "memory")
 #define pg_read_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
 #define pg_write_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
+
+#elif defined(__IBMC__) || defined(__IBMCPP__)
+
+#define pg_memory_barrier_impl()__sync()
+#define pg_read_barrier_impl()  __lwsync()
+#define pg_write_barrier_impl() __lwsync()
+
 #endif
diff --git a/src/include/port/atomics/generic-xlc.h b/src/include/port/atomics/generic-xlc.h
index f854612d39..e1dd3310a5 100644
--- a/src/include/port/atomics/generic-xlc.h
+++ b/src/include/port/atomics/generic-xlc.h
@@ -48,7 +48,7 @@ pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
 	 * consistency only, do not use it here.  GCC atomics observe the same
 	 * restriction; see its rs6000_pre_atomic_barrier().
 	 */
-	__asm__ __volatile__ ("	sync \n" ::: "memory");
+	__sync();
 
 	/*
 	 * XXX: __compare_and_swap is defined to take signed parameters, but that
@@ -73,11 +73,19 @@ pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
 static inline uint32
 pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 {
+	uint32		ret;
+
 	/*
-	 * __fetch_and_add() emits a leading "sync" and trailing "isync", thereby
-	 * providing sequential consistency.  This is undocumented.
+	 * Use __sync() before and __isync() after, like in compare-exchange
+	 * above.
 	 */
-	return __fetch_and_add((volatile int *) &ptr->value, add_);
+	__sync();
+
+	ret = __fetch_and_add((volatile int *) &ptr->value, add_);
+
+	__isync();
+
+	return ret;
 }
 
 #ifdef PG_HAVE_ATOMIC_U64_SUPPORT
@@ -89,7 +97,7 @@ pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
 {
 	bool		ret;
 
-	__asm__ __volatile__ ("	sync \n" ::: "memory");
+	__sync();
 
 	ret = __compare_and_swaplp((volatile long*) &ptr->value,
 			   (long *)expected, (long)newval);
@@ -103,7 +111,15 @@ pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
 static inline uint64
 pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
 {
-	return __fetch_and_addlp((volatile long *) &ptr->value, add_);
+	uint64		ret;
+
+	__sync();
+
+	ret = __fetch_and_addlp((volatile long *) &ptr->value, add_);
+
+	__isync();
+
+	return ret;
 }
 
 #endif /* PG_HAVE_ATOMIC_U64_SUPPORT */
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 7aad2de..c6ef114 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -832,9 +831,8 @@ typedef unsigned int slock_t;
 #include <sys/atomic_op.h>
 
 typedef int slock_t;
-
-#define TAS(lock)			_check_lock((slock_t *) (lock), 0, 1)
-#define S_UNLOCK(lock)		_clear_lock((slock_t *) (lock), 0)
+#define TAS(lock)			__check_lock_mp((slock_t *) (lock), 0, 1)
+#define S_UNLOCK(lock)		__clear_lock_mp((slock_t *) (lock), 0)
 #endif	 /* _AIX */



Re: [HACKERS] Deadlock in XLogInsert at AIX

2017-02-01 Thread Konstantin Knizhnik

Hi Tony,

On 01.02.2017 18:42, REIX, Tony wrote:


Hi Konstantin

*XLC.*

I'm on AIX 7.1 for now.

I'm using this version of *XL**C v13*:

# xlc -qversion
IBM XL C/C++ for AIX, V13.1.3 (5725-C72, 5765-J07)
Version: 13.01.0003.0003

With this version, I have (at least, since I tested with "check" and 
not "check-world" at that time) 2 failing tests: create_aggregate , 
aggregates .



With the following *XLC v12* version, I have NO test failure:

# /usr/vac/bin/xlc -qversion
IBM XL C/C++ for AIX, V12.1 (5765-J02, 5725-C72)
Version: 12.01..0016


So maybe you are not using XLC v13.1.3.3, rather another sub-version. 
Unless you are using more options for the configure ?



*Configure*.

What are the options that you give to the configure ?



export CC="/opt/IBM/xlc/13.1.3/bin/xlc"
export CFLAGS="-qarch=pwr8 -qtune=pwr8 -O2 -qalign=natural -q64 "
export LDFLAGS="-Wl,-bbigtoc,-b64"
export AR="/usr/bin/ar -X64"
export LD="/usr/bin/ld -b64 "
export NM="/usr/bin/nm -X64"
./configure --prefix="/opt/postgresql/xlc-debug/9.6"


*Hard load & 64 cores ?* OK. That clearly explains why I do not see 
this issue.



*pgbench ?* I wanted to run it. However, I'm still looking where to 
get it plus a guide for using it for testing.




pgbench is part of the Postgres distribution (src/bin/pgbench)


I would add such tests when building my PostgreSQL RPMs on AIX. So any 
help is welcome !



*Performance*.

- Also, I'd like to compare PostgreSQL performance on AIX vs 
Linux/PPC64. Any idea how I should proceed ? Any PostgreSQL 
performance benchmark that I could find and use ? pgbench ?


pgbench is the most widely used tool simulating an OLTP workload. Certainly it 
is quite primitive and its results are rather artificial. TPC-C seems to 
be a better choice.
But the best approach is to implement your own benchmark simulating the actual 
workload of your real application.


- I'm interested in any information for improving the performance & 
quality of my PostgreSQM RPMs on AIX./(As I already said, BullFreeware 
RPMs for AIX are free and can be used by anyone, like Perzl RPMs 
are//.My compa//ny (ATOS/Bull) sells IBM Power machines under the 
Escala brand s//ince ages (25 years this year)//)/.



*How to help ?*

How could I help for improving the quality and performance of 
PostgreSQL on AIX ?




We still have one open issue on AIX: see 
https://www.mail-archive.com/pgsql-hackers@postgresql.org/msg303094.html

It would be great if you can somehow help to fix this problem.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Deadlock in XLogInsert at AIX

2017-02-01 Thread Konstantin Knizhnik

Hi,

We are using the 13.1.3 version of XLC. All tests pass.
Please notice that this is a synchronization bug which can be reproduced 
only under hard load.
Our server has 64 cores and it is necessary to run pgbench with 100 
connections during several minutes to reproduce the problem.

So maybe you just didn't notice it ;)



On 01.02.2017 16:29, REIX, Tony wrote:


Hi,

I'm now working on the port of PostgreSQL on AIX.
(RPMs can be found, as free OpenSource work, at http://bullfreeware.com/ ;

http://bullfreeware.com/search.php?package=postgresql)

I was not aware of any issue with XLC v12 on AIX for atomic operations.
(XLC v13 generates at least 2 tests failures)

For now, with version 9.6.1, all tests "check-world", plus 
numeric_bigtest, are OK, in both 32 & 64bit versions.


Am I missing something ?

I configure the build of PostgreSQL with (in 64bits):

 ./configure
--prefix=/opt/freeware
--libdir=/opt/freeware/lib64
--mandir=/opt/freeware/man
--with-perl
--with-tcl
--with-tclconfig=/opt/freeware/lib
--with-python
--with-ldap
--with-openssl
--with-libxml
--with-libxslt
--enable-nls
--enable-thread-safety
--sysconfdir=/etc/sysconfig/postgresql

Am I missing some option for more optimization on AIX ?

Thanks

Regards,

Tony


Le 01/02/2017 à 12:07, Konstantin Knizhnik a écrit :

Attached please find my patch for XLC/AIX.
The most critical fix is adding __sync to pg_atomic_fetch_add_u32_impl.
The comment in this file says that:

  * __fetch_and_add() emits a leading "sync" and trailing "isync",
thereby
  * providing sequential consistency.  This is undocumented.

But it is not true any more (I checked generated assembler code in
debugger).
This is why I have added __sync() to this function. Now pgbench working
normally.

Also there is mysterious disappearance of assembler section function
with sync instruction from pg_atomic_compare_exchange_u32_impl.
I have fixed it by using __sync() built-in function instead.


Thanks to everybody who helped me to locate and fix this problem.

--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company









--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Deadlock in XLogInsert at AIX

2017-02-01 Thread Konstantin Knizhnik

On 01.02.2017 15:39, Heikki Linnakangas wrote:

On 02/01/2017 01:07 PM, Konstantin Knizhnik wrote:

Attached please find my patch for XLC/AIX.
The most critical fix is adding __sync to pg_atomic_fetch_add_u32_impl.
The comment in this file says that:

   * __fetch_and_add() emits a leading "sync" and trailing "isync",
thereby
   * providing sequential consistency.  This is undocumented.

But it is not true any more (I checked generated assembler code in
debugger).
This is why I have added __sync() to this function. Now pgbench working
normally.


Seems like it was not so much undocumented, but an implementation 
detail that was not guaranteed after all..


Does __fetch_and_add emit a trailing isync there either? Seems odd if 
__compare_and_swap requires it, but __fetch_and_add does not. Unless 
we can find conclusive documentation on that, I think we should assume 
that an __isync() is required, too.


There was a long thread on these things the last time this was 
changed: 
https://www.postgresql.org/message-id/20160425185204.jrvlghn3jxulsb7i%40alap3.anarazel.de. 
I couldn't find an explanation there of why we thought that 
fetch_and_add implicitly performs sync and isync.



Also there is mysterious disappearance of assembler section function
with sync instruction from pg_atomic_compare_exchange_u32_impl.
I have fixed it by using __sync() built-in function instead.


__sync() seems more appropriate there, anyway. We're using intrinsics 
for all the other things in generic-xlc.h. But it sure is scary that 
the "asm" sections just disappeared.


In arch-ppc.h, shouldn't we have #ifdef __IBMC__ guards for the 
__sync() and __lwsync() intrinsics? Those are an xlc compiler-specific 
thing, right? Or if they are expected to work on any ppc compiler, 
then we should probably use them always, instead of the asm sections.


In summary, I came up with the attached. It's essentially your patch, 
with tweaks for the above-mentioned things. I don't have a powerpc 
system to test on, so there are probably some silly typos there.


Why do you prefer to use _check_lock instead of __check_lock_mp?
The first one is not even mentioned in the XLC compiler manual:
http://www-01.ibm.com/support/docview.wss?uid=swg27046906=7
or
http://scv.bu.edu/computation/bluegene/IBMdocs/compiler/xlc-8.0/html/compiler/ref/bif_sync.htm



- Heikki





--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Deadlock in XLogInsert at AIX

2017-02-01 Thread Konstantin Knizhnik

Attached please find my patch for XLC/AIX.
The most critical fix is adding __sync to pg_atomic_fetch_add_u32_impl.
The comment in this file says that:

  * __fetch_and_add() emits a leading "sync" and trailing "isync", 
thereby

  * providing sequential consistency.  This is undocumented.

But it is not true any more (I checked the generated assembler code in the 
debugger).
This is why I have added __sync() to this function. Now pgbench works 
normally.


Also there is a mysterious disappearance of the assembler section 
with the sync instruction from the pg_atomic_compare_exchange_u32_impl function.

I have fixed it by using the __sync() built-in function instead.


Thanks to everybody who helped me to locate and fix this problem.

--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/include/port/atomics/arch-ppc.h b/src/include/port/atomics/arch-ppc.h
index 2b54c42..5828f7e 100644
--- a/src/include/port/atomics/arch-ppc.h
+++ b/src/include/port/atomics/arch-ppc.h
@@ -23,4 +23,10 @@
 #define pg_memory_barrier_impl()	__asm__ __volatile__ ("sync" : : : "memory")
 #define pg_read_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
 #define pg_write_barrier_impl()		__asm__ __volatile__ ("lwsync" : : : "memory")
+
+#else
+#define pg_memory_barrier_impl()__sync()
+#define pg_read_barrier_impl()  __lwsync()
+#define pg_write_barrier_impl() __lwsync()
+
 #endif
diff --git a/src/include/port/atomics/generic-xlc.h b/src/include/port/atomics/generic-xlc.h
index f4fd2f3..531d17c 100644
--- a/src/include/port/atomics/generic-xlc.h
+++ b/src/include/port/atomics/generic-xlc.h
@@ -36,7 +36,8 @@ typedef struct pg_atomic_uint64
 #endif /* __64BIT__ */
 
 #define PG_HAVE_ATOMIC_COMPARE_EXCHANGE_U32
 static inline bool
 pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
 	uint32 *expected, uint32 newval)
 {
@@ -48,14 +49,14 @@ pg_atomic_compare_exchange_u32_impl(volatile pg_atomic_uint32 *ptr,
 	 * consistency only, do not use it here.  GCC atomics observe the same
 	 * restriction; see its rs6000_pre_atomic_barrier().
 	 */
-	__asm__ __volatile__ ("	sync \n" ::: "memory");
+	__sync();
 
 	/*
 	 * XXX: __compare_and_swap is defined to take signed parameters, but that
 	 * shouldn't matter since we don't perform any arithmetic operations.
 	 */
 	ret = __compare_and_swap((volatile int*) &ptr->value,
@@ -77,6 +78,7 @@ pg_atomic_fetch_add_u32_impl(volatile pg_atomic_uint32 *ptr, int32 add_)
 	 * __fetch_and_add() emits a leading "sync" and trailing "isync", thereby
 	 * providing sequential consistency.  This is undocumented.
 	 */
+	__sync();
 	return __fetch_and_add((volatile int *) &ptr->value, add_);
 }
 
@@ -89,10 +91,10 @@ pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
 {
 	bool		ret;
 
-	__asm__ __volatile__ ("	sync \n" ::: "memory");
+	__sync();
 
 	ret = __compare_and_swaplp((volatile long*) &ptr->value,
			   (long *)expected, (long)newval);
 
 	__isync();
 
@@ -103,7 +105,8 @@ pg_atomic_compare_exchange_u64_impl(volatile pg_atomic_uint64 *ptr,
 static inline uint64
 pg_atomic_fetch_add_u64_impl(volatile pg_atomic_uint64 *ptr, int64 add_)
 {
-	return __fetch_and_addlp((volatile long *) &ptr->value, add_);
+__sync();
+return __fetch_and_addlp((volatile long *) &ptr->value, add_);
 }
 
 #endif /* PG_HAVE_ATOMIC_U64_SUPPORT */
diff --git a/src/include/storage/s_lock.h b/src/include/storage/s_lock.h
index 7aad2de..c6ef114 100644
--- a/src/include/storage/s_lock.h
+++ b/src/include/storage/s_lock.h
@@ -832,9 +831,8 @@ typedef unsigned int slock_t;
 #include <sys/atomic_op.h>
 
 typedef int slock_t;
-
-#define TAS(lock)			_check_lock((slock_t *) (lock), 0, 1)
-#define S_UNLOCK(lock)		_clear_lock((slock_t *) (lock), 0)
+#define TAS(lock)			__check_lock_mp((slock_t *) (lock), 0, 1)
+#define S_UNLOCK(lock)		__clear_lock_mp((slock_t *) (lock), 0)
 #endif	 /* _AIX */
 
 



Re: [HACKERS] Deadlock in XLogInsert at AIX

2017-01-31 Thread Konstantin Knizhnik
ction begins immediately after __compare_and_swap().
 */
__isync();

return ret;
}

and if I compile this function standalone, I get the following assembler 
code:


.pg_atomic_compare_exchange_u32_impl:   # 0x (H.4.NO_SYMBOL)
    stdu    SP,-128(SP)
    std     r3,176(SP)
    std     r4,184(SP)
    std     r5,192(SP)
    ld      r0,192(SP)
    stw     r0,192(SP)
    sync
    ld      r4,176(SP)
    ld      r3,184(SP)
    lwz     r0,192(SP)
    extsw   r0,r0
    lwa     r5,0(r3)
__L30:                                  # 0x0030 (H.4.NO_SYMBOL+0x030)
    lwarx   r6,r0,r4
    cmpl    0,0,r5,r6
    bc      BO_IF_NOT,CR0_EQ,__L44
    stwcx.  r0,r0,r4
    .machine "any"
    bc      BO_IF_NOT_3,CR0_EQ,__L30
__L44:                                  # 0x0044 (H.4.NO_SYMBOL+0x044)
    ori     r0,r6,0x
    stw     r0,0(r3)
    mfcr    r0
    rlwinm  r0,r0,3,31,31
    rldicl  r0,r0,0,56
    stb     r0,112(SP)
    isync
    lbz     r3,112(SP)
    addi    SP,SP,128
    bclr    BO_ALWAYS,CR0_LT

sync is here!


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Deadlock in XLogInsert at AIX

2017-01-31 Thread Konstantin Knizhnik


On 30.01.2017 19:21, Heikki Linnakangas wrote:

On 01/24/2017 04:47 PM, Konstantin Knizhnik wrote:
Interesting.. What should happen here is that for the backend's own 
insertion slot, the "insertingat" value should be greater than the 
requested flush point ('upto' variable). That's because before 
GetXLogBuffer() calls AdvanceXLInsertBuffer(), it updates the 
backend's insertingat value, to the position that it wants to insert 
to. And AdvanceXLInsertBuffer() only calls 
WaitXLogInsertionsToFinish() with value smaller than what was passed 
as the 'upto' argument.



The comment to WaitXLogInsertionsToFinish says:

  * Note: When you are about to write out WAL, you must call this 
function
  * *before* acquiring WALWriteLock, to avoid deadlocks. This 
function might

  * need to wait for an insertion to finish (or at least advance to next
  * uninitialized page), and the inserter might need to evict an old WAL
buffer
  * to make room for a new one, which in turn requires WALWriteLock.

Which contradicts the observed stack trace.


Not AFAICS. In the stack trace you showed, the backend is not holding 
WALWriteLock. It would only acquire it after the 
WaitXLogInsertionsToFinish() call finished.





Hmmm, maybe I missed something.
I am not talking about WALBufMappingLock, which is acquired after returning
from WaitXLogInsertionsToFinish, but about the lock obtained by
WALInsertLockAcquire at line 946 in XLogInsertRecord.
It will be released at line 1021 by WALInsertLockRelease(), but
CopyXLogRecordToWAL is invoked while this lock is held.




This line in the stack trace is suspicious:


WaitXLogInsertionsToFinish(upto = 102067874328), line 1583 in "xlog.c"


AdvanceXLInsertBuffer() should only ever call 
WaitXLogInsertionsToFinish() with an xlog position that points to a 
page bounary, but that upto value points to the middle of a page.


Perhaps the value stored in the stack trace is not what the caller 
passed, but it was updated because it was past the 'reserveUpto' 
value? That would explain the "request to flush past end
of generated WAL" notices you saw in the log. Now, why would that 
happen, I have no idea.


If you can and want to provide me access to the system, I could have a 
look myself. I'd also like to see if the attached additional 
Assertions will fire.


I really do get this assertion failure:

ExceptionalCondition(conditionName = "!(OldPageRqstPtr <= upto || 
opportunistic)", errorType = "FailedAssertion", fileName = "xlog.c", 
lineNumber = 1917), line 54 in "assert.c"

(dbx) up
unnamed block in AdvanceXLInsertBuffer(upto = 147439056632, 
opportunistic = '\0'), line 1917 in "xlog.c"

(dbx) p OldPageRqstPtr
147439058944
(dbx) p upto
147439056632
(dbx) p opportunistic
'\0'

Also, in another run, I encountered yet another assertion failure:

ExceptionalCondition(conditionName = "!((((NewPageBeginPtr) / 8192) %
(XLogCtl->XLogCacheBlck + 1)) == nextidx)", errorType =
"FailedAssertion", fileName = "xlog.c", lineNumber = 1950), line 54 in
"assert.c"


nextidx equals 1456, while the expected value is 1457.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] logical decoding of two-phase transactions

2017-01-31 Thread Konstantin Knizhnik



On 31.01.2017 09:29, Michael Paquier wrote:

On Fri, Jan 27, 2017 at 8:52 AM, Craig Ringer <cr...@2ndquadrant.com> wrote:

Now, if it's simpler to just xlog the gid at COMMIT PREPARED time when
wal_level >= logical I don't think that's the end of the world. But
since we already have almost everything we need in memory, why not
just stash the gid on ReorderBufferTXN?

I have been through this thread... And to be honest, I have a hard
time understanding for which purpose the information of a 2PC
transaction is useful in the case of logical decoding. The prepare and
commit prepared have been received by a node which is at the root of
the cluster tree, a node of the cluster at an upper level, or a
client, being in charge of issuing all the prepare queries, and then
issue the commit prepared to finish the transaction across a cluster.
In short, even if you do logical decoding from the root node, or the
one at a higher level, you would care just about the fact that it has
been committed.

Sorry, maybe I do not completely understand your arguments.
Actually our multimaster is now completely based on logical replication
and 2PC (more precisely, we are using 3PC now :)
The state of a transaction (prepared, precommitted, committed) has to be
persisted in WAL to make recovery possible.
Recovery can involve transactions in any state. So there are three kinds of
records in the WAL: PREPARE, PRECOMMIT and COMMIT_PREPARED, and
recovery can involve either all of them, or PRECOMMIT+COMMIT_PREPARED,
or just COMMIT_PREPARED.




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Deadlock in XLogInsert at AIX

2017-01-24 Thread Konstantin Knizhnik

More information about the problem: the Postgres log contains several records:

2017-01-24 19:15:20.272 MSK [19270462] LOG:  request to flush past end 
of generated WAL; request 6/AAEBE000, currpos 6/AAEBC2B0


and they correspond to the time when the deadlock happens.
There is the following comment in xlog.c concerning this message:

/*
 * No-one should request to flush a piece of WAL that hasn't even been
 * reserved yet. However, it can happen if there is a block with a bogus
 * LSN on disk, for example. XLogFlush checks for that situation and
 * complains, but only after the flush. Here we just assume that to mean
 * that all WAL that has been reserved needs to be finished. In this
 * corner-case, the return value can be smaller than 'upto' argument.
 */

So it looks like it should not happen.
The first thing to suspect is the spinlock implementation, which is different
for GCC and XLC.
But ... if I rebuild Postgres without spinlocks, the problem is still
reproduced.


On 24.01.2017 17:47, Konstantin Knizhnik wrote:

Hi Hackers,

We are running Postgres on AIX and encountered two strange problems:
active zombie processes and a deadlock in the XLOG writer.
The first problem I will explain in a separate mail; now I am mostly
concerned about the deadlock.
It is irregularly reproduced with standard pgbench launched with 100
connections.


It sometimes happens with the 9.6 stable version of Postgres, but only when
it is compiled with the xlc compiler.
We failed to reproduce the problem with GCC. So it looks like a bug
in the compiler or in the xlc-specific atomics implementation...

But there are a few points which contradict this hypothesis:

1. The problem is reproduced with Postgres built without optimization.
Usually compiler bugs affect only optimized code.

2. Disabling atomics doesn't help.
3. Without optimization and with LOCK_DEBUG defined, the time needed to
reproduce the problem increases significantly. With optimized code
it is almost always reproduced in a few minutes.

With the debug version it usually takes much more time.

But the most confusing thing is the stack trace:

(dbx) where
semop(??, ??, ??) at 0x91f5790
PGSemaphoreLock(sema = 0x0a0044b95928), line 387 in "pg_sema.c"
unnamed block in LWLockWaitForVar(lock = 0x0a00d980, valptr = 
0x0a00d9a8, oldval = 102067874256, newval = 
0x0fff9c10), line 1666 in "lwlock.c"
LWLockWaitForVar(lock = 0x0a00d980, valptr = 
0x0a00d9a8, oldval = 102067874256, newval = 
0x0fff9c10), line 1666 in "lwlock.c"
unnamed block in WaitXLogInsertionsToFinish(upto = 102067874328), line 
1583 in "xlog.c"

WaitXLogInsertionsToFinish(upto = 102067874328), line 1583 in "xlog.c"
AdvanceXLInsertBuffer(upto = 102067874256, opportunistic = '\0'), line 
1916 in "xlog.c"

unnamed block in GetXLogBuffer(ptr = 102067874256), line 1697 in "xlog.c"
GetXLogBuffer(ptr = 102067874256), line 1697 in "xlog.c"
CopyXLogRecordToWAL(write_len = 70, isLogSwitch = '\0', rdata = 
0x00011007ce10, StartPos = 102067874256, EndPos = 102067874328), 
line 1279 in "xlog.c"
XLogInsertRecord(rdata = 0x00011007ce10, fpw_lsn = 102067718328), 
line 1011 in "xlog.c"
unnamed block in XLogInsert(rmid = '\n', info = '@'), line 453 in 
"xloginsert.c"

XLogInsert(rmid = '\n', info = '@'), line 453 in "xloginsert.c"
log_heap_update(reln = 0x000110273540, oldbuf = 40544, newbuf = 
40544, oldtup = 0x0fffa2a0, newtup = 0x0001102bb958, 
old_key_tuple = (nil), all_visible_cleared = '\0', 
new_all_visible_cleared = '\0'), line 7708 in "heapam.c"
unnamed block in heap_update(relation = 0x000110273540, otid = 
0x0fffa6f8, newtup = 0x0001102bb958, cid = 1, crosscheck = 
(nil), wait = '^A', hufd = 0x0fffa5b0, lockmode = 
0x0fffa5c8), line 4212 in "heapam.c"
heap_update(relation = 0x000110273540, otid = 0x0fffa6f8, 
newtup = 0x0001102bb958, cid = 1, crosscheck = (nil), wait = '^A', 
hufd = 0x0fffa5b0, lockmode = 0x0fffa5c8), line 4212 
in "heapam.c"
unnamed block in ExecUpdate(tupleid = 0x0fffa6f8, oldtuple = 
(nil), slot = 0x0001102bb308, planSlot = 0x0001102b4630, 
epqstate = 0x0001102b2cd8, estate = 0x0001102b29e0, canSetTag 
= '^A'), line 937 in "nodeModifyTable.c"
ExecUpdate(tupleid = 0x0fffa6f8, oldtuple = (nil), slot = 
0x0001102bb308, planSlot = 0x0001102b4630, epqstate = 
0x0001102b2cd8, estate = 0x0001102b29e0, canSetTag = '^A'), 
line 937 in "nodeModifyTable.c"
ExecModifyTable(node = 0x0001102b2c30), line 1516 in 
"nodeModifyTable.c"

ExecProcNode(node = 0x0001102b2c30), line 396 in "execProcnode.c"
ExecutePlan(estate = 0x0001102b29e0, planstate = 
0x0001102b2c30, use_parallel_mode = '\0', operation = CMD_UPDATE,

Re: [HACKERS] Active zombies at AIX

2017-01-24 Thread Konstantin Knizhnik



On 24.01.2017 18:26, Tom Lane wrote:

Konstantin Knizhnik <k.knizh...@postgrespro.ru> writes:

But ps shows that the status of the process is:
[14:46:02]root@postgres:~ # ps -elk | grep 25691556
   * A - 25691556 - - - - - 

As far as I could find by googling, this means that the process is
not actually a zombie yet, so it's not the postmaster's fault.

Apparently it's possible in some versions of AIX for an exiting process to
get stuck while releasing its reference to a socket, though I couldn't
find much detail about that.  I wonder how old your AIX is ...


It is AIX 7.1 (I expect that it is the most recent version of AIX).




regards, tom lane


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Active zombies at AIX

2017-01-24 Thread Konstantin Knizhnik

Hi hackers,

Yet another story about AIX. For some reason AIX is very slow in cleaning up
zombie processes.
If we launch pgbench with the -C parameter, then very soon the limit on the
maximal number of connections is exhausted.
If the maximal number of connections is set to 1000, then after ten seconds
of pgbench activity we get about 900 zombie processes, and it takes about
100 seconds (!) before all of them are terminated.

proctree shows a lot of defunct processes:

[14:44:41]root@postgres:~ # proctree 26084446
26084446 /opt/postgresql/xlc/9.6/bin/postgres -D /postg_fs/postgresql/xlc
4784362 
4980786 
11403448 
11468930 
11993176 
12189710 
12517390 
13238374 
13565974 
13893826 postgres: wal writer process
14024716 
15401000 
...
25691556 

But ps shows that the status of the process is:

[14:46:02]root@postgres:~ # ps -elk | grep 25691556

 * A - 25691556 - - - - - 


A breakpoint set in the reaper() function in the postmaster shows that each
invocation of this function (called by the SIGCHLD handler) processes 5-10
PIDs per invocation.
So there are two hypotheses: either AIX is very slow in delivering
SIGCHLD to the parent, or process exit takes too much time.
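
For reference, the usual shape of such a reaper is a waitpid(WNOHANG) loop, so
a single SIGCHLD delivery can reap several children at once. A minimal generic
sketch of that pattern (not the actual postmaster code, just an illustration):

#include <sys/types.h>
#include <sys/wait.h>

/* Generic sketch of a SIGCHLD handler that reaps every already-exited child.
 * One signal delivery can stand for many exits, hence the WNOHANG loop. */
static void
reap_children(int signo)
{
    int         status;
    pid_t       pid;

    (void) signo;
    while ((pid = waitpid(-1, &status, WNOHANG)) > 0)
    {
        /* per-child cleanup would go here (hypothetical helper) */
    }
}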


The fact that the backends are in exiting state makes the second hypothesis
more plausible.
We have tried different Postgres configurations with local and TCP
sockets, with different amounts of shared buffers, and built both with gcc
and xlc.

In all cases the behavior is similar: zombies do not want to die.
Since it is not possible to attach a debugger to a defunct process, it
is not clear how to figure out what's going on.


I wonder if somebody has encountered similar problems on AIX and can perhaps
suggest a solution to this problem.

Thanks in advance

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



[HACKERS] Deadlock in XLogInsert at AIX

2017-01-24 Thread Konstantin Knizhnik
;), line 1303 in "pquery.c"
PortalRunMulti(portal = 0x000110115e20, isTopLevel = '^A', 
setHoldSnapshot = '\0', dest = 0x0001102b7520, altdest = 
0x0001102b7520, completionTag = ""), line 1303 in "pquery.c"
unnamed block in PortalRun(portal = 0x000110115e20, count = 
9223372036854775807, isTopLevel = '^A', dest = 0x0001102b7520, 
altdest = 0x0001102b7520, completionTag = ""), line 815 in "pquery.c"
PortalRun(portal = 0x000110115e20, count = 9223372036854775807, 
isTopLevel = '^A', dest = 0x0001102b7520, altdest = 
0x0001102b7520, completionTag = ""), line 815 in "pquery.c"
unnamed block in exec_simple_query(query_string = "UPDATE 
pgbench_tellers SET tbalance = tbalance + 4019 WHERE tid = 6409;"), line 
1094 in "postgres.c"
exec_simple_query(query_string = "UPDATE pgbench_tellers SET tbalance = 
tbalance + 4019 WHERE tid = 6409;"), line 1094 in "postgres.c"
unnamed block in PostgresMain(argc = 1, argv = 0x000110119f68, 
dbname = "postgres", username = "postgres"), line 4076 in "postgres.c"
PostgresMain(argc = 1, argv = 0x000110119f68, dbname = "postgres", 
username = "postgres"), line 4076 in "postgres.c"

BackendRun(port = 0x000110114290), line 4279 in "postmaster.c"
BackendStartup(port = 0x000110114290), line 3953 in "postmaster.c"
unnamed block in ServerLoop(), line 1701 in "postmaster.c"
unnamed block in ServerLoop(), line 1701 in "postmaster.c"
unnamed block in ServerLoop(), line 1701 in "postmaster.c"
ServerLoop(), line 1701 in "postmaster.c"
PostmasterMain(argc = 3, argv = 0x0001100c6190), line 1309 in 
"postmaster.c"

main(argc = 3, argv = 0x0001100c6190), line 228 in "main.c"


As I already mentioned, we built Postgres with LOCK_DEBUG, so we can
inspect the lock owner. The backend is waiting for itself!

Now please look at the two frames in this stack trace marked in red.
XLogInsertRecord acquires WAL insert locks at the beginning of the
function:


	if (isLogSwitch)
		WALInsertLockAcquireExclusive();
	else
		WALInsertLockAcquire();

WALInsertLockAcquire just selects a random item from the WALInsertLocks array
and locks it exclusively:


	if (lockToTry == -1)
		lockToTry = MyProc->pgprocno % NUM_XLOGINSERT_LOCKS;
	MyLockNo = lockToTry;
	immed = LWLockAcquire(&WALInsertLocks[MyLockNo].l.lock, LW_EXCLUSIVE);

Then, following the stack trace, AdvanceXLInsertBuffer calls 
WaitXLogInsertionsToFinish:


			/*
			 * Now that we have an up-to-date LogwrtResult value, see if we
			 * still need to write it or if someone else already did.
			 */
			if (LogwrtResult.Write < OldPageRqstPtr)
			{
				/*
				 * Must acquire write lock. Release WALBufMappingLock first,
				 * to make sure that all insertions that we need to wait for
				 * can finish (up to this same position). Otherwise we risk
				 * deadlock.
				 */
				LWLockRelease(WALBufMappingLock);

				WaitXLogInsertionsToFinish(OldPageRqstPtr);

				LWLockAcquire(WALWriteLock, LW_EXCLUSIVE);


It releases WALBufMappingLock but not the WAL insert locks!
Finally, WaitXLogInsertionsToFinish tries to wait for all insert locks:

	for (i = 0; i < NUM_XLOGINSERT_LOCKS; i++)
	{
		XLogRecPtr	insertingat = InvalidXLogRecPtr;

		do
		{
			/*
			 * See if this insertion is in progress. LWLockWait will wait for
			 * the lock to be released, or for the 'value' to be set by a
			 * LWLockUpdateVar call.  When a lock is initially acquired, its
			 * value is 0 (InvalidXLogRecPtr), which means that we don't know
			 * where it's inserting yet.  We will have to wait for it.  If
			 * it's a small insertion, the record will most likely fit on the
			 * same page and the inserter will release the lock without ever
			 * calling LWLockUpdateVar.  But if it has to sleep, it will
			 * advertise the insertion point with LWLockUpdateVar before
			 * sleeping.
			 */
			if (LWLockWaitForVar(&WALInsertLocks[i].l.lock,
								 &WALInsertLocks[i].l.insertingAt,
								 insertingat, &insertingat))

And here we get stuck!
The comment on WaitXLogInsertionsToFinish says:

 * Note: When you are about to write out WAL, you must call this function
 * *before* acquiring WALWriteLock, to avoid deadlocks. This function might
 * need to wait for an insertion to finish (or at least advance to next
 * uninitialized page), and the inserter might need to evict an old WAL buffer
 * to make room for a new one, which in turn requires WALWriteLock.

Which contradicts the observed stack trace.

I wonder whether it is really a synchronization bug in xlog.c, or whether there
is something wrong with this stack trace and it cannot happen during normal
operation?


Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Unlogged tables cleanup

2016-11-09 Thread konstantin knizhnik

On Nov 10, 2016, at 10:17 AM, Michael Paquier wrote:

> 
> Hm.. I cannot reproduce what you see on Linux or macos. Perhaps you
> have locally a standby pointing as well to this tablespace?

No, it is the latest sources from the Postgres repository.
Please note that you should create a new database and tablespace to reproduce
this issue.
So actually the whole sequence is:

mkdir fs
initdb -D pgsql 
pg_ctl -D pgsql -l logfile start
psql postgres
# create tablespace fs location '/home/knizhnik/dtm-data/fs';
# set default_tablespace=fs;
# create unlogged table foo(x integer);
# insert into foo values(generate_series(1,10));
# ^D
pkill -9 postgres
pg_ctl -D pgsql -l logfile start
# select * from foo;


> -- 
> Michael



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Unlogged tables cleanup

2016-11-09 Thread Konstantin Knizhnik

Hi, hackers

I wonder if the following behavior can be considered a bug:

knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# create tablespace fs location '/home/knizhnik/dtm-data/fs';
CREATE TABLESPACE
postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo(x integer);
CREATE TABLE
postgres=# insert into foo values(generate_series(1,10));
INSERT 0 10



Now simulate server crash using using "pkill -9 postgres".

knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l 
logfile start

pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo;
ERROR:  could not open file 
"pg_tblspc/16384/PG_10_201611041/12289/16385": No such file or directory


knizhnik@knizhnik:~/dtm-data$ ls fs
PG_10_201611041
knizhnik@knizhnik:~/dtm-data$ ls fs/PG_10_201611041/


So the whole relation directory is removed!
It happens only for the first table created in the tablespace.
If you create the table in the Postgres data directory everything is OK: the
first segment of the relation is truncated but not deleted.
Also, if you create one more unlogged table in the tablespace, it is truncated
correctly:


postgres=# set default_tablespace=fs;
SET
postgres=# create unlogged table foo1(x integer);
CREATE TABLE
postgres=# insert into foo1 values(generate_series(1,10));
INSERT 0 10
postgres=# \q
knizhnik@knizhnik:~/dtm-data$ pkill -9 postgres
knizhnik@knizhnik:~/dtm-data$ rm -f logfile ; pg_ctl -D pgsql.master -l 
logfile start

pg_ctl: another server might be running; trying to start server anyway
server starting
knizhnik@knizhnik:~/dtm-data$ psql postgres
psql (10devel)
Type "help" for help.

postgres=# select * from foo1;
 x
---
(0 rows)

knizhnik@knizhnik:~/dtm-data$ ls -l fs/PG_10_201611041/12289/*
-rw--- 1 knizhnik knizhnik 0 Nov  9 19:52 fs/PG_10_201611041/12289/32768
-rw--- 1 knizhnik knizhnik 0 Nov  9 19:52 
fs/PG_10_201611041/12289/32768_init



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On conflict update & hint bits

2016-10-24 Thread Konstantin Knizhnik



On 24.10.2016 00:49, Peter Geoghegan wrote:

On Sun, Oct 23, 2016 at 2:46 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:

What's bothering me is (a) it's less than 24 hours to release wrap and
(b) this patch changes SSI-relevant behavior and hasn't been approved
by Kevin.  I'm not familiar enough with that logic to commit a change
in it on my own authority, especially with so little time for problems
to be uncovered.

I should point out that I knew that the next set of point releases had
been moved forward much later than you did. I had to work on this fix
during the week, which was pretty far from ideal for me for my own
reasons.

Just for information: I know that you are working on this issue, but since
we need to proceed further with our testing of multimaster,
I have made the following obvious changes and they fix the problem (at
least this assertion failure does not happen any more):


src/backend/executor/nodeModifyTable.c

@@ -1087,6 +1087,13 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	test = heap_lock_tuple(relation, &tuple, estate->es_output_cid,
 						   lockmode, LockWaitBlock, false, &buffer,
 						   &hufd);
+	/*
+	 * We must hold share lock on the buffer content while examining tuple
+	 * visibility.  Afterwards, however, the tuples we have found to be
+	 * visible are guaranteed good as long as we hold the buffer pin.
+	 */
+	LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
 	switch (test)
 	{
 		case HeapTupleMayBeUpdated:
@@ -1142,6 +1149,7 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 			 * loop here, as the new version of the row might not conflict
 			 * anymore, or the conflicting tuple has actually been deleted.
 			 */
+			LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
 			ReleaseBuffer(buffer);
 			return false;

@@ -1175,6 +1183,8 @@ ExecOnConflictUpdate(ModifyTableState *mtstate,
 	/* Store target's existing tuple in the state's dedicated slot */
 	ExecStoreTuple(&tuple, mtstate->mt_existing, buffer, false);

+	LockBuffer(buffer, BUFFER_LOCK_UNLOCK);
+
 	/*
 	 * Make tuple and any needed join variables available to ExecQual and
 	 * ExecProject.  The EXCLUDED tuple is installed in ecxt_innertuple, while


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] On conflict update & hint bits

2016-09-30 Thread Konstantin Knizhnik



On 30.09.2016 19:37, Peter Geoghegan wrote:

On Fri, Sep 30, 2016 at 5:33 PM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

Later we try to check tuple visibility:

 ExecCheckHeapTupleVisible(estate, &tuple, buffer);

and inside HeapTupleSatisfiesMVCC try to set hint bit.

So, you're using repeatable read or serializable isolation level?


Repeatable read.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] On conflict update & hint bits

2016-09-30 Thread Konstantin Knizhnik

Hi,

I am facing a rarely reproduced problem with our multimaster (it never happens
with vanilla Postgres).
We are using our own customized transaction manager, so it may well be
a problem in our multimaster.
But the stack trace looks suspicious, and this is why I want to consult
people familiar with this code about whether it is a bug in
ExecOnConflictUpdate or not.


Briefly: ExecOnConflictUpdate tries to set a hint bit without holding a lock
on the buffer and so gets an assertion failure in MarkBufferDirtyHint.


Now stack trace:

#0  0x7fe3b940acc9 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56

#1  0x7fe3b940e0d8 in __GI_abort () at abort.c:89
#2  0x0097b996 in ExceptionalCondition (conditionName=0xb4d970 
"!(LWLockHeldByMe(((LWLock*) (&(bufHdr)->content_lock", 
errorType=0xb4d2e9 "FailedAssertion",

fileName=0xb4d2e0 "bufmgr.c", lineNumber=3380) at assert.c:54
#3  0x007e365b in MarkBufferDirtyHint (buffer=946, buffer_std=1 
'\001') at bufmgr.c:3380
#4  0x009c3660 in SetHintBits (tuple=0x7fe396a9d858, buffer=946, 
infomask=256, xid=1398) at tqual.c:136
#5  0x009c5194 in HeapTupleSatisfiesMVCC (htup=0x7ffc00169030, 
snapshot=0x2e79778, buffer=946) at tqual.c:1065
#6  0x006ace83 in ExecCheckHeapTupleVisible (estate=0x2e81ae8, 
tuple=0x7ffc00169030, buffer=946) at nodeModifyTable.c:197
#7  0x006ae343 in ExecOnConflictUpdate (mtstate=0x2e81d50, 
resultRelInfo=0x2e81c38, conflictTid=0x7ffc001690c0, planSlot=0x2e82428, 
excludedSlot=0x2e82428, estate=0x2e81ae8,

canSetTag=1 '\001', returning=0x7ffc001690c8) at nodeModifyTable.c:1173
#8  0x006ad256 in ExecInsert (mtstate=0x2e81d50, slot=0x2e82428, 
planSlot=0x2e82428, arbiterIndexes=0x2e7eeb0, 
onconflict=ONCONFLICT_UPDATE, estate=0x2e81ae8, canSetTag=1 '\001')

at nodeModifyTable.c:395
#9  0x006aebe3 in ExecModifyTable (node=0x2e81d50) at 
nodeModifyTable.c:1496


In ExecOnConflictUpdate the buffer is pinned but not locked:

	/*
	 * Lock tuple for update.  Don't follow updates when tuple cannot be
	 * locked without doing so.  A row locking conflict here means our
	 * previous conclusion that the tuple is conclusively committed is not
	 * true anymore.
	 */
	tuple.t_self = *conflictTid;
	test = heap_lock_tuple(relation, &tuple, estate->es_output_cid,
						   lockmode, LockWaitBlock, false, &buffer,
						   &hufd);

heap_lock_tuple pins the buffer but does not lock it:
 *	*buffer: set to buffer holding tuple (pinned but not locked at exit)

Later we try to check tuple visibility:

	ExecCheckHeapTupleVisible(estate, &tuple, buffer);

and inside HeapTupleSatisfiesMVCC it tries to set a hint bit.

MarkBufferDirtyHint assumes that the buffer is locked:

 * 2. The caller might have only share-lock instead of exclusive-lock on the
 *	  buffer's content lock.

and we get assertion failure in

/* here, either share or exclusive lock is OK */
Assert(LWLockHeldByMe(BufferDescriptorGetContentLock(bufHdr)));

So the question is whether it is correct that ExecOnConflictUpdate tries
to access and update the tuple without holding a lock on the buffer?


Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Handling dropped attributes in pglogical_proto

2016-09-28 Thread Konstantin Knizhnik

Hi,

pglogical_read_tuple from pglogical_proto.c contains the following code:

natts = pq_getmsgint(in, 2);
if (rel->natts != natts)
elog(ERROR, "tuple natts mismatch, %u vs %u", rel->natts, 
natts);



But if the table was just altered and some attribute was removed from the
table, then rel->natts can be greater than natts.

The problem can be fixed by the following patch:

--- a/pglogical_proto.c
+++ b/pglogical_proto.c
@@ -263,10 +263,15 @@ pglogical_read_tuple(StringInfo in, PGLogicalRelation *rel,
 	{
 		int			attid = rel->attmap[i];
 		Form_pg_attribute att = desc->attrs[attid];
-		char		kind = pq_getmsgbyte(in);
+		char		kind;
 		const char *data;
 		int			len;
 
+		if (att->atttypid == InvalidOid) {
+			continue;
+		}
+		kind = pq_getmsgbyte(in);
+
 		switch (kind)
 		{
 			case 'n': /* null */


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] UPSERT strange behavior

2016-08-25 Thread Konstantin Knizhnik

On 08/25/2016 10:08 PM, Peter Geoghegan wrote:

On Thu, Aug 25, 2016 at 11:49 AM, Tom Lane <t...@sss.pgh.pa.us> wrote:

I think the point is that given the way he's set up the test case,
there should be no duplicate violation in the plain unique index
unless there is one in the arbiter index.  So assuming that INSERT
tests the arbiter indexes first, there shouldn't be an error.
Maybe it doesn't do that, but it seems like it would be a good idea
if it did.

Oh, yeah. This is arguably an example of inference failing to infer
multiple unique indexes as arbiters. Inference could, in principle,
recognize that the second unique index is equivalent to the first, but
doesn't. (I don't think that it matters which order anything is tested
in, though, because not finding a dup value in the arbiter index does
not guarantee that there won't be one in the other index. There is no
locking when no conflict is initially found, and so no guarantees
here.)

Anyway, I don't have a lot of sympathy for this point of view, because
the scenario is completely contrived. You have to draw the line
somewhere.


I do not think that this scenario is completely contrived: the case when a
table has more than one unique key is quite common.
For example, a "user" may have a unique e-mail address, phone number and login.
Also, as far as I know, this is not an artificial example, but a real case taken
from a customer's application...

I am not sure whether it's really a bug or a feature, but the user's intention
was obvious: locate the record by one of the unique keys and, if such a record
already exists, increment the counter (do an update instead of an insert).  But
there are also good arguments why upsert may report a conflict in this case...

If such UPSERT behavior is assumed to be correct, what is the best workaround
for the problem if we really need to have two separate indexes and want to
enforce the unique constraint for both keys?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Logical decoding restart problems

2016-08-20 Thread konstantin knizhnik
Thank you for the answers.

> No, you don't need to recreate them. Just advance your replication identifier 
> downstream and request a replay position in the future. Let the existing slot 
> skip over unwanted data and resume where you want to start replay.
> 
> You can advance the replication origins on the peers as you replay forwarded 
> xacts from your master.
> 
> Have a look at how the BDR code does this during "catchup mode" replay.
>  
> So while your problem discussed below seems concerning, you don't have to 
> drop and recreate slots like are currently doing. 

The only reason for recreating the slot is that I want to move it to the current 
"horizon" and skip all pending transactions without explicitly specifying the 
restart position.
If I do not drop the slot and just restart replication specifying position 0/0 
(invalid LSN), then replication will be continued from the current slot 
position in WAL, won't it?
So there is no way to say something like "start replication from the end of 
WAL", as lseek(0, SEEK_END) does for files.
Right now I am trying to overcome this limitation by explicitly calculating the 
position from which we should continue replication.
But unfortunately the problem with partly decoded transactions persists.
I will try next week to create an example reproducing the problem without any 
multimaster stuff, just using a standard logical decoding plugin.


> 
> To restart logical decoding we first drop existed slot, then create new one 
> and then start logical replication from the WAL position 0/0 (invalid LSN).
> In this case recovery should be started from the last consistent point.
> 
> How do you create the new slot? SQL interface? walsender interface? Direct C 
> calls?

The slot is created by the peer node using a standard libpq connection with 
"replication=database" in the connection string.

>  
> The problem is that for some reasons consistent point is not so consistent 
> and we get partly decoded transactions.
> I.e. transaction body consists of two UPDATE but reorder_buffer extracts only 
> the one (last) update and sent this truncated transaction to destination 
> causing consistency violation at replica.  I started investigation of logical 
> decoding code and found several things which I do not understand.
> 
> Yeah, that sounds concerning and shouldn't happen.

I looked at the replication code more closely and understood that my first 
concerns were wrong.
Confirming the flush position should not prevent replaying transactions with 
smaller LSNs.
But unfortunately the problem is really present. Maybe it is caused by race 
conditions (although most logical decoder data is local to the backend).
This is why I will try to create a reproducing scenario without multimaster.

> 
> Assume that we have transactions T1={start_lsn=100, end_lsn=400} and 
> T2={start_lsn=200, end_lsn=300}.
> Transaction T2 is sent to the replica and replica confirms that flush_lsn=300.
> If now we want to restart logical decoding, we can not start with position 
> less than 300, because CreateDecodingContext doesn't allow it:
> 
> 
> Right. You've already confirmed receipt of T2, so you can't receive it again.
>  
> So it means that we have no chance to restore T1?
> 
> Wrong. You can, because the slot's restart_lsn still be will be some LSN <= 
> 100. The slot keeps track of inprogress transactions (using xl_running_xacts 
> records) and knows it can't discard WAL past lsn 100 because xact T1 is still 
> in-progress, so it must be able to decode from the start of it.
> 
> When you create a decoding context decoding starts at restart_lsn not at 
> confirmed_flush_lsn. confirmed_flush_lsn is the limit at which commits start 
> resulting in decoded data being sent to you. So in your case, T1 commits at 
> lsn=400, which is >300, so you'll receive the whole xact for T1.

Yeah, but unfortunately it happens. I need to understand why...

>  
> It's all already there. See logical decoding's use of xl_running_xacts.


But how is this information persisted?
What will happen if the walsender is restarted?

> 
> -- 
>  Craig Ringer   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Logical decoding restart problems

2016-08-19 Thread Konstantin Knizhnik



On 19.08.2016 19:06, Petr Jelinek wrote:


I don't think this will work correctly, there will be gap between when 
the new slot starts to decode and the drop of the old one as the new 
slot first needs to make snapshot.


Do I understand correctly that you are not using replication origins?


No, we are using replication origins to prevent recursive replication.
The gap is not the biggest problem, because in multimaster we commit only 
those transactions which are acknowledged by all live nodes.
So if some transaction falls into the gap... it is unlucky and will be 
aborted.





The problem is that for some reasons consistent point is not so
consistent and we get partly decoded transactions.
I.e. transaction body consists of two UPDATE but reorder_buffer extracts
only the one (last) update and sent this truncated transaction to
destination causing consistency violation at replica.  I started
investigation of logical decoding code and found several things which I
do not understand.


Never seen this happen. Do you have more details about what exactly is 
happening?




This is transaction at primary node:

root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec 
-ti node1 bash
postgres@9a04a0c9f246:/pg$  pg_xlogdump 
data/pg_xlog/00010001  | fgrep "tx: 6899"
rmgr: Heaplen (rec/tot):  7/53, tx:   6899, lsn: 
0/019EFBD8, prev 0/019EFB80, desc: LOCK off 12: xid 6899 LOCK_ONLY 
EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 60
rmgr: Heaplen (rec/tot): 14/74, tx:   6899, lsn: 
0/019EFC10, prev 0/019EFBD8, desc: HOT_UPDATE off 12 xmax 6899 ; new off 
224 xmax 6899, blkref #0: rel 1663/12407/16421 blk 60
rmgr: Heaplen (rec/tot):  7/53, tx:   6899, lsn: 
0/019EFC60, prev 0/019EFC10, desc: LOCK off 153: xid 6899 LOCK_ONLY 
EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 49
rmgr: Heaplen (rec/tot): 14/82, tx:   6899, lsn: 
0/019EFC98, prev 0/019EFC60, desc: UPDATE off 153 xmax 6899 ; new off 55 
xmax 6899, blkref #0: rel 1663/12407/16421 blk 62, blkref #1: rel 
1663/12407/16421 blk 49
rmgr: Btree   len (rec/tot):  2/64, tx:   6899, lsn: 
0/019EFCF0, prev 0/019EFC98, desc: INSERT_LEAF off 294, blkref #0: rel 
1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot):236/   265, tx:   6899, lsn: 
0/019EFD30, prev 0/019EFCF0, desc: PREPARE
pg_xlogdump: FATAL:  error in WAL record at 0/1AC0E70: invalid record 
length at 0/1AC0EA8: wanted 24, got 0



This is the replicated transaction at the other node (it is not clear from 
the trace, but believe me, it is the same transaction):



root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec 
-ti node2 bash
postgres@e5b16d82ce06:/pg$ pg_xlogdump 
data/pg_xlog/00010001  | fgrep "tx:   6882"
rmgr: Heaplen (rec/tot): 14/74, tx:   6882, lsn: 
0/019F3240, prev 0/019F31F0, desc: HOT_UPDATE off 113 xmax 6882 ; new 
off 219 xmax 0, blkref #0: rel 1663/12407/16421 blk 53
rmgr: Heaplen (rec/tot): 14/82, tx:   6882, lsn: 
0/019F5CB8, prev 0/019F5C60, desc: UPDATE off 163 xmax 6882 ; new off 4 
xmax 0, blkref #0: rel 1663/12407/16421 blk 62, blkref #1: rel 
1663/12407/16421 blk 51
rmgr: Btree   len (rec/tot):  2/64, tx:   6882, lsn: 
0/019F5D10, prev 0/019F5CB8, desc: INSERT_LEAF off 284, blkref #0: rel 
1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot):248/   274, tx:   6882, lsn: 
0/019F61F8, prev 0/019F60E8, desc: PREPARE
pg_xlogdump: FATAL:  error in WAL record at 0/1AD3AD8: invalid record 
length at 0/1AD3B10: wanted 24, got 0



And "shorten" version of the same transaction at the third (recovered) node:

root@knizhnik:/home/knizhnik/postgres_cluster/contrib/mmts# docker exec 
-ti node3 bash
postgres@b4066d4211bc:/pg$  pg_xlogdump 
data/pg_xlog/00010001  | fgrep "tx: 6893"
rmgr: Heaplen (rec/tot):  7/53, tx:   6893, lsn: 
0/01A29828, prev 0/01A297E0, desc: LOCK off 172: xid 6893 LOCK_ONLY 
EXCL_LOCK , blkref #0: rel 1663/12407/16421 blk 50
rmgr: Heaplen (rec/tot): 14/82, tx:   6893, lsn: 
0/01A29860, prev 0/01A29828, desc: UPDATE+INIT off 172 xmax 6893 ; new 
off 1 xmax 6893, blkref #0: rel 1663/12407/16421 blk 64, blkref #1: rel 
1663/12407/16421 blk 50
rmgr: Btree   len (rec/tot):  2/64, tx:   6893, lsn: 
0/01A298B8, prev 0/01A29860, desc: INSERT_LEAF off 314, blkref #0: rel 
1663/12407/16424 blk 23
rmgr: Transaction len (rec/tot):236/   265, tx:   6893, lsn: 
0/01A298F8, prev 0/01A298B8, desc: PREPARE
pg_xlogdump: FATAL:  error in WAL record at 0/1ACBBF8: invalid record 
length at 0/1ACBC30: wanted 24, got 0


You can see one update instead of two.

Sorry, I have not saved the trace with the output of the logical decoder. But it 
really decodes just one update!

What I have done:

DROP_REPLICATION_SLOT "mtm_slot_1";
CREATE_

[HACKERS] Logical decoding restart problems

2016-08-19 Thread konstantin knizhnik
Hi,

We are using logical decoding in multimaster and we are faced with the problem 
that inconsistent transactions are sent to the replica.
Briefly, multimaster is using logical decoding in this way:
1. Each multimaster node is connected with each other node using a logical 
decoding channel, so each pair of nodes has its own replication slot.
2. In the normal scenario each replication channel is used to replicate only 
those transactions which originated at the source node.
We are using the origin mechanism to skip "foreign" transactions.
3. When an offline cluster node is returned back to the multimaster, we need to 
recover this node to the current cluster state.
Recovery is performed from one of the cluster's nodes, so we are using only one 
replication channel to receive all (own and foreign) transactions.
Only in this case can we guarantee a consistent order of applying transactions 
at the recovered node.
After the end of recovery we need to recreate the replication slots with all 
other cluster nodes (because we have already replayed transactions from these 
nodes).
To restart logical decoding we first drop the existing slot, then create a new 
one, and then start logical replication from WAL position 0/0 (invalid LSN).
In this case recovery should be started from the last consistent point.

The problem is that for some reason the consistent point is not so consistent, 
and we get partly decoded transactions.
I.e. the transaction body consists of two UPDATEs, but the reorder buffer 
extracts only one (the last) update and sends this truncated transaction to the 
destination, causing a consistency violation at the replica.  I started an 
investigation of the logical decoding code and found several things which I do 
not understand.

Assume that we have transactions T1={start_lsn=100, end_lsn=400} and 
T2={start_lsn=200, end_lsn=300}.
Transaction T2 is sent to the replica and the replica confirms that flush_lsn=300.
If we now want to restart logical decoding, we cannot start with a position less 
than 300, because CreateDecodingContext doesn't allow it:

 * start_lsn
 *		The LSN at which to start decoding.  If InvalidXLogRecPtr, restart
 *		from the slot's confirmed_flush; otherwise, start from the specified
 *		location (but move it forwards to confirmed_flush if it's older than
 *		that, see below).
 *
	else if (start_lsn < slot->data.confirmed_flush)
	{
		/*
		 * It might seem like we should error out in this case, but it's
		 * pretty common for a client to acknowledge a LSN it doesn't have to
		 * do anything for, and thus didn't store persistently, because the
		 * xlog records didn't result in anything relevant for logical
		 * decoding. Clients have to be able to do that to support synchronous
		 * replication.
		 */

So does it mean that we have no chance to restore T1?
What is worse, if there are valid T1 transaction records with lsn >= 300, then 
we can partly decode T1 and send this truncated T1' to the replica.
Did I miss something here?

Is there any alternative way to "seek" a slot to the proper position without 
actually fetching data from it or recreating the slot?
Is there any mechanism in xlog which can enforce consistent decoding of a 
transaction (so that no transaction records are missed)?
Maybe I missed something, but I didn't find any "record_number" or anything 
else which can identify the first record of a transaction.

Thanks in advance,
Konstantin Knizhnik,
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [HACKERS] WIP: Barriers

2016-08-15 Thread Konstantin Knizhnik


On 15.08.2016 15:42, Thomas Munro wrote:
This implementation is using a spinlock for the arrival counter, and 
signals (via Robert's condition variables and latches) for waking up 
peer processes when the counter reaches the target. I realise that 
using signals for this sort of thing is a bit unusual outside the 
Postgres universe, but won't a semaphore-based implementation require 
just as many system calls, context switches and scheduling operations? 


Yes, you are right.
I never expected that this combination of signal + local socket + select could 
provide performance comparable with pthread_cond_t.
I have implemented a simple test where two background workers emulate a 
request-response round-trip using latches and pthread primitives.
The result (average round-trip time) was 7.49 microseconds for Postgres 
latches vs. 4.59 microseconds for pthread_cond_timedwait.





#define N_ROUNDTRIPS 100
#define WAIT_LATCH_TIMEOUT 6

static void PongLatch(Datum arg)
{
int i;
timestamp_t start;
int result;

BackgroundWorkerUnblockSignals();

Mtm->pong = MyProc->pgprocno;
ResetLatch(&MyProc->procLatch);
MtmSleep(100);
Assert(Mtm->ping);


for (i = 0; i <= N_ROUNDTRIPS; i++) {
result = WaitLatch(&MyProc->procLatch, WL_LATCH_SET|WL_TIMEOUT, 
WAIT_LATCH_TIMEOUT);

Assert(result & WL_LATCH_SET);
ResetLatch(&MyProc->procLatch);
SetLatch(&ProcGlobal->allProcs[Mtm->ping].procLatch);
if (i == 0) {
   start = MtmGetSystemTime();
}
}
fprintf(stderr, "Average roundtrip time: %f microseconds\n", 
(double)(MtmGetSystemTime() - start) / N_ROUNDTRIPS);

}

static void PingLatch(Datum arg)
{
int i;
timestamp_t start;
int result;

BackgroundWorkerUnblockSignals();

Mtm->ping = MyProc->pgprocno;
ResetLatch(&MyProc->procLatch);
MtmSleep(100);
Assert(Mtm->pong);

for (i = 0; i <= N_ROUNDTRIPS; i++) {
SetLatch(&ProcGlobal->allProcs[Mtm->pong].procLatch);
result = WaitLatch(&MyProc->procLatch, WL_LATCH_SET|WL_TIMEOUT, 
WAIT_LATCH_TIMEOUT);

Assert(result & WL_LATCH_SET);
ResetLatch(&MyProc->procLatch);
if (i == 0) {
   start = MtmGetSystemTime();
}
}
fprintf(stderr, "Average roundtrip time: %f microseconds\n", 
(double)(MtmGetSystemTime() - start) / N_ROUNDTRIPS);

}


static BackgroundWorker Pinger = {
"ping",
BGWORKER_SHMEM_ACCESS,// | BGWORKER_BACKEND_DATABASE_CONNECTION,
BgWorkerStart_ConsistentState,
BGW_NEVER_RESTART,
PingLatch
};

static BackgroundWorker Ponger = {
"pong",
BGWORKER_SHMEM_ACCESS,// | BGWORKER_BACKEND_DATABASE_CONNECTION,
BgWorkerStart_ConsistentState,
BGW_NEVER_RESTART,
PongLatch
};

static void PingPong()
{
RegisterBackgroundWorker(&Pinger);
RegisterBackgroundWorker(&Ponger);
}
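
For reference, the pthread side of such a round-trip test can be approximated
by the following minimal thread-based sketch (this is only an illustration, not
the code that was actually benchmarked; it uses plain pthread_cond_wait between
two threads instead of the process-shared pthread_cond_timedwait measured
above, and the timing code is omitted):

#include <pthread.h>

#define N_ROUNDTRIPS 1000000

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static int turn = 0;            /* 0: ping's turn, 1: pong's turn */

static void *
ping(void *arg)
{
    for (int i = 0; i < N_ROUNDTRIPS; i++)
    {
        pthread_mutex_lock(&mtx);
        turn = 1;                       /* hand over to pong */
        pthread_cond_signal(&cv);
        while (turn != 0)               /* wait until pong answers */
            pthread_cond_wait(&cv, &mtx);
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

static void *
pong(void *arg)
{
    for (int i = 0; i < N_ROUNDTRIPS; i++)
    {
        pthread_mutex_lock(&mtx);
        while (turn != 1)               /* wait for ping */
            pthread_cond_wait(&cv, &mtx);
        turn = 0;                       /* answer */
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&mtx);
    }
    return NULL;
}

int
main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, ping, NULL);
    pthread_create(&t2, NULL, pong, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}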




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Barriers

2016-08-14 Thread konstantin knizhnik
Hi Thomas,

Barriers are really a very simple and convenient mechanism for process 
synchronization.
But a barrier is actually a special case of semaphores: having a semaphore 
primitive, it is trivial to implement a barrier, as the sketch below illustrates.
We have semaphores in Postgres, but ... they cannot be used by extensions: 
there is a fixed number of semaphores allocated based on the maximal number of 
connections, and there is no mechanism for requesting additional semaphores.
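
To make that concrete, here is a minimal one-shot barrier built only from POSIX
counting semaphores (a sketch for illustration, not Postgres code; a reusable
barrier would additionally need a second turnstile):

#include <semaphore.h>

typedef struct
{
    int     nworkers;       /* number of participants */
    int     arrived;        /* how many have reached the barrier */
    sem_t   mutex;          /* protects 'arrived' */
    sem_t   turnstile;      /* posted nworkers times by the last arriver */
} SemBarrier;

static void
sem_barrier_init(SemBarrier *b, int nworkers)
{
    b->nworkers = nworkers;
    b->arrived = 0;
    /* pshared=1: usable across processes if the struct lives in shared memory */
    sem_init(&b->mutex, 1, 1);
    sem_init(&b->turnstile, 1, 0);
}

static void
sem_barrier_wait(SemBarrier *b)
{
    sem_wait(&b->mutex);
    if (++b->arrived == b->nworkers)
    {
        int     i;

        for (i = 0; i < b->nworkers; i++)   /* release everyone, self included */
            sem_post(&b->turnstile);
    }
    sem_post(&b->mutex);
    sem_wait(&b->turnstile);    /* block until the last worker has arrived */
}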


Robert has recently proposed condition variables, which are also very useful. 
Right now we have spinlocks, LWLocks and latches.
From my point of view this is not enough. While creating various extensions for 
Postgres I always feel the lack of such synchronization primitives as events 
(condition variables) and semaphores. Events and semaphores are similar, and it 
is possible to implement either of them based on the other. But from the user's 
point of view they have different semantics and use cases, so it is better to 
provide both of them.

I wonder if we should provide a system abstraction layer (SAL) for Postgres, 
where we can isolate all system-dependent code and which will be available for 
core developers as well as for developers of extensions?  The obvious 
candidates for SAL are:
1. Synchronization primitives (locks, events, semaphores, mutexes, spinlocks, 
latches, barriers)
2. Shared memory
3. File access
4. Network sockets
5. Process control (fork, ...)

Certainly it requires a lot of refactoring, but it will make the Postgres code 
much more elegant and easier to read and maintain.
Also, it is not necessary to do all these changes in one step: here we do not 
need atomic transactions :)
We can start, for example, with synchronization primitives, since a lot of 
changes are proposed in this area in any case.

Parallel execution is one of the most promising approaches to improving Postgres 
performance. I do not mean just parallel execution of a single query: 
parallel vacuum, parallel index creation, parallel sort, ... And to implement 
all this stuff we definitely need convenient and efficient synchronization 
primitives.
The set of such primitives can be discussed. IMHO it should include RW-locks 
(the current LWLocks), mutexes (spinlock + some mechanism to wait), events 
(condition variables), semaphores and barriers (based on semaphores). Latches 
can be kept for backward compatibility or be replaced with events.
I wonder if somebody has measured how much slower latches (signal+socket) are 
than POSIX semaphores or condition variables?



On Aug 14, 2016, at 2:18 AM, Thomas Munro wrote:

> Hi hackers,
> 
> I would like to propose "barriers" for Postgres processes.  A barrier
> is a very simple mechanism for coordinating parallel computation, as
> found in many threading libraries.
> 
> First, you initialise a Barrier object somewhere in shared memory,
> most likely in the DSM segment used by parallel query, by calling
> BarrierInit(, nworkers).  Then workers can call
> BarrierWait() when they want to block until all workers arrive
> at the barrier.  When the final worker arrives, BarrierWait returns in
> all workers, releasing them to continue their work.  One arbitrary
> worker receives a different return value as a way of "electing" it to
> perform serial phases of computation.  For parallel phases of
> computation, the return value can be ignored.  For example, there may
> be preparation, merging, or post-processing phases which must be done
> by just one worker, interspersed with phases where all workers do
> something.
> 
> My use case for this is coordinating the phases of parallel hash
> joins, but I strongly suspect there are other cases.  Parallel sort
> springs to mind, which is why I wanted to post this separately and
> earlier than my larger patch series, to get feedback from people
> working on other parallel features.
> 
> A problem that I'm still grappling with is how to deal with workers
> that fail to launch.  What I'm proposing so far is based on static
> worker sets, where you can only give the number of workers at
> initialisation time, just like pthread_barrier_init.  Some other
> libraries allow for adjustable worker sets, and I'm wondering if a
> parallel leader might need to be able to adjust the barrier when it
> hears of a worker not starting.  More on that soon.
> 
> Please see the attached WIP patch.  I had an earlier version with its
> own waitlists and signalling machinery etc, but I've now rebased it to
> depend on Robert Haas's proposed condition variables, making this code
> much shorter and sweeter.  So it depends on his
> condition-variable-vX.patch[1], which in turn depends on my
> lwlocks-in-dsm-vX.patch[2] (for proclist).
> 
> When Michaël Paquier's work on naming wait points[3] lands, I plan to
> include event IDs as an extra argument to BarrierWait which will be
> passed though so as to show up in pg_stat_activity.  Then you'll be
> able to see where workers are waiting for each other!  For now I
> didn't want to tangle this up with yet 

Re: [HACKERS] Restriction of windows functions

2016-06-17 Thread Konstantin Knizhnik



On 17.06.2016 17:01, Tom Lane wrote:


Certainly it assumes that window is ordered by key and the key type
supports subtraction, so "text" can not be used here.
IIRC, the sticking point was defining a reasonably datatype-independent
(i.e. extensible) notion of distance.


Why is it not possible just to locate the "-" or "+" operator for this type?
I do not see any difference here from locating the comparison operator 
needed for sorting.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Restriction of windows functions

2016-06-17 Thread Konstantin Knizhnik

Hi,

One of the popular queries in financial analytic systems is to calculate 
some moving aggregate within some time interval, e.g. a moving average of 
the trade price within a 5-minute window. Unfortunately this case is not 
supported by PostgreSQL:


select symbol,date,avg(price) over (order by date range between '5 
minutes' preceding and current row) from Trades;

ERROR:  RANGE PRECEDING is only supported with UNBOUNDED

Is there some fundamental problem in implementing this kind of window frame?
Maybe I missed something, but it seems to me that it should not be very 
difficult.
There is the update_frameheadpos function, which adjusts the head position of 
the window in ROWS mode and reports an error in RANGE mode:

		if (frameOptions & FRAMEOPTION_ROWS)
		{
			/* In ROWS mode, bound is physically n before/after current */
			int64		offset = DatumGetInt64(winstate->startOffsetValue);

			if (frameOptions & FRAMEOPTION_START_VALUE_PRECEDING)
				offset = -offset;

			winstate->frameheadpos = winstate->currentpos + offset;
			/* frame head can't go before first row */
			if (winstate->frameheadpos < 0)
				winstate->frameheadpos = 0;
			else if (winstate->frameheadpos > winstate->currentpos)
			{
				/* make sure frameheadpos is not past end of partition */
				spool_tuples(winstate, winstate->frameheadpos - 1);
				if (winstate->frameheadpos > winstate->spooled_rows)
					winstate->frameheadpos = winstate->spooled_rows;
			}
			winstate->framehead_valid = true;
		}
		else if (frameOptions & FRAMEOPTION_RANGE)
		{
			/* parser should have rejected this */
			elog(ERROR, "window frame with value offset is not implemented");
		}
		else

The straightforward approach to support RANGE mode is to advance the head 
position until the "distance" between the head and the current row is less than 
or equal to the specified range value. It doesn't look too complex to 
implement, does it? Are there any caveats?
Certainly it assumes that the window is ordered by a key and the key type 
supports subtraction, so "text" cannot be used here.
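
To make the idea concrete, a rough sketch of what the missing RANGE branch for
the "n PRECEDING" case could look like; keys_within_offset is a hypothetical
helper that would fetch the ORDER BY keys of the two rows and apply the key
type's "-" operator plus a comparison:

		else if (frameOptions & FRAMEOPTION_RANGE)
		{
			/*
			 * Advance the head row by row until it is within the given
			 * offset of the current row in the ORDER BY key (sketch only).
			 */
			while (winstate->frameheadpos < winstate->currentpos &&
				   !keys_within_offset(winstate,
									   winstate->frameheadpos,
									   winstate->currentpos,
									   winstate->startOffsetValue))
				winstate->frameheadpos++;
			winstate->framehead_valid = true;
		}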

Something else?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] array of domain types

2016-06-03 Thread Konstantin Knizhnik

On 03.06.2016 02:02, Rod Taylor wrote:



On Thu, Jun 2, 2016 at 10:42 AM, Konstantin Knizhnik 
<k.knizh...@postgrespro.ru <mailto:k.knizh...@postgrespro.ru>> wrote:


On 02.06.2016 17:22, Tom Lane wrote:

    konstantin knizhnik <k.knizh...@postgrespro.ru
<mailto:k.knizh...@postgrespro.ru>> writes:

Attached please find patch for DefineDomain function.

You didn't attach the patch,


Sorry, but I did attached the patch - I see the attachment in my
mail received from the group.
Multidimensional arrays work fine:

knizhnik=# SELECT '{{14},{20}}'::teenager[][];
ERROR:  value for domain teenager violates check constraint
"teenager_check"
LINE 1: SELECT '{{14},{20}}'::teenager[][];
   ^
knizhnik=# SELECT '{{14},{19}}'::teenager[][];
  teenager
-
 {{14},{19}}
(1 row)

knizhnik=# SELECT ('{{14},{19}}'::teenager[][])[1][1];
 teenager
--
   14
(1 row)


Domain of array of domain also works:


I applied the domain.patch from above on HEAD, and all I get is cache 
lookup failures. The type_sanity regression test fails too.


postgres=# CREATE DOMAIN teenager AS int CHECK (VALUE BETWEEN 13 AND 20);
CREATE DOMAIN
postgres=# CREATE DOMAIN teenager_groups AS teenager[];
CREATE DOMAIN
postgres=# CREATE TABLE x (col teenager_groups);
ERROR:  cache lookup failed for type 0


Anyway, if that worked for me I would have done this which I expect 
will succeed when it shouldn't.


INSERT INTO x VALUES (ARRAY[13,14,20]);
ALTER DOMAIN teenager DROP CONSTRAINT teenager_check;
ALTER DOMAIN teenager ADD CHECK (VALUE BETWEEN 13 AND 19);



Sorry, the problem is more difficult than I originally expected :(
The attached patch passes all regression tests and correctly handles 
conversion of arrays.
But constraints are not checked for table columns. I failed to locate the 
place where this check should be inserted...


Originally I was mostly interested in domains as a kind of typedef: a 
convenient way to assign a type to some particular kind of column, 
for example an object reference used in an ORM.
There are two main goals of using a domain here:
1. Being able to easily change the representation of the object identifier, for 
example from integer to bigint.
2. Detecting all columns containing references (distinguishing them from 
columns containing just normal integers).
I do not see any other mechanism in PostgreSQL which can address this 
problem (for example, a user-defined type cannot help here).


I wonder if it is possible to support arrays of domains which do not have 
constraints?

Or is such partial support worse than prohibiting arrays of domains altogether?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/commands/typecmds.c b/src/backend/commands/typecmds.c
index 71d4df9..e227fa8 100644
--- a/src/backend/commands/typecmds.c
+++ b/src/backend/commands/typecmds.c
@@ -752,11 +752,13 @@ DefineDomain(CreateDomainStmt *stmt)
 	ListCell   *listptr;
 	Oid			basetypeoid;
 	Oid			old_type_oid;
+	Oid array_oid;
 	Oid			domaincoll;
 	Form_pg_type baseType;
 	int32		basetypeMod;
 	Oid			baseColl;
 	ObjectAddress address;
+	char   *array_type;
 
 	/* Convert list of names to a name and namespace */
 	domainNamespace = QualifiedNameGetCreationNamespace(stmt->domainname,
@@ -1023,6 +1025,14 @@ DefineDomain(CreateDomainStmt *stmt)
 		}
 	}
 
+
+	/*
+	 * OK, we're done checking, time to make the type.  We must assign the
+	 * array type OID ahead of calling TypeCreate, since the base type and
+	 * array type each refer to the other.
+	 */
+	array_oid = AssignTypeArrayOid();
+
 	/*
 	 * Have TypeCreate do all the real work.
 	 */
@@ -1047,7 +1057,7 @@ DefineDomain(CreateDomainStmt *stmt)
    analyzeProcedure,	/* analyze procedure */
    InvalidOid,	/* no array element type */
    false,		/* this isn't an array */
-   InvalidOid,	/* no arrays for domains (yet) */
+   array_oid,	/* array type we are about to create */
    basetypeoid, /* base type ID */
    defaultValue,	/* default type value (text) */
    defaultValueBin,		/* default type value (binary) */
@@ -1060,6 +1070,48 @@ DefineDomain(CreateDomainStmt *stmt)
    domaincoll); /* type's collation */
 
 	/*
+	 * Create the array type that goes with it.
+	 */
+	array_type = makeArrayTypeName(domainName, domainNamespace);
+
+/* alignment must be 'i' or 'd' for arrays */
+	alignment = (alignment == 'd') ? 'd' : 'i';
+
+	TypeCreate(array_oid,/* force assignment of this type OID */
+			   array_type,/* type name */
+			   domainNamespace,/* namespace */
+			   InvalidOid,/* relation oid (n/a here) */
+			   0,/* relation kind (ditto) */
+			   GetUserId(),/* owner's ID */
+			   -1,/* internal size (always varlena) */
+			   TYPTYPE_BASE,/* type-type (base type) */
+			   TYPCATEGORY_ARRAY,/* type-category (array) */
+			   fal

Re: [HACKERS] XTM & parallel search

2016-06-03 Thread Konstantin Knizhnik



On 03.06.2016 16:05, Amit Kapila wrote:
On Fri, Jun 3, 2016 at 1:34 AM, Konstantin Knizhnik 
<k.knizh...@postgrespro.ru <mailto:k.knizh...@postgrespro.ru>> wrote:


We have to add three more functions to eXtensible Transaction
Manager API (XTM):

/*
 * Calculate transaction state size. This method is invoked by
EstimateTransactionStateSpace to copy transaction
 * state to parallel workers
 */
size_t  (*GetTransactionStateSize)(void);

/*
 * Serialize transaction state
 */
void(*SerializeTransactionState)(void* ctx);

/*
 * Deserialize transaction state
 */
void(*DeserializeTransactionState)(void* ctx);



In the above proposal, are you suggesting changing the existing APIs as
well, because the parameters of the function pointers don't match the
existing APIs? I think it is better to consider this along with the
overall XTM API.


Sorry, but right now I have not replaced the existing functions
EstimateTransactionStateSpace/SerializeTransactionState/StartParallelWorkerTransaction
with XTM indirect calls. It was my original intention, but these
functions access the static variable CurrentTransactionState defined in xact.c.
So if a user-defined TM wants to override these functions, it will have to
invoke the original functions to save/restore CurrentTransactionState.

It is not convenient.
This is why the three XTM functions above are now called by the existing xact.c
functions to save additional state, for example:


Size
EstimateTransactionStateSpace(void)
{
	TransactionState s;
	Size		nxids = 6;		/* iso level, deferrable, top & current XID,
								 * command counter, XID count */

	for (s = CurrentTransactionState; s != NULL; s = s->parent)
	{
		if (TransactionIdIsValid(s->transactionId))
			nxids = add_size(nxids, 1);
		nxids = add_size(nxids, s->nChildXids);
	}

	nxids = add_size(nxids, nParallelCurrentXids);
	nxids = mul_size(nxids, sizeof(TransactionId));
	return add_size(nxids, TM->GetTransactionStateSize());
}

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



[HACKERS] XTM & parallel search

2016-06-02 Thread Konstantin Knizhnik

We have to add three more functions to eXtensible Transaction Manager API (XTM):

/*
 * Calculate transaction state size. This method is invoked by 
EstimateTransactionStateSpace to copy transaction
 * state to parallel workers
 */
size_t  (*GetTransactionStateSize)(void);

/*
 * Serialize transaction state
 */
void(*SerializeTransactionState)(void* ctx);

/*
 * Deserialize transaction state
 */
void(*DeserializeTransactionState)(void* ctx);

The reason is that we found out that our multimaster does not work correctly when
max_parallel_workers > 0,
because the multimaster transaction context is not properly shared between workers.
Unfortunately right now serialization/deserialization of transaction state is
hardcoded in xact.c and IMHO is done in a quite ugly way:

	XactIsoLevel = (int) tstate[0];
	XactDeferrable = (bool) tstate[1];
	XactTopTransactionId = tstate[2];
	CurrentTransactionState->transactionId = tstate[3];
	currentCommandId = tstate[4];
	nParallelCurrentXids = (int) tstate[5];
	ParallelCurrentXids = &tstate[6];

- there is not even a declared structure for the fixed part of the saved context.
I wonder whether not only a DTM would be interested in sharing some common state
between workers: should we provide some way of replicating user-defined
context between workers? From my point of view XTM seems to be a good place for
it...
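To illustrate what I mean, a minimal sketch of a transaction manager plugin
using the proposed callbacks could look like this (DtmTransState and the
function names are hypothetical, and I assume ctx points into the buffer
managed by xact.c; this is not part of any existing patch):

#include "postgres.h"

/* Hypothetical private state a DTM wants to share with parallel workers. */
typedef struct DtmTransState
{
	uint64		csn;		/* commit sequence number / snapshot */
	uint32		node_id;	/* node which originated the transaction */
} DtmTransState;

static DtmTransState dtm_state;

/* Returned to EstimateTransactionStateSpace() via TM->GetTransactionStateSize() */
static size_t
DtmGetTransactionStateSize(void)
{
	return sizeof(DtmTransState);
}

/* Copy the private state into the buffer built by SerializeTransactionState() */
static void
DtmSerializeTransactionState(void *ctx)
{
	memcpy(ctx, &dtm_state, sizeof(DtmTransState));
}

/* Restore the private state in a parallel worker */
static void
DtmDeserializeTransactionState(void *ctx)
{
	memcpy(&dtm_state, ctx, sizeof(DtmTransState));
}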



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] array of domain types

2016-06-02 Thread Konstantin Knizhnik

On 02.06.2016 17:22, Tom Lane wrote:

konstantin knizhnik <k.knizh...@postgrespro.ru> writes:

Attached please find patch for DefineDomain function.

You didn't attach the patch,


Sorry, but I did attach the patch - I see the attachment in my mail
received from the group.

Multidimensional arrays work fine:

knizhnik=# SELECT '{{14},{20}}'::teenager[][];
ERROR:  value for domain teenager violates check constraint "teenager_check"
LINE 1: SELECT '{{14},{20}}'::teenager[][];
   ^
knizhnik=# SELECT '{{14},{19}}'::teenager[][];
  teenager
-
 {{14},{19}}
(1 row)

knizhnik=# SELECT ('{{14},{19}}'::teenager[][])[1][1];
 teenager
--
   14
(1 row)


Domain of array of domain also works:


knizhnik=# create domain teenager_groups as teenager[];
CREATE DOMAIN
knizhnik=# SELECT '{{14},{19}}'::teenager_groups;
 teenager_groups
-
 {{14},{19}}
(1 row)

knizhnik=# SELECT '{{14},{20}}'::teenager_groups;
ERROR:  value for domain teenager violates check constraint "teenager_check"
LINE 1: SELECT '{{14},{20}}'::teenager_groups;



  but in any case, I would be astonished
if there is no work required beyond creating the matching array type.
The reverse case (domains over arrays) has half a dozen special cases
required to make it work smoothly.  Probably the considerations on this
side are totally different, but it's hard to believe there aren't any.

One case that seems likely to be pretty squishy is an array of a domain
over an array type.  One would wish to be able to do foo[2][4] to
extract an element of the contained array.  That won't work as-is
because the notation will be taken as a multi-dimensional subscript,
but I would expect that (foo[2])[4] should work.  Does it?  Does
ruleutils.c always include the necessary parens when reverse-listing
such a construct?  Is it possible to assign to such a sub-element,
and if so, do the domain constraints get checked properly?

Domain over an array that is of a domain type might be another fun
case.

regards, tom lane


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] array of domain types

2016-06-02 Thread konstantin knizhnik
On Jun 2, 2016, at 12:29 PM, Thom Brown wrote:

On 2 June 2016 at 10:13, konstantin knizhnik <k.knizh...@postgrespro.ru> wrote:

Yes, it doesn't work:

# CREATE DOMAIN teenager AS int CHECK (VALUE BETWEEN 13 AND 19);
CREATE DOMAIN
# SELECT 14::teenager;
 teenager
--
       14
(1 row)

# SELECT 20::teenager;
ERROR:  value for domain teenager violates check constraint "teenager_check"

# SELECT '{14,20}'::teenager[];
 teenager
--
 {14,20}
(1 row)

That last one should fail.

Yes, I see. This approach was wrong.
Attached please find patch for DefineDomain function.

domain.patch
Description: Binary data
Thom



Re: [HACKERS] array of domain types

2016-06-02 Thread konstantin knizhnik

On Jun 1, 2016, at 4:37 PM, Thom Brown wrote:

> On 1 June 2016 at 14:20, Konstantin Knizhnik <k.knizh...@postgrespro.ru> 
> wrote:
> I wonder why domain types can not be used for specification of array element:
> 
> create domain objref as bigint;
> create table foo(x objref[]);
> ERROR:  type "objref[]" does not exist
> create table foo(x bigint[]);
> CREATE TABLE
> 
> Is there some principle problem here or it is just not implemented?
> 
> It's not implemented, but patches welcome.
>  
> Thom

The patch is trivial:  just use typbasetype in get_array_type if typtype is 
TYPTYPE_DOMAIN:

diff --git a/src/backend/utils/cache/lsyscache.c b/src/backend/utils/cache/lsyscache.c
index cb26d79..ecfbb20 100644
--- a/src/backend/utils/cache/lsyscache.c
+++ b/src/backend/utils/cache/lsyscache.c
@@ -2486,7 +2486,18 @@ get_array_type(Oid typid)
 	if (HeapTupleIsValid(tp))
 	{
 		result = ((Form_pg_type) GETSTRUCT(tp))->typarray;
-		ReleaseSysCache(tp);
+		if (result == InvalidOid && ((Form_pg_type) GETSTRUCT(tp))->typtype == TYPTYPE_DOMAIN) {
+			typid = ((Form_pg_type) GETSTRUCT(tp))->typbasetype;
+			ReleaseSysCache(tp);
+			tp = SearchSysCache1(TYPEOID, ObjectIdGetDatum(typid));
+			if (HeapTupleIsValid(tp))
+			{
+				result = ((Form_pg_type) GETSTRUCT(tp))->typarray;
+				ReleaseSysCache(tp);
+			}
+		} else {
+			ReleaseSysCache(tp);
+		}
 	}
 	return result;
 }


Any problems with it?



[HACKERS] array of domain types

2016-06-01 Thread Konstantin Knizhnik
I wonder why domain types cannot be used for the specification of array
elements:


create domain objref as bigint;
create table foo(x objref[]);
ERROR:  type "objref[]" does not exist
create table foo(x bigint[]);
CREATE TABLE

Is there some fundamental problem here, or is it just not implemented?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Logical replication & oldest XID.

2016-05-31 Thread Konstantin Knizhnik

Hi,

We are using logical replication in multimaster and have faced an interesting
problem with a "frozen" procArray->replication_slot_xmin.
This variable is adjusted by ProcArraySetReplicationSlotXmin, which is
invoked by ReplicationSlotsComputeRequiredXmin, which
in turn is called by LogicalConfirmReceivedLocation. If transactions
are executed at all nodes of the multimaster, then everything works fine:
replication_slot_xmin is advanced. But if we send transactions only to
one multimaster node and broadcast these changes to the other nodes, then no
data is sent through the replication slots at those nodes. No data sent - no
confirmations, LogicalConfirmReceivedLocation is not called and
procArray->replication_slot_xmin keeps its original value, 599.


As a result the GetOldestXmin function always returns 599, so autovacuum is
effectively blocked and our multimaster is not able to perform cleanup of the
XID->CSN map, which causes a shared memory overflow. This situation happens
only when write transactions are sent to only one node or if there are
no write transactions at all.
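For reference, the stuck horizon is easy to observe from SQL (the columns below
are from the standard pg_replication_slots view): the xmin/catalog_xmin of the
idle slots simply never advance, and they are what feeds
procArray->replication_slot_xmin:

SELECT slot_name, active, xmin, catalog_xmin, restart_lsn
FROM pg_replication_slots;

-- compare with the normal transaction horizon of the cluster
SELECT txid_snapshot_xmin(txid_current_snapshot());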


Before implementing some workaround (for example, forcing periodic calls to
ReplicationSlotsComputeRequiredXmin), I want to understand whether it is a real
problem of logical replication or we are doing something wrong. BDR
should face the same problem if all updates are performed from
one node...


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Inheritance

2016-05-24 Thread Konstantin Knizhnik
There is one aspect of inheritance support which was not mentioned:
polymorphic queries.
Actually polymorphism is a fundamental feature of OOP; without it
there is no behavioral inheritance, and inheritance can be considered
just "syntactic sugar" for sharing some common subset of attributes
between tables.


The main problem with supporting polymorphic queries is that a SQL query
returns a set of tuples, not a set of objects.
So there is no nice way to return both instances of base and derived
tables. There are several alternatives
(for example, returning the joined set of attributes of all derived tables,
leaving the missing ones as NULLs), but none of them is good.
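For illustration, a sketch of this "joined set of attributes" alternative with
the current inheritance mechanism (a made-up schema, assuming ids are unique
across the hierarchy); tableoid at least lets the client recover the concrete
type of each returned row:

CREATE TABLE person   (id int, name text);
CREATE TABLE employee (salary numeric)  INHERITS (person);
CREATE TABLE student  (university text) INHERITS (person);

-- One row per object: the concrete class plus the union of all attributes,
-- with attributes of the other derived tables left as NULLs.
SELECT p.tableoid::regclass AS class, p.id, p.name, e.salary, s.university
FROM person p
     LEFT JOIN employee e ON e.id = p.id
     LEFT JOIN student  s ON s.id = p.id;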


There is a more specific problem - polymorphic function calls, see:

http://postgresql.nabble.com/Polymorphic-function-calls-td5784772.html

which in principle can be solved... But it looks like there is no such
intention.


I do not know how many people will be interested in normal OO support
from a DBMS.
Right now PostgreSQL inheritance is used for different purposes:
partitioning, FDW-based replication or sharding, ...
But IMHO it is mostly because the current inheritance implementation just
does not make it possible to use it for OOP.
As far as I know, there were a lot of discussions before about whether arrays,
json and other types non-traditional for an RDBMS are needed.
But right now they are quite popular and frequently used. The same thing
can happen with OO support: it will be popular if we provide the necessary
functionality.


Right now OO DBMS applications are mostly using ORMs, and they are not
using PostgreSQL inheritance.
Mostly because ORMs tend to be more flexible in providing different
approaches for implementing inheritance (for example, in most cases
placing all derived classes in a single table works better than an approach
based on appending the results of several queries).
Another reason is that an ORM usually supports more than one DBMS, and since
most of them have no built-in support for inheritance,

the ORM has to implement inheritance itself.
But I do not think that the presence of ORMs eliminates the need for
built-in OO support in the DBMS.




On 24.05.2016 16:51, Jim Nasby wrote:

On 5/23/16 10:30 PM, Craig Ringer wrote:

I find it notable that no popular ORM has bothered adopting PostgreSQL's
inheritance features, and instead just use big left joins or repeated
SELECTs to implement parent/child relationships, with foreign keys
enforcing constraints.


Since when do we consider ORMs to be an example of how to do good 
database design?


In this case, I'm sure no ORM uses the feature because either the 
ORM's authors have no clue it exists (except maybe for partitioning) 
or because we're the only mainstream database that has it.



I consider inheritance mostly useless without the ability to have UNIQUE
indexes that span a parent relation and all its children. You can use
them for partitioning only by sacrificing a bunch of integrity
protection or creating messy chains of FKs between individual 
partitions.


There's ways around this issue, but I agree that it's a very unhappy 
situation right now.



I'd rather like to quietly deprecate inheritance and eventually remove
it once we have real partitioning and some time has passed...


IMO that's a very unfortunate attitude to have for the "Worlds Most 
Advanced Open Source Database". Now that OO programming isn't treated 
as a magic bullet it's proven to be an extremely powerful tool, 
especially when used to encapsulate complex data. The 
ndarray->Series->DataFrame->Panel inheritance tree is a great example 
of this. Feature-wise, we need to be doing *more* of this kind of 
work, not less. Lack of support for OO paradigms was one of the 
drivers for NoSQL.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Just-in-time compiling things

2016-05-14 Thread Konstantin Knizhnik

On 05/14/2016 12:10 PM, Andreas Seltenreich wrote:

Konstantin Knizhnik writes:


Latest information from ISP RAS guys: them have made good progress
since February: them have rewritten most of methods of Scan, Aggregate
and Join to LLVM API.

Is their work available somewhere?  I'm experimenting in that area as
well, although I'm using libFirm instead of LLVM.  I wonder what their
motivation to rewrite backend code in LLVM IR was, since I am following
the approach of keeping the IR around when compiling the vanilla
postgres C code, possibly inlining it during JIT and then doing
optimizations on this IR.  That way the logic doesn't have to be
duplicated.


The work is not yet completed, but eventually it will definitely be put into open
source.
I am going to talk a little bit about this project at PGCon in Ottawa at the
lightning talks, although I do not know the details of the project myself.
The main difference of their approach compared with Vitesse DB is that they
implemented a way to automatically convert PostgreSQL operators to LLVM IR.
So instead of rewriting ~2000 operators manually (a lot of work and errors), they
implemented a converter which transforms the code of these operators to ... C++
code producing LLVM IR. So manually they only need to rewrite the plan nodes. They
have already implemented most of the nodes (SeqScan, Sort, HashJoin, ...), which
allows all TPC-H queries to be executed. Results will be published soon. The largest
advantage is definitely at Q1 - about 4 times. It is worse than Vitesse DB (8 times)
and than manually written operators (7 times). The most probable reason for such a
performance penalty is overflow checking: in manually written LLVM code it can be
done more efficiently, using the corresponding assembler instruction, than in code
automatically converted from standard C.

But the ISP RAS guys are going to fix this problem and improve the quality of the
automatic conversion.
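To illustrate the overflow-checking point (my own sketch, not their code): a
portable checked int32 addition in C needs explicit comparisons on every call,
while the GCC/Clang builtin is typically lowered to a single add followed by a
jump on the overflow flag:

#include <stdbool.h>
#include <stdint.h>

/* Portable variant: extra compare/branch work before every addition. */
static bool
add_int32_portable(int32_t a, int32_t b, int32_t *result)
{
	if ((b > 0 && a > INT32_MAX - b) || (b < 0 && a < INT32_MIN - b))
		return false;			/* would overflow */
	*result = a + b;
	return true;
}

/* Builtin variant: compilers emit "add" + "jo" on x86. */
static bool
add_int32_checked(int32_t a, int32_t b, int32_t *result)
{
	return !__builtin_add_overflow(a, b, result);
}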

I include in CC members of ISP RAS team - you can ask them questions directly.





regrads
Andreas



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Academic help for Postgres

2016-05-12 Thread konstantin knizhnik

On May 12, 2016, at 6:16 AM, Rajeev rastogi wrote:

> On 11 May 2016 19:50, Bruce Momjian Wrote:
> 
> 
>> I am giving a keynote at an IEEE database conference in Helsinki next
>> week (http://icde2016.fi/).  (Yes, I am not attending PGCon Ottawa
>> because I accepted the Helsinki conference invitation before the PGCon
>> Ottawa date was changed from June to May).
>> 
>> As part of the keynote, I would like to mention areas where academia can
>> help us.  The topics I can think of are:
>> 
>>  Query optimization
>>  Optimizer statistics
>>  Indexing structures
>>  Reducing function call overhead
>>  CPU locality
>>  Sorting
>>  Parallelism
>>  Sharding
>> 
>> Any others?
> 
> How about?
> 1. Considering NUMA aware architecture.
> 2. Optimizer tuning as per new hardware trends.
> 3. More effective version of Join algorithms (e.g. Compare to traditional 
> "build and then probe" mechanism of Hash Join, now there is pipelining Hash 
> join where probe and build both happens together).

Interesting article about optimal joins: http://arxiv.org/pdf/1203.1952v1.pdf


> 
> Thanks and Regards,
> Kumar Rajeev Rastogi
> 
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Academic help for Postgres

2016-05-11 Thread Konstantin Knizhnik



On 11.05.2016 17:32, Bruce Momjian wrote:

On Wed, May 11, 2016 at 05:31:10PM +0300, Konstantin Knizhnik wrote:


On 11.05.2016 17:20, Bruce Momjian wrote:

I am giving a keynote at an IEEE database conference in Helsinki next
week (http://icde2016.fi/).  (Yes, I am not attending PGCon Ottawa
because I accepted the Helsinki conference invitation before the PGCon
Ottawa date was changed from June to May).

As part of the keynote, I would like to mention areas where academia can
help us.  The topics I can think of are:

Query optimization
Optimizer statistics
Indexing structures
Reducing function call overhead
CPU locality
Sorting
Parallelism
Sharding

Any others?


Incremental materialized views?

I don't know.  Is that something academics would research?


I am not sure.
There is definitely a question of which views can be incrementally
recalculated and which inductive extension has to be constructed to make
it possible. If you google for "incremental materialized views phd", you
will get a large number of references to articles.

But I do not know whether all questions in this area are already closed or not...


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Academic help for Postgres

2016-05-11 Thread Konstantin Knizhnik



On 11.05.2016 17:20, Bruce Momjian wrote:

I am giving a keynote at an IEEE database conference in Helsinki next
week (http://icde2016.fi/).  (Yes, I am not attending PGCon Ottawa
because I accepted the Helsinki conference invitation before the PGCon
Ottawa date was changed from June to May).

As part of the keynote, I would like to mention areas where academia can
help us.  The topics I can think of are:

Query optimization
Optimizer statistics
Indexing structures
Reducing function call overhead
CPU locality
Sorting
Parallelism
Sharding

Any others?


Incremental materialized views?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Konstantin Knizhnik



On 10.05.2016 20:26, Robert Haas wrote:

At this moment (February) them have implemented translation of only few
PostgreSQL operators used by ExecQuals  and do not support aggregates.
Them get about 2 times increase of speed at synthetic queries and 25%
increase at TPC-H Q1 (for Q1 most critical is generation of native code for
aggregates, because ExecQual itself takes only 6% of time for this query).
Actually these 25% for Q1 were achieved not by using dynamic code
generation, but switching from PULL to PUSH model in executor.
It seems to be yet another interesting PostgreSQL executor transformation.
As far as I know, them are going to publish result of their work to open
source...
Interesting.  You may notice that in "asynchronous mode" my prototype
works using a push model of sorts.  Maybe that should be taken
further.

Latest information from the ISP RAS guys: they have made good progress since
February: they have rewritten most of the methods of Scan, Aggregate and
Join to the LLVM API. Also they have implemented automatic translation of
PostgreSQL backend functions to the LLVM API.

As a result the time of the TPC-H Q1 query is reduced four times.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-11 Thread Konstantin Knizhnik



On 11.05.2016 17:00, Robert Haas wrote:

On Tue, May 10, 2016 at 3:42 PM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

Doesn't this actually mean that we need to have normal job scheduler which
is given queue of jobs and having some pool of threads will be able to
orginize efficient execution of queries? Optimizer can build pipeline
(graph) of tasks, which corresponds to execution plan nodes, i.e. SeqScan,
Sort, ... Each task is splitted into several jobs which can be concurretly
scheduled by task dispatcher.  So you will not have blocked worker waiting
for something and all system resources will be utilized. Such approach with
dispatcher allows to implement quotas, priorities,... Also dispatches can
care about NUMA and cache optimizations which is especially critical on
modern architectures. One more reference:
http://db.in.tum.de/~leis/papers/morsels.pdf

I read this as a proposal to redesign the entire optimizer and
executor to use some new kind of plan.  That's not a project I'm
willing to entertain; it is hard to imagine we could do it in a
reasonable period of time without introducing bugs and performance
regressions.  I think there is a great deal of performance benefit
that we can get by changing things incrementally.

Yes, I agree with you that a complete rewrite of the optimizer is a huge
project with unpredictable influence on the performance of some queries.
Changing things incrementally is a good approach, but only if we are
moving in the right direction.
I am still not sure that the introduction of async operations is a step in the
right direction. Async ops tend to significantly complicate code (since
you have to maintain the state yourself). It will be bad if the implementation
of each node has to deal with async state itself in its own manner.


My suggestion is to try to provide some generic mechanism for managing
state transitions and have some scheduler which controls this process. It
should not be the responsibility of the node implementation to organize
asynchronous/parallel execution. Instead it should just produce a
set of jobs whose execution is controlled by the scheduler. A first
implementation of the scheduler can be quite simple. But later it can become
more clever: try to bind data to processors and do many other
optimizations.




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread Konstantin Knizhnik

On 05/10/2016 08:26 PM, Robert Haas wrote:

On Tue, May 10, 2016 at 3:00 AM, konstantin knizhnik
<k.knizh...@postgrespro.ru> wrote:

What's wrong with it that worker is blocked? You can just have more workers
(more than CPU cores) to let other of them continue to do useful work.

Not really.  The workers are all running the same plan, so they'll all
make the same decision about which node needs to be executed next.  If
that node can't accommodate multiple processes trying to execute it at
the same time, it will have to block all of them but the first one.
Adding more processes just increases the number of processes sitting
around doing nothing.


Doesn't this actually mean that we need to have a normal job scheduler which is
given a queue of jobs and, having some pool of threads, will be able to organize
efficient execution of queries? The optimizer can build a pipeline (graph) of tasks
which corresponds to the execution plan nodes, i.e. SeqScan, Sort, ... Each task is
split into several jobs which can be concurrently scheduled by a task dispatcher.
So you will not have a blocked worker waiting for something, and all system
resources will be utilized. Such an approach with a dispatcher makes it possible to
implement quotas, priorities, ... Also the dispatcher can take care of NUMA and
cache optimizations, which is especially critical on modern architectures. One more
reference: http://db.in.tum.de/~leis/papers/morsels.pdf


Sorry, maybe I am wrong, but I still think that async ops are "multitasking for
the poor" :)
Yes, maintaining threads and especially separate processes adds significant
overhead. It leads to extra resource consumption, and context switches are quite
expensive. And I know from my own experience that replacing several concurrent
processes performing some IO (for example with sockets) with just one process using
multiplexing can increase performance. But still, async ops are just a way to make
the programmer responsible for managing the state machine instead of relying on the
OS to make context switches. A manual transmission is still more efficient than an
automatic transmission. But still most drivers prefer the latter ;)


Seriously, I carefully read your response to Kochei, but I am still not convinced
that async ops are what we need. Or maybe we just understand different things
by this notion.





But there are some researches, for example:

http://www.vldb.org/pvldb/vol4/p539-neumann.pdf

showing that the same or even better effect can be achieved by generation
native code for query execution plan (which is not so difficult now, thanks
to LLVM).
It eliminates interpretation overhead and increase cache locality.
I get similar results with my own experiments of accelerating SparkSQL.
Instead of native code generation I used conversion of query plans to C code
and experiment with different data representation. "Horisontal model" with
loading columns on demands shows better performance than columnar store.

Yes, I think this approach should also be considered.


At this moment (February) them have implemented translation of only few
PostgreSQL operators used by ExecQuals  and do not support aggregates.
Them get about 2 times increase of speed at synthetic queries and 25%
increase at TPC-H Q1 (for Q1 most critical is generation of native code for
aggregates, because ExecQual itself takes only 6% of time for this query).
Actually these 25% for Q1 were achieved not by using dynamic code
generation, but switching from PULL to PUSH model in executor.
It seems to be yet another interesting PostgreSQL executor transformation.
As far as I know, them are going to publish result of their work to open
source...

Interesting.  You may notice that in "asynchronous mode" my prototype
works using a push model of sorts.  Maybe that should be taken
further.




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] asynchronous and vectorized execution

2016-05-10 Thread konstantin knizhnik
Hi,

> 1. asynchronous execution,


It seems to me that asynchronous execution can be considered as an alternative to
the multithreading model (in the case of PostgreSQL the role of threads is played
by workers).
Async operations are used to have smaller overhead, but they have scalability
problems (because in most implementations of cooperative multitasking there is
just one processing thread, and so it cannot consume more than one core).

So I wonder whether asynchronous execution is trying to achieve the same goal
as parallel query execution but using a slightly different mechanism.
You wrote: 
> in the meantime, any worker that arrives at that scan node has no choice but 
> to block.

What's wrong with the worker being blocked? You can just have more workers
(more than CPU cores) to let others of them continue to do useful work.
But I agree that 
> Whether or not this will be efficient is a research question




> 2. vectorized execution

Vector IO is very important for a columnar store. In the IMCS extension (in-memory
columnar store) using vector operations increases speed 10-100 times,
depending on the size of the data set and the query. Obviously the best results are
for grand aggregates.

But there is some research, for example:

http://www.vldb.org/pvldb/vol4/p539-neumann.pdf

showing that the same or an even better effect can be achieved by generating
native code for the query execution plan (which is not so difficult now, thanks to
LLVM).
It eliminates interpretation overhead and increases cache locality.
I got similar results with my own experiments on accelerating SparkSQL. Instead
of native code generation I used conversion of query plans to C code and
experimented with different data representations. A "horizontal model" with loading
columns on demand shows better performance than a columnar store.

As far as I know, a native code generator is currently being developed for
PostgreSQL by ISP RAS.
Sorry, slides in Russian:
https://pgconf.ru/media/2016/02/19/6%20Мельник%20Дмитрий%20Михайлович,%2005-02-2016.pdf

At this moment (February) they have implemented translation of only a few
PostgreSQL operators used by ExecQuals and do not support aggregates.
They got about a 2 times increase of speed on synthetic queries and a 25% increase
at TPC-H Q1 (for Q1 the most critical part is generation of native code for
aggregates, because ExecQual itself takes only 6% of the time for this query).
Actually these 25% for Q1 were achieved not by using dynamic code generation,
but by switching from a PULL to a PUSH model in the executor.
It seems to be yet another interesting PostgreSQL executor transformation.
As far as I know, they are going to publish the result of their work as open
source...



On May 9, 2016, at 8:33 PM, Robert Haas wrote:

> Hi,
> 
> I realize that we haven't gotten 9.6beta1 out the door yet, but I
> think we can't really wait much longer to start having at least some
> discussion of 9.7 topics, so I'm going to go ahead and put this one
> out there.  I believe there are other people thinking about these
> topics as well, including Andres Freund, Kyotaro Horiguchi, and
> probably some folks at 2ndQuadrant (but I don't know exactly who).  To
> make a long story short, I think there are several different areas
> where we should consider major upgrades to our executor.  It's too
> slow and it doesn't do everything we want it to do.  The main things
> on my mind are:
> 
> 1. asynchronous execution, by which I mean the ability of a node to
> somehow say that it will generate a tuple eventually, but is not yet
> ready, so that the executor can go run some other part of the plan
> tree while it waits.  This case most obviously arises for foreign
> tables, where it makes little sense to block on I/O if some other part
> of the query tree could benefit from the CPU; consider SELECT * FROM
> lt WHERE qual UNION SELECT * FROM ft WHERE qual.  It is also a problem
> for parallel query: in a parallel sequential scan, the next worker can
> begin reading the next block even if the current block hasn't yet been
> received from the OS.  Whether or not this will be efficient is a
> research question, but it can be done.  However, imagine a parallel
> scan of a btree index: we don't know what page to scan next until we
> read the previous page and examine the next-pointer.  In the meantime,
> any worker that arrives at that scan node has no choice but to block.
> It would be better if the scan node could instead say "hey, thanks for
> coming but I'm really not ready to be on-CPU just at the moment" and
> potentially allow the worker to go work in some other part of the
> query tree.  For that worker to actually find useful work to do
> elsewhere, we'll probably need it to be the case either that the table
> is partitioned or the original query will need to involve UNION ALL,
> but those are not silly cases to worry about, particularly if we get
> native partitioning in 9.7.
> 
> 2. vectorized execution, by which I mean the ability of a node to
> return tuples in 

Re: [HACKERS] Batch update of indexes

2016-04-02 Thread Konstantin Knizhnik

On 04/02/2016 09:57 PM, Tom Lane wrote:

Konstantin Knizhnik <k.knizh...@postgrespro.ru> writes:

Attached please find patch for "ALTER INDEX ... WHERE ..." clause.
It is now able to handle all three possible situations:
1. Making index partial (add WHERE condition to the ordinary index)
2. Extend partial index range (less restricted index predicate)
3. Arbitrary change of partial index predicate

I've not been following this thread previously, but this proposal
scares me quite a lot.  I am certain there are places in our code
that assume that the properties of an index don't change after it's
been created.  One area that this almost certainly breaks is HOT updates:
adding a previously-unindexed column to an index predicate might break
existing HOT chains, and I see nothing in this patch that could deal
with that.  I seem to recall there are other places that would be
broken by changing an index's DDL definition after creation, but can't
recall specifics right now.

I am also, frankly, not seeing a use-case for this functionality that
would justify trying to find and remove those assumptions.

There's a lot of things I don't care for about the way the patch is
written, in particular its willingness to use SPI (which opens a lot of
potential for search-path problems, failure to see uncommitted tuples,
etc).  But we need not get to that if we don't believe the functionality
can work.


Thank you for the review, Tom.

I completely agree with all your arguments against this patch.
I have proposed this patch mostly as a proof of concept.
Yes, I have not taken HOT updates into account, and maybe there are other possible
issues which I have not considered.

The main question is whether the proposed way of batch update of indexes is
viable or whether it is a conceptually wrong approach
(because it breaks the assumption that index properties can't be changed, or because
it is not convenient to use...).
it is not convenient to use...).

I hope that everybody agrees that maintaining indexes is the main limiting
factor for insert speed.
If a table has no indexes, then insert speed can be as high as the disk write speed
(100MB/sec, or 1000 for SSD).
So if the record size is about 10 bytes, then we can get about 10 million TPS.
But the presence of indexes dramatically changes this picture: if the database is
large enough that even the index cannot fit in memory
and records are inserted in random key order, then each insert into the index will
require reading 3-4 pages from random locations on the disk.
With an average HDD positioning time of 10 msec, we get 100 reads per second and
... 20-30 TPS. And that is just with one index.
If we have 10 indexes, then TPS can be counted on the fingers of one hand.

Certainly this is a very pessimistic estimation.
But it is still true that we cannot provide good insert speed if we have to
update indexes immediately.
And without indexes we cannot efficiently execute most queries.

I do not see any way in Postgres to solve this problem now. The hack with
creating materialized views requires a lot of extra time and space.
It will not work for a really large table.

So we need some way to postpone insertion of new records into the index. Then we
can do such insertion in the background or in idle time (at night), and try to use
bulk insert if the index implementation supports it (for example, sorting records
by key before insertion can significantly increase locality and so improve the
speed of insertion into the index).

But the key point here is that such a delayed update of an index violates the main
RDBMS rule that the results of query execution with and without indexes should be
the same. The trick with partial indexes eliminates this contradiction, but it
requires more actions from the user. So are users ready to do some extra job just
for "ideological" reasons? Because if a user wants to have delayed update of
indexes, then he actually accepts that query results may not include some of the
most recent updates.
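To make the intended workflow concrete, here is a sketch using the syntax proposed
in this patch (the table and the predicate are made up):

-- Keep the index partial, covering only rows inserted so far.
CREATE INDEX trades_price_idx ON trades(price) WHERE added < '2016-04-01';

-- New rows with added >= '2016-04-01' do not match the predicate,
-- so they are inserted at full speed without touching the index.

-- Later, in background or at night, extend the predicate; only the
-- newly covered rows have to be added to the index.
ALTER INDEX trades_price_idx WHERE added < '2016-05-01';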


Another aspect is which database objects are allowed to be altered and which are
not. Right now with tables we can alter almost everything.
With indexes - almost nothing. It is assumed that an index can always be
reconstructed. But for a very big table, reconstruction of indexes from scratch
will take an unacceptable amount of time. So should we make it possible to alter
index characteristics which do not require rebuilding the index from scratch (and
that is definitely true for the partial index predicate)? Or is the price of
supporting it so high that it cannot be compensated by the obtained benefits?


So what do you think?
1. Should I continue work in this direction and fix all possible issues with
HOT updates, ... to make it possible to alter partial index predicates and
support batch inserts in this way?
2. Or is it better to just add an extra option to the index, allowing it to be slightly out-of-sync? It would allow, for example, eliminating the pending list for GIN, which can cause very significant degradation of query speed, while for mos

[HACKERS] Very small patch for decode.c

2016-03-30 Thread Konstantin Knizhnik

diff --git a/src/backend/replication/logical/decode.c 
b/src/backend/replication/logical/decode.c
index 2380ea2..a992662 100644
--- a/src/backend/replication/logical/decode.c
+++ b/src/backend/replication/logical/decode.c
@@ -488,7 +488,7 @@ DecodeCommit(LogicalDecodingContext *ctx, XLogRecordBuffer 
*buf,
 {
XLogRecPtr  origin_lsn = InvalidXLogRecPtr;
TimestampTz commit_time = parsed->xact_time;
-   XLogRecPtr  origin_id = XLogRecGetOrigin(buf->record);
+   RepOriginId origin_id = XLogRecGetOrigin(buf->record);
int i;

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] avg,first,last,median in one query

2016-03-24 Thread Konstantin Knizhnik

Hi, hackers.

I need advice from SQL experts: is there any way in PostgreSQL to
calculate the avg, first, last and median aggregates in one query?

Assume that we have the following table:

create table Securities ("Symbol" varchar, "Date" date, "Time" time, 
"Price" real);


We can simulate median using percentile_disc:

select "Symbol","Date",
avg("Price"),
percentile_disc(0.5) within group (order by "Price")
from Securities
group by "Symbol","Date";

And all other aggregates can be calculated using windows functions:

select distinct "Symbol","Date",
first_value("Price") over (partition by "Symbol","Date" order by 
"Time" rows between unbounded preceding and unbounded following),
last_value("Price") over (partition by "Symbol","Date" order by 
"Time" rows between unbounded preceding and unbounded following),
avg("Price") over (partition by "Symbol","Date" rows between 
unbounded preceding and unbounded following)

from Securities;

I wonder if there is any simpler or more efficient alternative to the query above?

But unfortunately it is not possible to calculate the median in such a way,
because percentile_disc is not compatible with OVER:


ERROR: OVER is not supported for ordered-set aggregate percentile_disc

So is there any chance to calculate all these four aggregates in one
query without writing some supplementary functions?
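One possible formulation I have been experimenting with (a sketch; it relies on
ordered array_agg to emulate first/last, so it materializes each group's prices
in memory):

select "Symbol", "Date",
       avg("Price")                                         as avg_price,
       (array_agg("Price" order by "Time"))[1]              as first_price,
       (array_agg("Price" order by "Time" desc))[1]         as last_price,
       percentile_disc(0.5) within group (order by "Price") as median_price
from Securities
group by "Symbol", "Date";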


Additional question: what is the most efficient way of calculating 
MEDIAN in PostgreSQL?

I found three different approaches:

1. Using CTE:

https://www.periscopedata.com/blog/medians-in-sql.html

2. Using a user-defined aggregate function which uses array_append and so
materializes all values in memory:


https://wiki.postgresql.org/wiki/Aggregate_Median

3. Using percentile aggregate:

http://blog.jooq.org/2015/01/06/how-to-emulate-the-median-aggregate-function-using-inverse-distribution-functions/


Thanks in advance,

--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Applying logical replication changes by more than one process

2016-03-22 Thread konstantin knizhnik

On Mar 22, 2016, at 11:14 AM, Petr Jelinek wrote:
> 
> And each slot means connection with logical decoding attached to it so you 
> don't really want to have thousands of those anyway. I think you'll hit other 
> problems faster than loop over slots becomes problem if you plan to keep all 
> of them active.


Assume that a cluster has thousands of nodes and we use sharding to scatter data
across the cluster nodes.
But to provide HA we want to perform sharding with some level of redundancy,
for example saving the same record at 3 different nodes.
One possible approach (pg_shard) is to execute the same query at three
different shards.
But there is no guarantee that the result of execution will be the same at all
nodes.
An alternative approach is to execute the transaction at one node and then
replicate it using logical replication to the replicas.
So we do not perform logical replication to all 1000 nodes, just to 2 of them.
But each time it will be a different pair of nodes. So we still need to have 1000
active replication slots.

Maybe logical replication cannot be used at all in such a scenario - I have not
thought much about it yet. Our first step will be multimaster without sharding.



> 
> -- 
>  Petr Jelinek  http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
> 
> 
> -- 
> Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-hackers



Re: [HACKERS] Applying logical replication changes by more than one process

2016-03-22 Thread konstantin knizhnik

On Mar 22, 2016, at 10:10 AM, Craig Ringer wrote:

> On 22 March 2016 at 14:32, konstantin knizhnik <k.knizh...@postgrespro.ru> 
> wrote:
>  
>> Ah you mean because with wal_log=true the origin advance is in different WAL 
>> record than commit? OK yeah you might be one transaction behind then, true.
> 
> It actually means that we can not enforce database consistency. If we do 
> replorigin_advance  before commit and then crash happen, then we will loose 
> some changes.
> If we call replorigin_advance after commit but crash happen before, then some 
> changes can be applied multiple times. For example we can insert some record 
> twice (if there are no unique constraints).
> Look likes the only working scenario is to setup replication session for each 
> commit and use locking to prevent concurrent session setup for the same slot 
> by multiple process,  doesn't it?
> 
> Yes.
> 
> How would you expect it to work if you attempted to replorigin_advance 
> without a session? From multiple concurrent backends?

It would not work. But I wonder why I would need to call replorigin_advance
without a session.
Please excuse me, I am not thinking now about the general case of using logical
replication; I am just focused on multimaster.
What I need is some efficient, durable and atomic mechanism for applying 
changes.
I see only two ways to provide atomicity:
1. Tracking of origins should be done inside xact as part of normal commit.
2. Use custom WAL records.

1) is easier now and it really works if I correctly synchronize access to 
slots. And surprisingly it even doesn't add substantial overhead.

> 
> Parallel apply is complicated business. You have to make sure you apply xacts 
> in an order that's free from deadlocks and from insert/delete anomalies - 
> though you can at least detect those, ERROR that xact and all subsequent 
> ones, and retry.

Well, this is exactly what our multimaster does. We do not try to enforce the order
of applying xacts. But we detect global deadlocks and use 2PC to provide data
consistency.
So it is not a task of logical replication; it is done by the DTM, overriding
visibility checks and the transaction commit protocol using XTM.


> For progress tracking to be consistent and correct you'd have to make sure 
> you committed strictly in the same order as upstream. Just before each commit 
> you can set the origin LSN and advance the replication origin, which will 
> commit atomically along with the commit it confirms. I don't really see the 
> problem.

Sorry, I do not completely understand you. What do you mean by "will commit
atomically along with the commit it confirms"? How will this atomicity be
enforced?

>  
> I have tried it, fortunately it doesn't cause any noticeable performance 
> degradation. But unfortunately  can't consider such approach as elegant.
> Why it is actually necessary to bind replication slot to process? Why it is 
> not possible to have multiple concurrent sessions for the same slot?
> 
> Especially since most slot changes LWLock- and/or spinlock-protected already.
> 
> The client would have to manage replay confirmations appropriately so that it 
> doesn't confirm past the point where some other connection still needs it.
> 
> We'd have to expose a "slot" column in pg_stat_replication and remove the 
> "pid" column from pg_replication_slots to handle the 1:n relationship between 
> slot clients and slots, and it'd be a pain to show which normal user backends 
> were using a slot. Not really sure how to handle that.
> 
> To actually make this useful would require a lot more though. A way to 
> request that replay start from a new LSN without a full disconnect/reconnect 
> each time. Client-side parallel consume/apply. Inter-transaction ordering 
> information so the client can work out a viable xact apply order (possibly 
> using SSI information per the discussion with Kevin?). Etc.
> 
> I haven't really looked into this and I suspect there are some hairy areas 
> involved in replaying a slot from more than one client. The reason I'm 
> interested in it personally is for initial replica state setup as Oleksandr 
> prototyped and described earlier. We could attach to the slot's initial 
> snapshot then issue a new replication command that, given a table name or 
> oid, scans the table from the snapshot and passes each tuple to a new 
> callback (like, but not the same as, the insert callback) on the output 
> plugin.
> 
> That way clients could parallel-copy the initial state of the DB across the 
> same replication protocol they then consume new changes from, with no need to 
> make normal libpq connections and COPY initial state.
> 
> I'm interested in being able to do parallel receive of new changes from the 
> slot

Re: [HACKERS] Applying logical replication changes by more than one process

2016-03-22 Thread konstantin knizhnik

On Mar 21, 2016, at 4:30 PM, Petr Jelinek wrote:

> On 21/03/16 14:25, Andres Freund wrote:
>> On 2016-03-21 14:18:27 +0100, Petr Jelinek wrote:
>>> On 21/03/16 14:15, Andres Freund wrote:
> Only when the origin is actually setup for the current session. You
> need
> to call the replorigin_advance yourself from your apply code.
 
 That's problematic from a durability POV.
 
>>> 
>>> Huh? How come?
>> 
>> If you use the session mechanism the replication progress is synced with
>> the apply process, even if there are crashes. Crash recovery updates the
>> progress.  There's no such interlock with apply otherwise, and I don't
>> see how you can build one with reasonable effort.
>> 
> 
> Ah you mean because with wal_log=true the origin advance is in different WAL 
> record than commit? OK yeah you might be one transaction behind then, true.

It actually means that we cannot enforce database consistency. If we do
replorigin_advance before commit and then a crash happens, then we will lose
some changes.
If we call replorigin_advance after commit but the crash happens before it, then
some changes can be applied multiple times. For example, we can insert some record
twice (if there are no unique constraints).
Looks like the only working scenario is to set up a replication session for each
commit and use locking to prevent concurrent session setup for the same slot by
multiple processes, doesn't it?
I have tried it; fortunately it doesn't cause any noticeable performance
degradation. But unfortunately I can't consider such an approach elegant.
Why is it actually necessary to bind a replication slot to a process? Why is it not
possible to have multiple concurrent sessions for the same slot?
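For reference, the per-commit session dance I am talking about looks roughly like
this (a simplified sketch, not the actual multimaster code; MtmOriginLock and
apply_remote_xact are hypothetical names):

#include "postgres.h"
#include "access/xact.h"
#include "replication/origin.h"
#include "storage/lwlock.h"

static LWLock *MtmOriginLock;	/* hypothetical lock allocated at startup */

/* Apply one remote transaction in a pool worker. */
static void
apply_remote_xact(RepOriginId origin, XLogRecPtr origin_lsn)
{
	StartTransactionCommand();

	/* ... apply the decoded changes of the remote transaction here ... */

	/* Only one worker may have the origin session acquired at a time. */
	LWLockAcquire(MtmOriginLock, LW_EXCLUSIVE);
	replorigin_session_setup(origin);
	replorigin_session_origin = origin;
	replorigin_session_origin_lsn = origin_lsn;

	/* Commit records the changes and the origin progress atomically. */
	CommitTransactionCommand();

	replorigin_session_reset();
	replorigin_session_origin = InvalidRepOriginId;
	LWLockRelease(MtmOriginLock);
}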

Also I am concerned about using sequential search for slot location in
replorigin_session_setup and many other functions - there is a loop through all
max_replication_slots entries.
It does not seem to be a problem when the number of slots is less than 10. For
multimaster this assumption is true - even Oracle RAC rarely has a two-digit
number of nodes.
But if we want to perform sharding and use logical replication for providing
redundancy, then the number of nodes and slots can be substantially larger.
I have not thought much about such a configuration - maybe it is possible to
propose a more efficient mechanism for replication in this case.






> 
> -- 
>  Petr Jelinek  http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Applying logical replication changes by more than one process

2016-03-21 Thread Konstantin Knizhnik



On 21.03.2016 15:10, Petr Jelinek wrote:

Hi,

On 19/03/16 11:46, Konstantin Knizhnik wrote:

Hi,

I am trying to use logical replication mechanism in implementation of
PostgreSQL multimaster and faced with one conceptual problem.
Originally logical replication was intended to support asynchronous
replication. In this case applying changes by single process should not
be a bottleneck.
But if we are using distributed transaction manager to provide global
consistency, then applying transaction by one worker  leads to very bad
performance and what is worser: cause unintended serialization of
transactions, which is not taken in account by distributed deadlock
detection algorithm and so can cause
undetected deadlocks.

So I have implemented pool of background workers which can apply
transactions concurrently.
It works and shows acceptable performance. But now I am thinking about
HA and tracking origin LSNs which are needed to correctly specify slot
position in case of recovery. And there is a problem: as far as I
understand to correctly record origin LSN in WAL and advance slot
position it is necessary to setup session
using replorigin_session_setup. It is not so convenient in case of using
pool of background workers, because we have to setup session for each
commit.
But the main problem is that for each slot session can be associated
only with one process:

 else if (curstate->acquired_by != 0)
 {
 ereport(ERROR,
 (errcode(ERRCODE_OBJECT_IN_USE),
  errmsg("replication identifier %d is already active for
PID %d",
 curstate->roident, curstate->acquired_by)));
 }

Which once again means that there can be only one process applying 
changes.




That's not true, all it means is that you can do 
replorigin_session_setup for same origin only in one process but you 
don't need to have it setup for session to update it, the 
replorigin_advance() works just fine.


But RecordTransactionCommit is using replorigin_session_advance, not
replorigin_advance.

And replorigin_session_advance requires that the session was set up:

void
replorigin_session_advance(XLogRecPtr remote_commit, XLogRecPtr local_commit)
{
	Assert(session_replication_state != NULL);
}

"session_replication_state" is private variable which is set by 
replorigin_session_setup.
But attempt to call replorigin_session_setup from multiple process cause 
above error.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Applying logical replication changes by more than one process

2016-03-19 Thread Konstantin Knizhnik

Hi,

I am trying to use the logical replication mechanism in the implementation of
PostgreSQL multimaster and have faced one conceptual problem.
Originally logical replication was intended to support asynchronous
replication. In this case applying changes by a single process should not be a
bottleneck.
But if we are using a distributed transaction manager to provide global
consistency, then applying transactions by one worker leads to very bad
performance and, what is worse, causes unintended serialization of transactions,
which is not taken into account by the distributed deadlock detection algorithm
and so can cause

undetected deadlocks.

So I have implemented a pool of background workers which can apply transactions
concurrently.
It works and shows acceptable performance. But now I am thinking about HA and
tracking origin LSNs, which are needed to correctly specify the slot position in
case of recovery. And there is a problem: as far as I understand, to correctly
record the origin LSN in WAL and advance the slot position it is necessary to set
up a session

using replorigin_session_setup. It is not so convenient in the case of using a
pool of background workers, because we have to set up the session for each commit.
But the main problem is that for each slot a session can be associated with only
one process:

else if (curstate->acquired_by != 0)
{
ereport(ERROR,
(errcode(ERRCODE_OBJECT_IN_USE),
 errmsg("replication identifier %d is already active for PID %d",
curstate->roident, curstate->acquired_by)));
}

Which once again means that there can be only one process applying changes.

To provide a correct state of the replication node it is necessary to enforce that each logical replication record is replayed exactly once: we should not lose some change or try to apply it twice. So the operation of recording the origin LSN position in WAL and
adjusting the slot should be atomic. And during recovery we should restore the slot's current position based on the origin values extracted from WAL. I wonder if this can be done using the current logical replication mechanism when changes of each slot are applied by
more than one process? Or is the only alternative to write/read origin LSNs in WAL myself, for example using custom WAL records?
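
For the recovery side, a minimal sketch (the origin name is a made-up example) of how a worker pool could find out where to restart using the existing origin machinery:

#include "postgres.h"
#include "access/xlogdefs.h"
#include "replication/origin.h"

/* Look up the last recorded remote LSN for the given origin after a restart. */
static XLogRecPtr
get_restart_position(char *origin_name)	/* e.g. "mm_node_2" -- hypothetical */
{
	RepOriginId origin = replorigin_by_name(origin_name, false /* missing_ok */ );

	/* With flush = false we do not wait for the position to be flushed locally. */
	return replorigin_get_progress(origin, false);
}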


Thanks in advance!

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Batch update of indexes

2016-03-14 Thread Konstantin Knizhnik

Hi David,

Rebased patch is attached.

On 14.03.2016 15:09, David Steele wrote:

Hi Konstantin,

On 2/3/16 11:47 AM, Konstantin Knizhnik wrote:

Attached please find a patch for the "ALTER INDEX ... WHERE ..." clause.
It is now able to handle all three possible situations:
1. Making index partial (add WHERE condition to the ordinary index)
2. Extend partial index range (less restricted index predicate)
3. Arbitrary change of partial index predicate

In case 2) new records are added to the index.
In the other two cases the index is completely reconstructed.

This patch includes the src/bin/insbench utility for testing insert
performance. It can be easily excluded from the patch to reduce its size.
Also it is better to apply this patch together with "index-only scans
with partial indexes" patch:


This patch no longer applies on master:

$ git apply ../other/alter-index.patch
../other/alter-index.patch:27: trailing whitespace.
static void
../other/alter-index.patch:62: indent with spaces.
SPIPlanPtr plan;
../other/alter-index.patch:63: indent with spaces.
Portal portal;
../other/alter-index.patch:90: trailing whitespace.

../other/alter-index.patch:99: trailing whitespace.

error: patch failed: src/test/regress/expected/aggregates.out:831
error: src/test/regress/expected/aggregates.out: patch does not apply

Please provide a new patch for review.  Meanwhile I am marking this 
"waiting on author".


Thanks,


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/ref/alter_index.sgml b/doc/src/sgml/ref/alter_index.sgml
index ee3e3de..ab042ed 100644
--- a/doc/src/sgml/ref/alter_index.sgml
+++ b/doc/src/sgml/ref/alter_index.sgml
@@ -27,6 +27,7 @@ ALTER INDEX [ IF EXISTS ] name SET
 ALTER INDEX [ IF EXISTS ] name RESET ( storage_parameter [, ... ] )
 ALTER INDEX ALL IN TABLESPACE name [ OWNED BY role_name [, ... ] ]
 SET TABLESPACE new_tablespace [ NOWAIT ]
+ALTER INDEX name WHERE predicate 
 
  
 
@@ -109,6 +110,18 @@ ALTER INDEX ALL IN TABLESPACE name
 

 
+   
+WHERE predicate
+
+ 
+  Add or change predicate of partial index. Extending partial index predicate allows to implement batch update of index and so 
+  increase insert speed. New records (not matching index predicate) can be added to the table at maximal speed without affecting indexes.
+  Later, in background, indexes can be refreshed using ALTER INDEX ... WHERE ... clause.
+  See  for more discussion.
+ 
+
+   
+
   
   
 
diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index 13b04e6..a63de2a 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -34,6 +34,7 @@
 #include "commands/tablespace.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "funcapi.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/planner.h"
@@ -52,6 +53,9 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
+#include "utils/ruleutils.h"
+#include "executor/executor.h"
+#include "executor/spi.h"
 
 
 /* non-export function prototypes */
@@ -280,6 +284,160 @@ CheckIndexCompatible(Oid oldId,
 	return ret;
 }
 
+static void
+UpdateIndex(Oid indexRelationId, Node* whereClause)
+{
+	Datum		values[Natts_pg_index];
+	bool		isnull[Natts_pg_index];
+	HeapTuple   oldTuple;
+	HeapTuple   newTuple;
+	Relation	pg_index;
+
+	pg_index = heap_open(IndexRelationId, RowExclusiveLock);
+	oldTuple = SearchSysCacheCopy1(INDEXRELID, ObjectIdGetDatum(indexRelationId));
+	if (!HeapTupleIsValid(oldTuple))
+		elog(ERROR, "cache lookup failed for index %u", indexRelationId);
+
+	heap_deform_tuple(oldTuple, RelationGetDescr(pg_index), values, isnull);
+	values[Anum_pg_index_indpred - 1] = CStringGetTextDatum(nodeToString(whereClause));
+	isnull[Anum_pg_index_indpred - 1] = false;
+	newTuple = heap_form_tuple(RelationGetDescr(pg_index), values, isnull);
+	simple_heap_update(pg_index, &oldTuple->t_self, newTuple);
+	CatalogUpdateIndexes(pg_index, newTuple);
+	heap_freetuple(newTuple);
+	heap_freetuple(oldTuple);
+	heap_close(pg_index, NoLock);
+}
+
+void
+AlterIndex(Oid indexRelationId, IndexStmt *stmt)
+{
+	char* select;
+	Oid heapRelationId;
+	IndexUniqueCheck checkUnique;
+	Datum		values[INDEX_MAX_KEYS];
+	bool		isnull[INDEX_MAX_KEYS];
+	Relation heapRelation;
+	Relation indexRelation;
+SPIPlanPtr plan;
+Portal portal;
+	HeapTuple tuple;
+	TupleTableSlot *slot;
+	ItemPointer tupleid;
+	IndexInfo  *indexInfo;
+	EState *estate;
+	Oid	namespaceId;
+	List*   deparseCtx;
+	char*   oldIndexPredicate;
+	char*   newIndexPredicate;
+	char*   relationName;
+
+	Assert(stmt->whereClause);
+	CheckPredicate((Expr *) stmt->whereClause);
+
+	/* Open and lock the pare

Re: [HACKERS] eXtensible Transaction Manager API (v2)

2016-03-11 Thread Konstantin Knizhnik
ter, so we have to introduce an
"in-doubt" state of transaction, when it is not known whether the transaction is
committed or aborted, and any other transaction accessing tuples updated by this
transaction has to wait while its status is "in-doubt".
The main challenge of pg_tsdtm is to make this period as short as possible...

But these are details of a particular implementation which IMHO have no relation to
the API itself.




3. Uh, how can you hook GetNewTransactionId but not ReadNewTransactionId?


Uh-uh-uh:)
ReadNewTransactionId just reads the value of ShmemVariableCache->nextXid,
but unfortunately it is not the only point where nextXid is used - there are
about a hundred occurrences of nextXid in the Postgres core.
This is why we decided that GetNewTransactionId should actually update
ShmemVariableCache->nextXid, so that
there is no need to rewrite all this code.
Sorry, but IMHO it is a problem of Postgres design and not of XTM;)
We just want to find some compromise which allows XTM to be flexible enough but
minimizes changes in core code.
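
To illustrate, a minimal sketch (the arbiter call is a stand-in, not a real function) of such a plugged-in GetNewTransactionId keeping ShmemVariableCache->nextXid in sync so that the existing readers of nextXid keep working:

#include "postgres.h"
#include "access/transam.h"
#include "storage/lwlock.h"

/* Stand-in for a DTM-specific call that obtains a cluster-wide XID. */
static TransactionId
ArbiterGetGlobalXid(void)
{
	return FirstNormalTransactionId;	/* placeholder */
}

static TransactionId
DtmGetNewTransactionId(bool isSubXact)
{
	TransactionId xid = ArbiterGetGlobalXid();

	(void) isSubXact;			/* the sketch ignores subtransaction handling */

	/* Keep the shared counter ahead of any XID we hand out. */
	LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
	if (TransactionIdFollowsOrEquals(xid, ShmemVariableCache->nextXid))
	{
		ShmemVariableCache->nextXid = xid;
		TransactionIdAdvance(ShmemVariableCache->nextXid);
	}
	LWLockRelease(XidGenLock);

	return xid;
}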


4. There seems to be an intention to encapsulate snapshots, but surely
wrapping hooks around GetSnapshotData and XidInMVCCSnapshot is not nearly
enough for that.  Look at all the knowledge snapmgr.c has about snapshot
representation, for example.  And is a function like GetOldestXmin even
meaningful with a different notion of what snapshots are?  (For that
matter, is TransactionId == uint32 still tenable for any other notion
of snapshots?)


XTM encapsulation of snapshots allows us to implement pg_dtm.
It does almost the same as the Postgres-XL GTM, but without a huge amount of #ifdefs.

Representation of XID is yet another compromise point: we do not want to change
the tuple header format.
So XID is still 32 bit and has the same meaning as in PostgreSQL. If a custom
TM implementation wants to use some other identifiers of transactions,
like CSNs in pg_tsdtm, it has to provide a mapping between them and XIDs.
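
For example, a minimal sketch (not the pg_tsdtm code) of such a mapping kept in a dynahash table; a real implementation would place it in shared memory and add locking:

#include "postgres.h"
#include "utils/hsearch.h"

typedef uint64 GlobalCSN;		/* assumption: a CSN fits in 64 bits */

typedef struct Xid2CsnEntry
{
	TransactionId xid;			/* hash key */
	GlobalCSN	csn;
} Xid2CsnEntry;

static HTAB *xid2csn;

static void
xid2csn_init(void)
{
	HASHCTL		ctl;

	memset(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(TransactionId);
	ctl.entrysize = sizeof(Xid2CsnEntry);
	xid2csn = hash_create("xid to csn map", 1024, &ctl,
						  HASH_ELEM | HASH_BLOBS);
}

static void
xid2csn_set(TransactionId xid, GlobalCSN csn)
{
	bool		found;
	Xid2CsnEntry *entry;

	entry = hash_search(xid2csn, &xid, HASH_ENTER, &found);
	entry->csn = csn;
}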




5. BTW, why would you hook at XidInMVCCSnapshot rather than making use of
the existing capability to have a custom SnapshotSatisfiesFunc snapshot
checker function?


The HeapTupleSatisfies* routines in utils/time/tqual.c implement a lot of logic for
handling different kinds of snapshots, checking/setting hint bits in tuples,
caching, ... We do not want to replace or just cut all this code in a DTM
implementation.
And XidInMVCCSnapshot is the common function finally used by most
HeapTupleSatisfies* functions once all other checks have passed.
So it is really the most convenient place to plug in custom visibility checking
rules. And as far as I remember a similar approach was used in Postgres-XL.
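
For illustration, a minimal sketch (all helper names are hypothetical stand-ins, not real functions) of a CSN-based check plugged in at this level; it returns true when the xid must still be treated as in progress (i.e. invisible) under the given snapshot:

#include "postgres.h"
#include "utils/snapshot.h"

typedef uint64 GlobalCSN;

/* Hypothetical stand-ins for what a DTM extension would provide. */
static GlobalCSN LookupGlobalCsn(TransactionId xid) { return 0; }
static GlobalCSN SnapshotGlobalCsn(Snapshot snapshot) { return 0; }
static bool CoreXidInMVCCSnapshot(TransactionId xid, Snapshot snapshot) { return false; }

static bool
DtmXidInMVCCSnapshot(TransactionId xid, Snapshot snapshot)
{
	GlobalCSN	csn = LookupGlobalCsn(xid);

	/* Purely local transaction: fall back to the standard snapshot rule. */
	if (csn == 0)
		return CoreXidInMVCCSnapshot(xid, snapshot);

	/* Global transaction: invisible unless it committed before our snapshot. */
	return csn >= SnapshotGlobalCsn(snapshot);
}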




IMO this is not committable as-is, and I don't think that it's something
that will become committable during this 'fest.  I think we'd be well
advised to boot it to the 2016-09 CF and focus our efforts on other stuff
that has a better chance of getting finished this month.
regards, tom lane



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimizer questions

2016-03-10 Thread konstantin knizhnik

On Mar 10, 2016, at 1:56 AM, Tom Lane wrote:

> Konstantin Knizhnik <k.knizh...@postgrespro.ru> writes:
>> I think that the best approach is to generate two different paths: 
>> original one, when projection is always done before sort and another one 
>> with postponed projection of non-trivial columns. Then we compare costs 
>> of two paths and choose the best one.
>> Unfortunately, I do not understand now how to implement it with existed 
>> grouping_planner.
>> Do you think that it is possible?
> 
> After fooling with this awhile, I don't think it's actually necessary
> to do that.  See attached proof-of-concept patch.

O, you did all my work:)

But right now the cost estimation rule makes it impossible to apply this
optimization to simple expressions like this:

postgres=# create table a (b integer);
CREATE TABLE
postgres=# insert into a values (generate_series(1, 10));
INSERT 0 10
postgres=# select b+b from a order by b limit 1;
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
NOTICE:  int4pl
 ?column? 
--
2
(1 row)
postgres=# create or replace function twice(x integer) returns integer as $$ 
begin raise notice 'exec function' ; return x + x ; end $$ language plpgsql;
CREATE FUNCTION
postgres=# select twice(b) from a order by b limit 1;   
NOTICE:  exec function
NOTICE:  int4pl
 twice 
---
 2
(1 row)


I wonder whether there is any advantage in earlier evaluation of such simple
expressions if they are not needed for the sort?
It seems to be true only for narrowing functions, like md5... But I think that
it is quite an exotic case, isn't it?
Maybe it is reasonable to be more optimistic when estimating the cost of postponed
expression evaluation?


Also, I do not completely understand your concern about window functions.
Is there any example of a query with window or aggregate functions where this
optimization (postponing expression evaluation) can be applied?
It will also be interesting to me to know if there are some other cases (except
SORT+LIMIT) where delaying projection leads to a more efficient plan.



> 
> Although this patch gets through our regression tests, that's only because
> it's conservative about deciding to postpone evaluation; if it tried to
> postpone evaluation in a query with window functions, it would fail
> miserably, because pull_var_clause doesn't know about window functions.
> I think that that's a clear oversight and we should extend it to offer
> the same sorts of behaviors as it does for Aggrefs.  But that would be
> slightly invasive, there being a dozen or so callers; so I didn't bother
> to do it yet pending comments on this patch.
> 
> I think it's probably also broken for SRFs in the tlist; we need to work
> out what semantics we want for those.  If we postpone any SRF to after
> the Sort, we can no longer assume that a query LIMIT enables use of
> bounded sort (because the SRF might repeatedly return zero rows).
> I don't have a huge problem with that, but I think now would be a good
> time to nail down some semantics.
> 
> Some other work that would be useful would be to refactor so that the
> cost_qual_eval operations aren't so redundant.  But that's just a small
> time savings not a question of functionality.
> 
> And we'd have to document that this changes the behavior for volatile
> functions.  For the better, I think: this will mean that you get
> consistent results no matter whether the query is implemented by
> indexscan or seqscan-and-sort, which has never been true before.
> But it is a change.
> 
> Do people approve of this sort of change in general, or this patch
> approach in particular?  Want to bikeshed the specific
> when-to-postpone-eval policies implemented here?  Other comments?
> 
>   regards, tom lane
> 
> diff --git a/src/backend/optimizer/plan/planner.c 
> b/src/backend/optimizer/plan/planner.c
> index 8937e71..b15fca1 100644
> *** a/src/backend/optimizer/plan/planner.c
> --- b/src/backend/optimizer/plan/planner.c
> *** static RelOptInfo *create_distinct_paths
> *** 126,131 
> --- 126,132 
> RelOptInfo *input_rel);
>  static RelOptInfo *create_ordered_paths(PlannerInfo *root,
>RelOptInfo *input_rel,
> +  PathTarget *target,
>double limit_tuples);
>  static PathTarget *make_group_input_target(PlannerInfo *root, List *tlist);
>  static List *postprocess_setop_tlist(List *new_tlist, List *orig_tlist);
> *** static PathTarget *make_window_input_tar
> *** 134,139 
> --- 135,142 
>

Re: [HACKERS] Optimizer questions

2016-03-09 Thread Konstantin Knizhnik



On 09.03.2016 09:15, Tom Lane wrote:

I wrote:

BTW, there's some additional refactoring I had had in mind to do in
grouping_planner to make its handling of the targetlist a bit more
organized; in particular, I'd like to see it using PathTarget
representation more consistently throughout the post-scan-join steps.

See 51c0f63e4d76a86b44e87876a6addcfffb01ec28 --- I think this gets
things to where we could plug in additional levels of targets without
too much complication.

regards, tom lane


So, if I correctly understand you, there are two major concerns:

1. Volatile functions. I wonder if it is really required to re-evaluate a
volatile function for each record even if a LIMIT clause is present?

The documentation says:
"A query using a volatile function will re-evaluate the function at
every row where its value is needed."
So if we are using sort with limit and the value of a function is not used in
the sort, then it is correct to say that the value of this function is not
needed, so there is no need to re-evaluate it, isn't it?


2. Narrowing functions, like md5.
Here I do not have any good idea how to support them. It looks like the cost of
SORT should depend on tuple width. Only in this case can the optimizer
determine whether it is more efficient to evaluate a function earlier or
postpone its execution.


I think that the best approach is to generate two different paths: the
original one, where projection is always done before the sort, and another one
with postponed projection of non-trivial columns. Then we compare the costs
of the two paths and choose the best one.
Unfortunately, I do not understand now how to implement it with the existing
grouping_planner.

Do you think that it is possible?

An alternative approach is to do something like in my proposed patch, but
take into account the cost of function execution and check for the presence of
volatile/narrowing functions. This approach provides better flexibility,
because we can choose the subset of columns not used in the sort whose
evaluation should be postponed. But here we once again make a local
decision while constructing the path instead of comparing the costs of
full paths.





--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Optimizer questions

2016-03-08 Thread Konstantin Knizhnik

On 03/08/2016 07:01 AM, Tom Lane wrote:

Konstantin Knizhnik <k.knizh...@postgrespro.ru> writes:

Attached please find improved version of the optimizer patch for LIMIT clause.

This patch isn't anywhere close to working after 3fc6e2d7f5b652b4.
(TBH, the reason I was negative about this upthread is that I had that
one in the oven and knew it would conflict spectacularly.)  I encourage
you to think about how an optimization of this sort could be made to
work in a non-kluge fashion in the new code structure.

I've not spent a lot of time on this, but I think maybe what would make
sense is to consider both the case where function calculations are
postponed to after ORDER BY and the case where they aren't, and generate
Paths for both.  Neither approach is a slam-dunk win.  For example,
suppose that one of the tlist columns is md5(wide_column) --- it will
likely not be preferable to pass the wide column data through the sort
step rather than reducing it to a hash first.  This would require some
work in grouping_planner to track two possible pathtargets, and work in
create_ordered_paths to generate paths for both possibilities.  A possible
objection is that this would add planning work even when no real benefit
is possible; so maybe we should only consider the new way if the tlist has
significant eval cost?  Not sure about that.  There is also something
to be said for the idea that we should try to guarantee consistent
semantics when the tlist contains volatile functions.

For now, I've set this commitfest entry to Waiting on Author.  There's
still time to consider a rewrite in this 'fest, if you can get it done
in a week or two.

regards, tom lane


Attached please find a rebased patch.
Unfortunately 3fc6e2d7f5b652b4 still doesn't fix the problem with "lazy"
evaluation of the target list.
This is why my patch is still useful. But frankly speaking I am not sure that
it is the best way of fixing this problem,
because it takes into account only one case: sort+limit. Maybe the same
optimization can be useful for other queries.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 5fc8e5b..709d1ad 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -126,8 +126,9 @@ static RelOptInfo *create_ordered_paths(PlannerInfo *root,
 	 RelOptInfo *input_rel,
 	 double limit_tuples);
 static PathTarget *make_scanjoin_target(PlannerInfo *root, List *tlist,
-	 AttrNumber **groupColIdx);
+		AttrNumber **groupColIdx, bool* splitted_projection);
 static int	get_grouping_column_index(Query *parse, TargetEntry *tle);
+static int	get_sort_column_index(Query *parse, TargetEntry *tle);
 static List *postprocess_setop_tlist(List *new_tlist, List *orig_tlist);
 static List *select_active_windows(PlannerInfo *root, WindowFuncLists *wflists);
 static List *make_windowInputTargetList(PlannerInfo *root,
@@ -1381,6 +1382,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 	RelOptInfo *current_rel;
 	RelOptInfo *final_rel;
 	ListCell   *lc;
+	bool splitted_projection = false;
 
 	/* Tweak caller-supplied tuple_fraction if have LIMIT/OFFSET */
 	if (parse->limitCount || parse->limitOffset)
@@ -1657,7 +1659,7 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 		 * that were obtained within query_planner().
 		 */
 		sub_target = make_scanjoin_target(root, tlist,
-										  &groupColIdx);
+										  &groupColIdx, &splitted_projection);
 
 		/*
 		 * Forcibly apply that tlist to all the Paths for the scan/join rel.
@@ -1801,6 +1803,13 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
 	{
 		Path	   *path = (Path *) lfirst(lc);
 
+		if (splitted_projection)
+		{			
+			path = apply_projection_to_path(root, current_rel,
+			path, create_pathtarget(root, tlist));
+		}
+
+
 		/*
 		 * If there is a FOR [KEY] UPDATE/SHARE clause, add the LockRows node.
 		 * (Note: we intentionally test parse->rowMarks not root->rowMarks
@@ -3775,15 +3784,17 @@ create_ordered_paths(PlannerInfo *root,
 static PathTarget *
 make_scanjoin_target(PlannerInfo *root,
 	 List *tlist,
-	 AttrNumber **groupColIdx)
+	 AttrNumber **groupColIdx,
+	 bool* splitted_projection)
 {
 	Query	   *parse = root->parse;
-	List	   *sub_tlist;
-	List	   *non_group_cols;
+	List	   *sub_tlist = NIL;
+	List	   *non_group_cols = NIL;
 	List	   *non_group_vars;
 	int			numCols;
 
 	*groupColIdx = NULL;
+	*splitted_projection = false;
 
 	/*
 	 * If we're not grouping or aggregating or windowing, there's nothing to
@@ -3791,14 +3802,66 @@ make_scanjoin_target(PlannerInfo *root,
 	 */
 	if (!parse->hasAggs && !parse->groupClause && !parse->groupingSets &&
 		!root->hasHavingQual && !parse->hasWindowFuncs)
+	{
+		if (parse->sortClause && l

Re: [HACKERS] The plan for FDW-based sharding

2016-03-07 Thread Konstantin Knizhnik
is especially true for DTM, because the requirements of various cluster solutions
are very different.
And the most convenient way of doing it is to ship DTM as an extension, not as
some fork of Postgres.
That will greatly simplify using it.



Now, on connection pooling, I am similarly not opposed to
having some well-designed hooks, but I also think in the long run it
would be better for some improvements in this area to be part of core.
None of that means I would support any particular hook proposal, of
course.




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] The plan for FDW-based sharding

2016-03-02 Thread Konstantin Knizhnik



On 01.03.2016 22:02, Bruce Momjian wrote:

On Tue, Mar  1, 2016 at 07:56:58PM +0100, Petr Jelinek wrote:

Note that I am not saying that other discussed approaches are any
better, I am saying that we should know approximately what we
actually want and not just beat FDWs with a hammer and hope sharding
will eventually emerge and call that the plan.

I will say it again --- FDWs are the only sharding method I can think of
that has a chance of being accepted into Postgres core.  It is a plan,
and if it fails, it fails.  If is succeeds, that's good.  What more do
you want me to say?  I know of no other way to answer the questions you
asked above.


I do not understand why it should fail.
The FDW approach may not be flexible enough for building optimal distributed
query execution plans for complex OLAP queries.
But for simple queries it should work fine. Simple queries correspond to
OLTP and simple OLAP.
For OLTP we definitely need a transaction manager to provide global
consistency.
And we actually have a prototype integration of postgres_fdw with our
pg_dtm and pg_tsdtm transaction managers.

The results are IMHO quite promising (see attached diagram).

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



DTM-pgconf.pdf
Description: Adobe PDF document

-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] PROPOSAL: Fast temporary tables

2016-03-01 Thread Konstantin Knizhnik

As far as I know we are trying to kill two birds with one stone:
1. Reduce overhead of accessing temporary tables
2. Make it possible to create temporary tables on replica.

Replicas with hot standby are widely used for running read-only OLAP queries.
But such queries usually store intermediate results in temporary tables.
Unfortunately creating a temporary table at a read-only replica is impossible now.
So some customers use the following trick: they create a pool of file FDWs at the
master and then use them at replicas.
But IMHO it is an ugly and inefficient hack.

Ideally we should be able to create temporary tables at a replica without affecting
the system catalog.
But there are a lot of problems: where they should be stored, how to assign XIDs
to the tuples inserted into a temporary table, ...

Unfortunately, it looks like there is no simple solution to the problem.
The 100% solution is multimaster (which we are currently developing), but that is a
completely different story...


On 03/01/2016 10:17 PM, Jim Nasby wrote:

On 3/1/16 10:05 AM, Atri Sharma wrote:

Fair point, that means inventing a whole new OID generation structure..


Generation is just the tip of the iceberg. You still need the equivalent to 
foreign keys (ie: pg_depend). While you would never have a permanent object 
depend on a temp object, the reverse certainly needs to be supported.

If I were attempting to solve this at a SQL level, I'd be thinking about using table inheritance such that the permanent objects are stored in a permanent parent. New backends would create UNLOGGED children off of that parent. There would be a pid column 
that was always NULL in the parent, but populated in children. That means children could use their own local form of an OID. When a backend terminates you'd just truncate all it's tables.


Actually translating that into relcache and everything else would be a serious 
amount of work.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] The plan for FDW-based sharding

2016-03-01 Thread Konstantin Knizhnik

On 03/01/2016 09:19 PM, Petr Jelinek wrote:


Since this thread heavily discusses the XTM, I have question about the XTM as proposed because one thing is very unclear to me - what happens when user changes the XTM plugin on the server? I didn't see any xid handover API which makes me wonder if 
changes of a plugin (or for example failure to load previously used plugin due to admin error) will send server to similar situation as xid wraparound.



The transaction manager is a very "intimate" part of a DBMS, and certainly bugs and
problems in a custom TM implementation can break the server.
So if you are providing a custom TM implementation, you should take full
responsibility for system integrity.
The XTM API itself doesn't enforce any XID handling policy. As we do not
want to change the tuple header format, XID is still a 32-bit integer.

In the case of pg_dtm, global transactions at all nodes are assigned the same XID
by the arbiter. The arbiter handles XID wraparound.
In pg_tsdtm each node maintains its own XIDs; pg_tsdtm doesn't actually change the
way Postgres assigns XIDs, so wraparound in this case is handled in the standard
way. Instead of assigning its own global XIDs, pg_tsdtm provides a mapping between
local XIDs and global CSNs. The visibility checking rules look at CSNs, not at XIDs.


In both cases, if the system is for some reason restarted and the DTM plugin fails
to load, you can still access the database locally. No data can be lost.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] The plan for FDW-based sharding

2016-03-01 Thread Konstantin Knizhnik



On 01.03.2016 19:03, Robert Haas wrote:

On Tue, Mar 1, 2016 at 10:37 AM, Bruce Momjian <br...@momjian.us> wrote:

On Tue, Mar  1, 2016 at 10:19:45AM -0500, Robert Haas wrote:

Two reasons:
1. There is no ideal implementation of DTM which will fit all possible needs
and be  efficient for all clusters.

Hmm, what is the reasoning behind that statement?  I mean, it is
certainly true that there are some places where we have decided that
one-size-fits-all is not the right approach.  Indexing, for example.

Uh, is that even true of indexing?  While the plug-in nature of indexing
allows for easier development and testing, does anyone create plug-in
indexing that isn't shipped by us?  I thought WAL support was something
that prevented external indexing solutions from working.

True.  There is an API, though, and having pluggable WAL support seems
desirable too.  At the same time, I don't think we know of anyone
maintaining a non-core index AM ... and there are probably good
reasons for that.  We end up revising the index AM API pretty
regularly every time somebody wants to do something new, so it's not
really a stable API that extensions can just tap into.  I suspect that
a transaction manager API would end up similarly situated.



IMHO a non-stable API is better than no API at all,
just because it makes it possible to implement features in a modular way.
And refactoring an API is not such a difficult thing...


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] The plan for FDW-based sharding

2016-03-01 Thread Konstantin Knizhnik

Thank you very much for you comments.

On 01.03.2016 18:19, Robert Haas wrote:

On Sat, Feb 27, 2016 at 2:29 AM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

How do you prevent clock skew from causing serialization anomalies?

If node receives message from "feature" it just needs to wait until this
future arrive.
Practically we just "adjust" system time in this case, moving it forward
(certainly system time is not actually changed, we just set correction value
which need to be added to system time).
This approach was discussed in the article:
http://research.microsoft.com/en-us/people/samehe/clocksi.srds2013.pdf
I hope, in this article algorithm is explained much better than I can do
here.

Hmm, the approach in that article is very interesting, but it sounds
different than what you are describing - they do not, AFAICT, have
anything like a "correction value"


In the article they use the notion of a "wait":

if T.SnapshotTime > GetClockTime()
then wait until T.SnapshotTime < GetClockTime()

Originally we really did sleep here, but then we realized that instead of
sleeping we can just adjust the local time.
Sorry, I do not have a formal proof that it is equivalent, but... at least we
have not encountered any inconsistencies after this fix and performance
is improved.
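
To make the idea concrete, a minimal sketch (plain C, not the pg_tsdtm code) of replacing the sleep with a forward-only correction value:

#include <stdint.h>
#include <stddef.h>
#include <sys/time.h>

typedef uint64_t timestamp_us;

static timestamp_us clock_correction;	/* forward adjustment, starts at 0 */

static timestamp_us
system_time_us(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);
	return (timestamp_us) tv.tv_sec * 1000000 + tv.tv_usec;
}

/* Local notion of "now": never behind any timestamp we have already observed. */
static timestamp_us
adjusted_time_us(void)
{
	return system_time_us() + clock_correction;
}

/* Called when a message carries a timestamp from the "future". */
static void
observe_remote_time(timestamp_us remote)
{
	timestamp_us now = adjusted_time_us();

	if (remote > now)
		clock_correction += remote - now;	/* jump forward instead of sleeping */
}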



There are well know limitation of this  pg_tsdtm which we will try to
address in future.

How well known are those limitations?  Are they documented somewhere?
Or are they only well-known to you?

Sorry, well known to us.
But they are described at the DTM wiki page.
Right now pg_tsdtm does not support correct distributed deadlock
detection (it does not build a global lock graph) and detects
distributed deadlocks just based on timeouts.
It doesn't support explicit locks, but "select for update" will work
correctly.



What we want is to include XTM API in PostgreSQL to be able to continue our
experiments with different transaction managers and implementing multimaster
on top of it (our first practical goal) without affecting PostgreSQL core.

If XTM patch will be included in 9.6, then we can propose our multimaster as
PostgreSQL extension and everybody can use it.
Otherwise we have to propose our own fork of Postgres which significantly
complicates using and maintaining it.

Well I still think what I said before is valid.  If the code is good,
let it be a core submission.  If it's not ready yet, submit it to core
when it is.  If it can't be made good, forget it.


I have nothing against committing DTM code to core. But still the best
way to integrate it is to use an OO-like approach.
So we still need an API. Inserting if-s or switches into existing code is IMHO
an ugly idea.


Also, it is not enough for DTM code to be just "good". It should provide
the expected functionality.
But which functionality is expected? From my experience of developing
different cluster solutions I can say that
different customers have very different requirements. It is very hard, if
at all possible, to satisfy them all.


Right now I do not feel that I can predict all possible requirements for a DTM.
This is why we want to provide some API, propose some implementations of
this API, receive feedback and get a better understanding of which
functionality is actually needed by customers.






This seems rather defeatist.  If the code is good and reliable, why
should it not be committed to core?

Two reasons:
1. There is no ideal implementation of DTM which will fit all possible needs
and be  efficient for all clusters.

Hmm, what is the reasoning behind that statement?  I mean, it is
certainly true that there are some places where we have decided that
one-size-fits-all is not the right approach.  Indexing, for example.
But there are many other places where we have not chosen to make
things pluggable, and that I don't think it should be taken for
granted that plugability is always an advantage.

I fear that building a DTM that is fully reliable and also
well-performing is going to be really hard, and I think it would be
far better to have one such DTM that is 100% reliable than two or more
implementations each of which are 99% reliable.


The question is not about its reliability, but mostly about its
functionality and flexibility.



2. Even if such implementation exists, still the right way of it integration
is Postgres should use kind of TM API.

Sure, APIs are generally good, but that doesn't mean *this* API is good.


Well, I do not want to say "better than nothing", but I find this API to
be a reasonable compromise between flexibility and minimization of
changes in the PostgreSQL core. If you have some suggestions on how to improve
it, I will be glad to receive them.





I hope that everybody will agree that doing it in this way:

#ifdef PGXC
 /* In Postgres-XC, stop timestamp has to follow the timeline of GTM
*/
 xlrec.xact_time = xactStopTimestamp + GTMdeltaTimestamp;
#else
 xlrec.xact_ti

Re: [HACKERS] The plan for FDW-based sharding

2016-02-28 Thread Konstantin Knizhnik

On 02/27/2016 11:38 PM, Kevin Grittner wrote:


Is this an implementation of some particular formal technique?  If
so, do you have a reference to a paper on it?  I get the sense that
there has been a lot written about distributed transactions, and
that it would be a mistake to ignore it, but I have not (yet)
reviewed the literature for it.


The reference to the article is at our WiKi pages explaining our DTM: 
https://wiki.postgresql.org/wiki/DTM

http://research.microsoft.com/en-us/people/samehe/clocksi.srds2013.pdf

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] The plan for FDW-based sharding

2016-02-27 Thread Konstantin Knizhnik

Neither pg_dtm nor pg_tsdtm supports the serializable isolation level.
We implemented distributed snapshot isolation - the repeatable-read isolation level.
We also do not support the read-committed isolation level now.

We do not try to preserve transaction commit order at all nodes.
But in principle it can be implemented using the XTM API: it allows redefining the
function which actually sets the transaction status.  pg_dtm performs 2PC here.
And in principle it is possible to enforce commits in any particular order.

Concerning CSNs, maybe you are right and it is not correct to use this notion in this
case. Actually there are many "CSNs" involved in a transaction commit.
First of all, each transaction is assigned a local CSN (timestamp) when it is
ready to commit. Then the CSNs of all nodes are exchanged and the maximal CSN is chosen.
This maximum is written as the final transaction CSN and is used in visibility checks.
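
As a tiny illustration (a standalone helper, not the real pg_dtm code) of that last step:

#include <stdint.h>

typedef uint64_t csn_t;

/*
 * Each participant proposes its local CSN (the timestamp taken when it is
 * ready to commit); the maximum becomes the global commit CSN that is written
 * for the transaction and later used in visibility checks.
 */
static csn_t
negotiate_commit_csn(csn_t local_csn, const csn_t *remote_csns, int n_remote)
{
	csn_t		max_csn = local_csn;
	int			i;

	for (i = 0; i < n_remote; i++)
		if (remote_csns[i] > max_csn)
			max_csn = remote_csns[i];

	return max_csn;
}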


On 02/27/2016 01:48 AM, Kevin Grittner wrote:

On Fri, Feb 26, 2016 at 2:19 PM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:


pg_tsdtm  is based on another approach: it is using system time
as CSN

Which brings up an interesting point, if we want logical
replication to be free of serialization anomalies for those using
serializable transactions, we need to support applying transactions
in an order which may not be the same as commit order -- CSN (as
such) would be the wrong thing.  If serializable transaction 1 (T1)
modifies a row and concurrent serializable transaction 2 (T2) reads
the old version of the row, and modifies something based on that,
T2 must be applied to a logical replica first even if T1 commits
before it; otherwise the logical replica could see a state not
consistent with business rules and which could not have been seen
(due to SSI) on the source database.  Any DTM API which does not
support some mechanism to rearrange the order of transactions from
commit order to some other order (based on, for example, read-write
dependencies) is not complete.  If it does support that, it gives
us a way forward for presenting consistent data on logical
replicas.

To avoid confusion, it might be best to reserve CSN for actual
commit sequence numbers, or at least values which increase
monotonically with each commit.  The term of art for what I
described above is "apparent order of execution", so maybe we want
to use AOE or AOoE for the order we choose to use in a particular
implementation.  It doesn't seem to me to be outright inaccurate
for cases where the system time on the various systems is used.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] The plan for FDW-based sharding

2016-02-27 Thread Konstantin Knizhnik

On 02/27/2016 06:54 AM, Robert Haas wrote:

On Fri, Feb 26, 2016 at 10:56 PM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

We do not have formal prove that proposed XTM is "general enough" to handle
all possible transaction manager implementations.
But there are two general ways of dealing with isolation: snapshot based and
CSN  based.

I don't believe that for a minute.  For example, consider this article:


Well, I have to agree that I was not right in saying that there are just two ways
of providing distributed isolation.
There is at least one more method: conservative locking. But it will cause a huge
number of extra network messages to be exchanged.
Also, I mostly considered solutions compatible with the PostgreSQL MVCC model.

And there are definitely other approaches, like preserving transaction commit
order (as is done in Galera).
Some of them can be implemented with XTM (preserving commit order), some
cannot (2PL).
I have already noted that XTM does not allow implementing ANY transaction
manager.
But we have considered several approaches to distributed transaction management
explained in articles related to really working systems.
Some of them are real production systems, such as SAP HANA, some are just prototypes,
but working prototypes for which the authors have performed
some benchmarking and comparison with other approaches. The references you have
mentioned are mostly theoretical descriptions of the problem.
Nice to know, but it is hard to build a concrete implementation based on
these articles.


Briefly answering your other questions:


For example, consider a table with a million rows spread across any number of 
servers.


This is a sharding scenario; pg_tsdtm will work well in this case, without
requiring a lot of extra messages to be sent.


Now consider another workload where each transaction reads a row one

one server, reads a row on another server,

It can be solved both with pg_dtm (central arbiter) and pg_tsdtm (no arbiter).
But actually your scenarios just once again prove that there cannot be just
one ideal distributed TM.


So maybe the goal for the GTM isn't to provide true serializability

across the cluster but some lesser degree of transaction isolation.
But then exactly which serialization anomalies are we trying to
prevent, and why is it OK to prevent those and not others?

Absolutely agree. There are some theoretical discussions regarding CAP and
different distributed levels of isolation.
But in practice people want to solve their tasks. Most PostgreSQL users are using the
default isolation level, read committed, although there are a lot of "wonderful"
anomalies with it.
Serializable transactions in Oracle actually violate the fundamental
serializability rule, and still Oracle is one of the most popular databases in
the world...
There was an isolation bug in Postgres-XL which didn't prevent commercial
customers from using it...

So I do not say that discussing all these theoretical questions, or a formally
proven correctness of a distributed algorithm, is not needed.
But I do not understand why it should prevent us from providing an extensible TM
API.
Yes, we cannot do everything with it. But still we can implement many
different approaches.
I think that somehow proves that it is "general enough".


https://en.wikipedia.org/wiki/Global_serializability

I think the neutrality of that article is *very* debatable, but it
certainly contradicts the idea that snapshots and CSNs are the only
methods of achieving global serializability.

Or consider this lecture:

http://hssl.cs.jhu.edu/~randal/416/lectures.old/ln5.2.pdf

That's a great introduction to the problem we're trying to solve here,
but again, snapshots are not mentioned, and CSNs certainly aren't
mentioned.

This write-up goes further, explaining three different methods for
ensuring global serializability, none of which mention snapshots or
CSNs:

http://heaven.eee.metu.edu.tr/~vision/LectureNotes/EE442/Ee442ch7.html

Actually, I think the second approach is basically a snapshot/CSN-type
approach, but it doesn't use that terminology and the connection to
what you are proposing is very unclear.

I think you're approaching this problem from a viewpoint that is
entirely too focused on the code that exists in PostgreSQL today.
Lots of people have done lots of academic research on how to solve
this problem, and you can't possibly say that CSNs and snapshots are
the only solution to this problem unless you haven't read any of those
papers.  The articles above aren't exceptional in mentioning neither
of the approaches that you are advocating - they are typical of the
literature in this area.  How can it be that the only solutions to
this problem are ones that are totally different from the approaches
that university professors who spend time doing research on
concurrency have spent time exploring?

I think we need to back up here and examine our underlyin

Re: [HACKERS] Relation cache invalidation on replica

2016-02-26 Thread Konstantin Knizhnik

On 02/27/2016 04:16 AM, Simon Riggs wrote:

On 27 February 2016 at 00:33, Simon Riggs <si...@2ndquadrant.com 
<mailto:si...@2ndquadrant.com>> wrote:

On 27 February 2016 at 00:29, Andres Freund <and...@anarazel.de 
<mailto:and...@anarazel.de>> wrote:

On 2016-02-26 18:05:55 +0300, Konstantin Knizhnik wrote:
> The reason of the problem is that invalidation messages are not 
delivered to
> replica after the end of concurrent create index.
> Invalidation messages are included in xlog as part of transaction 
commit
> record.
> Concurrent index create is split into three transaction, last of 
which is
> just performing inplace update of index tuple, marking it as valid and
> invalidating cache. But as far as this transaction is not assigned 
XID, no
> transaction record is created in WAL and send to replicas. As a 
result,
> replica doesn't receive this invalidation messages.

Ugh, that's a fairly ugly bug.


Looking now.


If the above is true, then the proposed fix wouldn't work either.

No point in sending a cache invalidation message on the standby if you haven't 
also written WAL, since the catalog re-read would just see the old row.

heap_inplace_update() does write WAL, which blows away the starting premise.

So I'm not seeing this as an extant bug in an open source version of 
PostgreSQL, in my current understanding.



The inplace update really does create a record in WAL, and this is why the index is
marked as valid at the replica.
But invalidation messages are sent only with the transaction commit record, and such
a record is not created in this case,
because there is no assigned XID.

This is a real bug which was originally observed by one of our customers with
different versions of Postgres (the last one they tried was 9.5.1).
Then we reproduced it locally and determined the reason for the problem.
The repro scenario is very simple: you just need to create a large enough table
(pgbench with scale factor 100 works well in my case)
so that "create index concurrently" takes a substantial amount of time. If, while
this statement is in progress, you execute some query at the replica which
would use this index, then it will cache the state of the relation without the index.
And afterwards, even when the index is actually constructed, it will never be used in
this backend (but other backends at the replica will use it).

I am not sure about the best way to fix the problem.
I have not tested whether Andres's proposal:

if (nrels != 0 || nmsgs != 0 || RelcacheInitFileInval)

actually fixes the problem.
Assigning an XID in heap_inplace_update definitely should work.
Is it better than forcing XID assignment in DefineIndex? I am not sure, because
this problem seems to be related only to the concurrent case
(but maybe I am wrong).
At least not all inplace updates cause catalog invalidation and so
require XID assignment.
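
For clarity, a minimal sketch (untested, placement is approximate) of what the heap_inplace_update variant amounts to: a helper that any catalog in-place update queuing invalidations could call.

#include "postgres.h"
#include "access/xact.h"

/*
 * Make sure the eventual commit record, which is what carries the
 * invalidation messages to standbys, is actually written: assigning an XID
 * forces RecordTransactionCommit to emit a WAL record.
 */
static void
force_commit_record_for_invalidations(void)
{
	(void) GetCurrentTransactionId();
}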




--
Simon Riggs http://www.2ndQuadrant.com/ <http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] The plan for FDW-based sharding

2016-02-26 Thread Konstantin Knizhnik

On 02/27/2016 06:57 AM, Robert Haas wrote:

On Sat, Feb 27, 2016 at 1:49 AM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

pg_tsdtm  is based on another approach: it is using system time as CSN and
doesn't require arbiter. In theory there is no limit for scalability. But
differences in system time and necessity to use more rounds of communication
have negative impact on performance.

How do you prevent clock skew from causing serialization anomalies?


If a node receives a message from the "future" it just needs to wait until this future
arrives.
Practically we just "adjust" the system time in this case, moving it forward
(certainly the system time is not actually changed, we just set a correction value which
needs to be added to the system time).
This approach was discussed in the article:
http://research.microsoft.com/en-us/people/samehe/clocksi.srds2013.pdf
I hope that in this article the algorithm is explained much better than I can do here.

A few notes:
1. I cannot prove that our pg_tsdtm absolutely correctly implements the approach
described in this article.
2. I didn't try to formally prove that our implementation cannot cause some
serialization anomalies.
3. We just ran various synchronization tests (including the simplest debit-credit
test, which breaks an old version of Postgres-XL) for several days and we didn't
get any inconsistencies.
4. We have tested pg_tsdtm on a single node, on a blade cluster, and on geographically
distributed nodes (distance more than a thousand kilometers: one server was in
Vladivostok, another in Kaliningrad). Ping between these two servers takes about
100 msec. Performance of our benchmark drops about 100 times, but there were no
inconsistencies.


Also, I once again want to note that the primary idea of the proposed patch was
not pg_tsdtm.
There are well-known limitations of pg_tsdtm which we will try to address
in the future.
What we want is to include the XTM API in PostgreSQL to be able to continue our
experiments with different transaction managers and implement multimaster on
top of it (our first practical goal) without affecting the PostgreSQL core.

If the XTM patch is included in 9.6, then we can propose our multimaster as a
PostgreSQL extension and everybody can use it.
Otherwise we have to propose our own fork of Postgres, which significantly
complicates using and maintaining it.


So there is no ideal solution which can work well for all cluster. This is
why it is not possible to develop just one GTM, propose it as a patch for
review and then (hopefully) commit it in Postgres core. IMHO it will never
happen. And I do not think that it is actually needed. What we need is a way
to be able to create own transaction managers as Postgres extension not
affecting its  core.

This seems rather defeatist.  If the code is good and reliable, why
should it not be committed to core?


Two reasons:
1. There is no ideal implementation of DTM which will fit all possible needs
and be efficient for all clusters.
2. Even if such an implementation existed, the right way of integrating it is still
for Postgres to use a kind of TM API.
I hope that everybody will agree that doing it in this way:

#ifdef PGXC
/* In Postgres-XC, stop timestamp has to follow the timeline of GTM */
xlrec.xact_time = xactStopTimestamp + GTMdeltaTimestamp;
#else
xlrec.xact_time = xactStopTimestamp;
#endif

or in this way:

xlrec.xact_time = xactUseGTM ? xactStopTimestamp + GTMdeltaTimestamp : 
xactStopTimestamp;

is a very, very bad idea.
In OO programming we should have an abstract TM interface and several
implementations of this interface, for example
MVCC_TM, 2PL_TM, Distributed_TM...
This is actually what can be done with our XTM API.
As Postgres is implemented in C, not in C++, we have to emulate
interfaces using structures with function pointers.
And please notice that there is absolutely no need to include a DTM
implementation in core, as it is not needed by everybody.
It can easily be distributed as an extension.
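
Purely as an illustration (a simplified sketch, not the actual XTM struct), such an "interface" in C looks like this:

#include "postgres.h"
#include "access/clog.h"		/* XidStatus */
#include "access/xlogdefs.h"
#include "utils/snapshot.h"

typedef struct TransactionManager
{
	TransactionId (*GetNewTransactionId) (bool isSubXact);
	Snapshot	(*GetSnapshot) (Snapshot snapshot);
	XidStatus	(*GetTransactionStatus) (TransactionId xid, XLogRecPtr *lsn);
	void		(*SetTransactionStatus) (TransactionId xid, int nsubxids,
										 TransactionId *subxids,
										 XidStatus status, XLogRecPtr lsn);
	bool		(*XidInSnapshot) (TransactionId xid, Snapshot snapshot);
} TransactionManager;

/*
 * The core keeps a pointer to the active implementation (initialized to the
 * built-in MVCC manager); an extension replaces it with its own struct.
 */
extern TransactionManager *TM;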

I hope that quite soon we can propose a multimaster extension which should provide
functionality similar to MySQL Galera. But even right now we have integrated pg_dtm
and pg_tsdtm with pg_shard and postgres_fdw, allowing us to provide distributed
consistency for them.






All arguments against XTM can be applied to any other extension API in
Postgres, for example FDW.
Is it general enough? There are many useful operations which currently are
not handled by this API. For example performing aggregation and grouping at
foreign server side.  But still it is very useful and flexible mechanism,
allowing to implement many wonderful things.

That is true.  And everybody is entitled to an opinion on each new
proposed hook, as to whether that hook is general or not.  We have
both accepted and rejected proposed hooks in the past.




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers m

Re: [HACKERS] The plan for FDW-based sharding

2016-02-26 Thread Konstantin Knizhnik

On 02/26/2016 09:30 PM, Alvaro Herrera wrote:

Konstantin Knizhnik wrote:


Yes, it is certainly possible to develop cluster by cloning PostgreSQL.
But it cause big problems both for developers, which have to permanently
synchronize their branch with master,
and, what is more important, for customers, which can not use standard
version of PostgreSQL.
It may cause problems with system certification, with running Postgres in
cloud,...
Actually the history of Postgres-XL/XC and Greenplum IMHO shows that it is
wrong direction.

That's not the point, though.  I don't think a Postgres clone with a GTM
solves any particular problem that's not already solved by the existing
forks.  However, if you have a clone at home and you make a GTM work on
it, then you take the GTM as a patch and post it for discussion.
There's no need for hooks for that.  Just make sure your GTM solves the
problem that it is supposed to solve.

Excuse me if I've missed the discussion elsewhere -- why does
PostgresPro have *two* GTMs instead of a single one?


There are many different clusters which require different approaches to
managing distributed transactions.
Some clusters do not need distributed transactions at all: if you are executing
OLAP queries on a read-only database, a GTM will just add extra overhead.

pg_dtm uses a centralized arbiter. It is similar to the Postgres-XL DTM. The presence
of a single arbiter significantly simplifies all distributed algorithms: failure
detection, global deadlock elimination, ... But at the same time the arbiter is a SPOF
and the main factor limiting cluster scalability.


pg_tsdtm is based on another approach: it uses system time as the CSN and doesn't
require an arbiter. In theory there is no limit to scalability. But differences in
system time and the necessity of using more rounds of communication have a negative
impact on performance.


So there is no ideal solution which can work well for all clusters. This is why it is
not possible to develop just one GTM, propose it as a patch for review and then
(hopefully) commit it in the Postgres core. IMHO it will never happen. And I do not
think that it is actually needed. What we need is a way to create our own transaction
managers as Postgres extensions without affecting the core.


All arguments against XTM can be applied to any other extension API in
Postgres, for example FDW.
Is it general enough? There are many useful operations which are currently not handled
by that API, for example performing aggregation and grouping on the foreign server
side. But still it is a very useful and flexible mechanism, allowing many
wonderful things to be implemented.


From my point of view a good system should be as open and customizable as
possible, as long as it doesn't affect performance.
Replacing direct function calls with indirect function calls, just like adding
hooks, in almost all cases does not hurt performance.
So without any extra price we get better flexibility. What's wrong with that?






--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] The plan for FDW-based sharding

2016-02-26 Thread Konstantin Knizhnik
We do not have a formal proof that the proposed XTM is "general enough" to
handle all possible transaction manager implementations.
But there are two general ways of dealing with isolation: snapshot based
and CSN based.

pg_dtm and pg_tsdtm prove that both of them can be implemented using XTM.
If you know some other approach to distributed transaction manager
implementation, please let us know.

Otherwise your statement "is not general enough" is not concrete enough.
The Postgres-XL GTM can in principle be implemented as an extension based on XTM.

This API is based on existing PostgreSQL TM functions: we do not
introduce new abstractions.
Is it possible that some other TM function has to be encapsulated? Yes,
it is.
But I do not see much of a problem with adding such a function to XTM in the
future if it is actually needed.
It happens with most APIs. It is awful when API functions are changed,
breaking applications based on the API.
But as the functions encapsulated in XTM are in any case present in the
PostgreSQL core, I do not think
that they will be changed in the future unless there are plans to
completely rewrite the Postgres transaction manager...


Yes, it is certainly possible to develop a cluster by cloning PostgreSQL.
But it causes big problems both for developers, who have to permanently
synchronize their branch with master,
and, what is more important, for customers, who cannot use a standard
version of PostgreSQL.
It may cause problems with system certification, with running Postgres
in the cloud, ...
Actually the history of Postgres-XL/XC and Greenplum IMHO shows that it
is the wrong direction.




On 26.02.2016 19:06, Robert Haas wrote:

On Fri, Feb 26, 2016 at 7:21 PM, Oleg Bartunov <obartu...@gmail.com> wrote:

Right now tm is hardcoded and it's doesn't matter  "if other people might
need" at all.  We at least provide developers ("other people")  ability to
work on their implementations and the patch  is safe and doesn't sacrifices
anything in core.

I don't believe that.  When we install APIs into core, we're
committing to keep those APIs around.  And I think that we're far too
early in the development of transaction managers for PostgreSQL to
think that we know what APIs we want to commit to over the long term.


And what makes us think we
really need multiple transaction managers, anyway?

If you brave enough to say that one tm-fits-all and you are able to teach
existed tm to play well  in various clustering environment during
development period, which is short, than probably we don't need  multiple
tms. But It's too perfect to believe and practical solution is to let
multiple groups to work on their solutions.

Nobody's preventing multiple groups for working on their solutions.
That's not the question.  The question is why we should install hooks
in core at this early stage without waiting to see which
implementations prove to be best and whether those hooks are actually
general enough to cater to everything people want to do.  There is
talk of integrating XC/XL work into PostgreSQL; it has a GTM.
Postgres Pro has several GTMs.  Maybe there will be others.

Frankly, I'd like to see a GTM in core at some point because I'd like
everybody who uses PostgreSQL to have access to a GTM.  What I don't
want is for every PostgreSQL company to develop its own GTM and
distribute it separately from everybody else's.  IIUC, MySQL kinda did
that with storage engines and it resulted in the fragmentation of the
community.  We've had the same thing happen with replication tools -
every PostgreSQL company develops their own set.  It would have been
better to have ONE set that was distributed by the core project so
that we didn't all do the same work over again.

I don't understand the argument that without these hooks in core,
people can't continue to work on this.  It isn't hard to work on GTM
without any core changes at all.  You just patch your copy of
PostgreSQL.  We do this all the time, for every patch.  We don't add
hooks for every patch.


dtms.  It's time to start working on dtm, I believe. The fact you don't
think about distributed transactions support doesn't mean there no "other
people", who has different ideas on postgres future.  That's why we propose
this patch, let's play the game !

I don't like to play games with the architecture of PostgreSQL.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Relation cache invalidation on replica

2016-02-26 Thread Konstantin Knizhnik
The reason for the problem is that invalidation messages are not 
delivered to the replica after the end of a concurrent index build.
Invalidation messages are included in the xlog as part of the transaction commit 
record.
A concurrent index build is split into three transactions, the last of which 
just performs an in-place update of the index tuple, marking it as valid 
and invalidating the cache. But since this transaction is not assigned an 
XID, no commit record is created in the WAL and sent to the replicas. As a 
result, the replica doesn't receive these invalidation messages.


To fix the problem it is enough to assign an XID to the transaction.
It can be done by adding a GetCurrentTransactionId() call to the end of 
the DefineIndex function:


diff --git a/src/backend/commands/indexcmds.c 
b/src/backend/commands/indexcmds.c

index 13b04e6..1024603 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -881,6 +881,12 @@ DefineIndex(Oid relationId,
 CacheInvalidateRelcacheByRelid(heaprelid.relId);

 /*
+ * Force WAL commit record to ensure that replica receives invalidation
+ * messages.
+ */
+GetCurrentTransactionId();
+
+/*
  * Last thing to do is release the session-level lock on the parent table.
  */
 UnlockRelationIdForSession(&heaprelid, ShareUpdateExclusiveLock);




On 26.02.2016 15:41, Васильев Дмитрий wrote:
Session opened on replica doesn't see concurrently created indexes at 
this time on master.


We have master and replica:

1. master: pgbench -i -s 10

2. replica:
explain (analyze,verbose) select * from pgbench_accounts where 
abalance = 1;


3. master:
ALTER INDEX pgbench_accounts_abalance_idx RENAME TO 
pgbench_accounts_abalance_idx_delme;
CREATE INDEX CONCURRENTLY pgbench_accounts_abalance_idx ON 
pgbench_accounts USING btree (abalance);

DROP INDEX pgbench_accounts_abalance_idx_delme;

4. at this time on replica:

explain (analyze,verbose) select * from pgbench_accounts where 
abalance = 1;
pgbench=# explain (analyze,verbose) select * from pgbench_accounts 
where abalance = 1;

QUERY PLAN

Index Scan using pgbench_accounts_abalance_idx on 
public.pgbench_accounts (cost=0.42..4.44 rows=1 width=97) (actual 
time=655.781..655.781 rows=0 loops=1)

Output: aid, bid, abalance, filler
Index Cond: (pgbench_accounts.abalance = 1)
Planning time: 9388.259 ms
Execution time: 655.900 ms
(5 rows)

pgbench=# explain (analyze,verbose) select * from pgbench_accounts 
where abalance = 1;

QUERY PLAN
--
Index Scan using pgbench_accounts_abalance_idx_delme on 
public.pgbench_accounts (cost=0.42..4.44 rows=1 width=97) (actual 
time=0.014..0.014 rows=0 loops=1)

Output: aid, bid, abalance, filler
Index Cond: (pgbench_accounts.abalance = 1)
Planning time: 0.321 ms
Execution time: 0.049 ms
(5 rows)

pgbench=# explain (analyze,verbose) select * from pgbench_accounts 
where abalance = 1;

QUERY PLAN

Seq Scan on public.pgbench_accounts (cost=0.00..28894.00 rows=1 
width=97) (actual time=3060.451..3060.451 rows=0 loops=1)

Output: aid, bid, abalance, filler
Filter: (pgbench_accounts.abalance = 1)
Rows Removed by Filter: 100
Planning time: 0.087 ms
Execution time: 3060.484 ms
(6 rows)

pgbench=# \d+ pgbench_accounts
                   Table "public.pgbench_accounts"
  Column  |     Type      | Modifiers | Storage  | Stats target | Description
----------+---------------+-----------+----------+--------------+-------------
 aid      | integer       | not null  | plain    |              |
 bid      | integer       |           | plain    |              |
 abalance | integer       |           | plain    |              |
 filler   | character(84) |           | extended |              |
Indexes:
    "pgbench_accounts_pkey" PRIMARY KEY, btree (aid)
    "pgbench_accounts_abalance_idx" btree (abalance)
Options: fillfactor=100

A newly opened session successfully uses this index.
Tested with PostgreSQL 9.5.1.

---
Dmitry Vasilyev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] The plan for FDW-based sharding

2016-02-24 Thread Konstantin Knizhnik
n see what workloads it covers and
decide if we are willing to copy the volume of code necessary
to implement all supported Postgres XC or XL workloads.
(The Postgres XL license now matches the Postgres license,
http://www.postgres-xl.org/2015/07/license-change-and-9-5-merge/.
Postgres XC has always used the Postgres license.)

If we are not willing to add code for the missing Postgres XC/XL
features, Postgres XC/XL will probably remain a separate fork of
Postgres.  I don't think anyone knows the answer to this question, and I
don't know how to find the answer except to keep going with our current
FDW sharding approach.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] pglogical - logical replication contrib module

2016-02-17 Thread Konstantin Knizhnik

Ok, what about the following plan:

1. Support custom WAL records (as far as I know 2ndQuadrant has such a patch).
2. Add one more function to logical decoding, allowing plugins to deal with 
custom records.


So the idea is that we somehow record DDL in the WAL (for example using an 
executor hook),
then it is processed using logical decoding, calling a special logical 
decoding plugin function to handle these records.
For example, we can store DDL in the WAL just as SQL statements and so easily 
replay them.


In this case DDL will be replicated using the same mechanism and through 
the same channel as DML.
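
A minimal SQL-level sketch of the capture side, assuming we approximate "record DDL as SQL statements" with an event trigger writing into an ordinary table (table indirection rather than a real custom WAL record, which would need C-level support; all names here are hypothetical):

   CREATE TABLE replicated_ddl(id bigserial PRIMARY KEY, command text NOT NULL);

   CREATE FUNCTION capture_ddl() RETURNS event_trigger LANGUAGE plpgsql AS $$
   BEGIN
       -- current_query() returns the top-level statement text, which the
       -- apply side can simply execute; DDL issued inside functions is missed.
       INSERT INTO replicated_ddl(command) VALUES (current_query());
   END;
   $$;

   CREATE EVENT TRIGGER capture_ddl_end ON ddl_command_end
       EXECUTE PROCEDURE capture_ddl();

Since the queue table flows through the same logical decoding stream as ordinary DML, the DDL stays in its original transactional position.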



On 17.02.2016 12:16, Craig Ringer wrote:
On 17 February 2016 at 16:24, Konstantin Knizhnik 
<k.knizh...@postgrespro.ru <mailto:k.knizh...@postgrespro.ru>> wrote:


Thanks for your explanation. I have to agree with your arguments
that in general case replication of DDL statement using logical
decoding seems to be problematic. But we are mostly considering
logical decoding in quite limited context: replication between two
identical Postgres database nodes (multimaster).


Yep, much like BDR. Where all this infrastructure came from and is/was 
aimed at.


Do you think that it in this case replication of DLL can be done
as sequence of low level operations with system catalog tables
including manipulation with locks?


No.

For one thing logical decoding doesn't see catalog tuple changes right 
now. Though I imagine that could be changed easily enough.


More importantly - oids. You add a column to a table:

ALTER TABLE mytable ADD COLUMN mycolumn some_type UNIQUE NOT NULL 
DEFAULT some_function()


This writes to catalogs including:

pg_attribute
pg_constraint
pg_index
pg_class (for the index relation)

... probably more. It also refers to pg_class (for the definition of 
mytable), pg_type (definition of some_type), pg_proc (definition of 
some_function), the b-tree operator class for some_type in pg_opclass, 
the b-tree indexam in pg_am, ... more.


Everything is linked by oids, and the oids are all node local. You 
can't just blindly re-use them. If "some_type" is hstore, the oid of 
hstore in pg_type might be different on the upstream and downstream. 
The only exception is the oids of built-in types and even then that's 
not guaranteed across major versions.
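
For illustration, the node-local nature of these oids is easy to see (assuming the hstore extension is installed on both nodes):

   SELECT 'hstore'::regtype::oid;  -- typically returns different values on upstream and downstream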


So if you blindly replicate catalog row changes you'll get a horrible 
mess. That's before considering a table's relfilenode, which is 
initially the same as its oid, but subject to change if truncated or 
rewritten.


To even begin to do this half-sanely you'd have to maintain a mapping 
of upstream object oids->names on the downstream, with invalidations 
replicated from the upstream. That's only the beginning. There's 
handling of extensions and lots more fun.


So in your example with ALTER TABLE statement, can we correctly
replicate it to other nodes
as request to set exclusive lock + some manipulations with catalog
tables and data table itself?


Nope. No hope, not unless "some manipulations with catalog tables and 
data table its self" is a lot more comprehensive than I think you mean.


1. Add option whether to include operations on system catalog
tables in logical replication or not.


I would like to have this anyway.

2. Make it possible to replicate lock requests (can be useful not
only for DDLs)


I have no idea how you'd even begin to do that.

I looked how DDL was implemented in BDR and did it in similar way
in our multimaster.
But it is awful: we need to have two different channels for
propagating changes.


Yeah, it's not beautiful, but maybe you misunderstood something? The 
DDL is written to a table, and that table's changes are replayed along 
with everything else. It's consistent and keeps DDL changes as part of 
the xact that performed them. Maybe you misunderstood how it works in 
BDR and missed the indirection via a table?


Additionally, in multimaster we want to enforce cluster wide ACID.
It certainly includes operations with metadata. It will be very
difficult to implement if replication of DML and DDL is done in
two different ways...


That's pretty much why BDR does it this way, warts and all. Though it 
doesn't offer cluster-wide ACID it does need atomic commit of xacts 
that may contain DML, DDL, or some mix of the two.


Let me ask one more question concerning logical replication: how
difficult it will be from your point of view to support two phase
commit in logical replication? Are there some principle problems?


I haven't looked closely yet. Andres will know more.

I very, very badly want to be able to decode 2PC prepared xacts myself.

The main issue I'm aware of is locking - specifically the inability to 
impersonate another backend and treat locks held by that backend 
(which might be a fake backend for a pg_prepared_xacts entry) as held 
by ourselves for the purpose of being able t

Re: [HACKERS] pglogical - logical replication contrib module

2016-02-17 Thread Konstantin Knizhnik
t;don't create FKs pointing from non-replicated 
tables to tables replicated from somewhere else". A concept we 
currently have no way to express or enforce like we do 
persistent-to-UNLOGGED FKs.




Then there's global objects. Something as simple as:

CREATE ROLE fred;

CREATE TABLE blah(...) OWNER fred;

will break replication because we only see the CREATE TABLE, not the 
CREATE ROLE. If we instead replayed the CREATE ROLE and there were 
multiple connections between different DBs on an upstream and 
downstream apply would fail on all but one. But we can't anyway since 
there's no way to capture that CREATE ROLE from any DB except the one 
it was executed in, which might not even be one of the ones doing 
replication.


I strongly suspect we'll need logical decoding to be made aware of 
such global DDL and decode it from the WAL writes to the system 
catalogs. Which will be fun - but at least modifications to the shared 
catalogs are a lot simpler than the sort of gymnastics done by ALTER 
TABLE, etc.




--
 Craig Ringer http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Clock with Adaptive Replacement

2016-02-12 Thread Konstantin Knizhnik

Thank you very much for the response.

I am not sure that CART can significantly improve PostgreSQL 
performance - I just want to know the opinion of the community about
CAR/CART and other possible alternatives to the GCLOCK algorithm.

It looks like CAR really provides a better cache hit ratio and so should 
increase Postgres performance for some workloads.
But nowadays the amount of memory on servers is large enough to completely keep 
most typical databases in cache.
So the time spent locating a buffer in the cache is more critical than the time 
spent on buffer eviction.

And here CART doesn't provide any benefits compared with the GCLOCK algorithm.

One of the problems with the GCLOCK algorithm, from my point of view, is that 
for large caches containing a larger number of pages, locating a victim page 
can take a substantial amount of time, because we have to perform several 
turnovers before some usage count becomes zero.  In theory CART can address 
this problem because there are no counters - just a single bit per page.





On 12.02.2016 18:55, Robert Haas wrote:

On Thu, Feb 11, 2016 at 4:02 PM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

What do you think about improving cache replacement clock-sweep algorithm in
PostgreSQL with adaptive version proposed in this article:

 http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf

Are there some well known drawbacks of this approach or it will be
interesting to adopt this algorithm to PostgreSQL and measure it impact om
performance under different workloads?
I find this ten years old thread:

http://www.postgresql.org/message-id/flat/d2jkde$6bg$1...@sea.gmane.org#d2jkde$6bg$1...@sea.gmane.org

but it mostly discus possible patent issues with another algorithm ARC (CAR
is inspired by ARC,  but it is different algorithm).
As far as I know there are several problems with current clock-sweep
algorithm in PostgreSQL, especially for very large caches.
May be CAR can address some of them?

Maybe, but the proof of the pudding is in the eating.  Just because an
algorithm is smarter, newer, and better in general than our current
algorithm - and really, it wouldn't be hard - doesn't mean that it
will actually solve the problems we care about.  A few of my
EnterpriseDB colleagues spent a lot of time benchmarking various
tweaks to our current algorithm last year and were unable to construct
a test case where it sped anything up.  If they tried the same tweaks
against the 9.4 source base, they could get a speedup.  But 9.5 had
locking improvements around buffer eviction, and with those
improvements committed there was no longer any measurable benefit to
improving the quality of buffer eviction decisions.  That's a
surprising result, to me anyway, and somebody else might well find a
test case where a benefit can be shown - but our research was not
successful.

I think it's important to spend time and energy figuring out exactly
what the problems with our current algorithm are.  We know in general
terms that usage counts tend to converge to either 5 or 0 and
therefore sometimes evict buffers both at great cost and almost
randomly.  But what's a lot less clear is how much that actually hurts
us given that we are relying on the OS cache anyway.  It may be that
we need to fix some other things before or after improving the buffer
eviction algorithm before we actually get a performance benefit.  I
suspect, for example, that a lot of the problems with large
shared_buffers settings have to do with the bgwriter and checkpointer
behavior rather than with the buffer eviction algorithm; and that
others have to do with cache duplication between PostgreSQL and the
operating system.  So, I would suggest (although of course it's up to
you) that you might want to focus on experiments that will help you
understand where the problems are before you plunge into writing code
to fix them.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





[HACKERS] Clock with Adaptive Replacement

2016-02-11 Thread Konstantin Knizhnik

Hi hackers,

What do you think about improving the cache replacement clock-sweep algorithm in 
PostgreSQL with the adaptive version proposed in this article:

http://www-cs.stanford.edu/~sbansal/pubs/fast04.pdf

Are there some well known drawbacks of this approach, or would it be interesting 
to adapt this algorithm to PostgreSQL and measure its impact on performance 
under different workloads?
I found this ten-year-old thread:

http://www.postgresql.org/message-id/flat/d2jkde$6bg$1...@sea.gmane.org#d2jkde$6bg$1...@sea.gmane.org

but it mostly discusses possible patent issues with another algorithm, ARC (CAR is 
inspired by ARC, but it is a different algorithm).
As far as I know there are several problems with the current clock-sweep algorithm 
in PostgreSQL, especially for very large caches.
Maybe CAR can address some of them?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





[HACKERS] eXtensible Transaction Manager API (v2)

2016-02-10 Thread Konstantin Knizhnik

Hi,

The Postgres Professional cluster team wants to propose a new version of the 
eXtensible Transaction Manager API.

Previous discussion concerning this patch can be found here:

http://www.postgresql.org/message-id/f2766b97-555d-424f-b29f-e0ca0f6d1...@postgrespro.ru

The API patch itself is small enough, but we think that it would be 
strange to provide just the API without examples of its usage.


We have implemented various distributed transaction 
managers based on this API:
pg_dtm (based on snapshot sharing) and pg_tsdtm (CSN based on local 
system time).
Based on these two DTM implementations we have developed various "cluster" 
implementations:
multimaster+pg_dtm, multimaster+pg_tsdtm, pg_shard+pg_dtm, 
pg_shard+pg_tsdtm, postgres_fdw+pg_dtm, postgres_fdw+pg_tsdtm,... 
Multimaster, which is based on logical replication, is something like BDR but 
synchronous: it provides consistency across the cluster.


But we want to make this patch as small as possible.
So we decided to include in it only pg_tsdtm and a patch of postgres_fdw 
allowing it to be used with pg_tsdtm.
pg_tsdtm is simpler than pg_dtm because the latter includes an arbiter with 
a RAFT protocol (a centralized service)
and sockhub for efficient multiplexing of backend connections.
Also, in theory, pg_tsdtm provides better scalability, because it is 
decentralized.


The architecture of DTM and tsDTM as well as benchmark results are available 
at the wiki page:


https://wiki.postgresql.org/wiki/DTM

Please notice pg_tsdtm is just a reference implementation of a DTM using 
this XTM API.
The primary idea of this patch is to add the XTM API to the PostgreSQL code, 
allowing custom transaction managers to be implemented as
Postgres extensions. So please review first of all the XTM API itself and not 
pg_tsdtm, which is just an example of its usage.


The complete PostgreSQL branch with all our changes can be found here:

https://github.com/postgrespro/postgres_cluster


-- Konstantin Knizhnik Postgres Professional: http://www.postgrespro.com 
The Russian Postgres Company
diff --git a/contrib/pg_tsdtm/Makefile b/contrib/pg_tsdtm/Makefile
new file mode 100644
index 000..e70dffc
--- /dev/null
+++ b/contrib/pg_tsdtm/Makefile
@@ -0,0 +1,20 @@
+MODULE_big = pg_tsdtm
+OBJS = pg_tsdtm.o
+
+EXTENSION = pg_tsdtm
+DATA = pg_tsdtm--1.0.sql
+
+ifdef USE_PGXS
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+include $(PGXS)
+else
+subdir = contrib/pg_tsdtm
+top_builddir = ../..
+include $(top_builddir)/src/Makefile.global
+include $(top_srcdir)/contrib/contrib-global.mk
+endif
+
+check:
+	env DESTDIR='$(abs_top_builddir)'/tmp_install make install
+	$(prove_check)
diff --git a/contrib/pg_tsdtm/dtm_recovery/dtm_recovery.cpp b/contrib/pg_tsdtm/dtm_recovery/dtm_recovery.cpp
new file mode 100644
index 000..38285be
--- /dev/null
+++ b/contrib/pg_tsdtm/dtm_recovery/dtm_recovery.cpp
@@ -0,0 +1,129 @@
+#include 
+#include 
+#include 
+#include 
+
+#include 
+#include 
+#include 
+
+using namespace std;
+using namespace pqxx;
+
+int main (int argc, char* argv[])
+{
+if (argc == 1){
+printf("Use -h to show usage options\n");
+return 1;
+}
+vector<string> connections;
+set<string> prepared_xacts;
+set<string> committed_xacts;
+bool verbose = false;
+for (int i = 1; i < argc; i++) {
+if (argv[i][0] == '-') {
+switch (argv[i][1]) {
+  case 'C':
+  case 'c':
+connections.push_back(string(argv[++i]));
+continue;
+  case 'v':
+verbose = true;
+continue;
+}
+}
+printf("Perform recovery of pg_tsdtm cluster.\n"
+   "Usage: dtm_recovery {options}\n"
+   "Options:\n"
+   "\t-c STR\tdatabase connection string\n"
+   "\t-v\tverbose mode: print extra information while processing\n");
+return 1;
+}
+if (verbose) {
+cout << "Collecting information about prepared transactions...\n";
+}
+for (vector<string>::iterator ic = connections.begin(); ic != connections.end(); ++ic)
+{
+if (verbose) {
+cout << "Connecting to " << *ic << "...\n";
+}
+connection con(*ic);
+work txn(con);
+result r = txn.exec("select gid from pg_prepared_xacts");
+for (result::const_iterator it = r.begin(); it != r.end(); ++it)
+{
+string gid = it.at("gid").as(string());
+prepared_xacts.insert(gid);
+}
+txn.commit();
+}
+if (verbose) {
+cout << "Prepared transactions: ";
+for (set<string>::iterator it = prepared_xacts.begin(); it != prepared_xacts.end(); ++it)
+{
+cout << *it << ", ";
+}
+cout << "\nChecking which of them a

Re: [HACKERS] Batch update of indexes

2016-02-03 Thread Konstantin Knizhnik

Attached please find a patch for the "ALTER INDEX ... WHERE ..." clause.
It is now able to handle all three possible situations:
1. Making an index partial (adding a WHERE condition to an ordinary index)
2. Extending the range of a partial index (a less restrictive index predicate)
3. An arbitrary change of the partial index predicate

In case 2) new records are simply added to the index.
In the other two cases the index is completely reconstructed.
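
For illustration, the three situations map onto the proposed syntax roughly as follows (index, column and predicate names are taken from examples elsewhere in this thread and are only illustrative):

   -- 1. Make an ordinary index partial
   ALTER INDEX idx1 WHERE ts < '20/01/2016';
   -- 2. Extend the range of an already partial index (only the new rows are inserted)
   ALTER INDEX idx1 WHERE ts < '21/01/2016';
   -- 3. Arbitrary change of the predicate (the index is rebuilt)
   ALTER INDEX idx1 WHERE c1 > 0;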

This patch includes the src/bin/insbench utility for testing insert 
performance. It can be easily excluded from the patch to reduce its size.
Also it is better to apply this patch together with the "index-only scans 
with partial indexes" patch:


http://www.postgresql.org/message-id/560c7213.3010...@2ndquadrant.com

as only in this case will the regression test produce the expected output.


On 27.01.2016 23:15, Robert Haas wrote:

On Wed, Jan 20, 2016 at 4:28 AM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

Please notice that such alter table statement, changing condition for
partial index, is not supported now.
But I do not see any principle problems with supporting such construction.
We should just include in the index all records which match new condition
and do not match old condition:

ts < '21/01/2016' and not (ts < '20/01/2016')

You'd also need to remove any rows from the index that match the old
condition but not the new one.  In your example, that's impossible,
but in general, it's definitely possible.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/commands/indexcmds.c b/src/backend/commands/indexcmds.c
index b450bcf..b6ffb19 100644
--- a/src/backend/commands/indexcmds.c
+++ b/src/backend/commands/indexcmds.c
@@ -32,6 +32,7 @@
 #include "commands/tablespace.h"
 #include "mb/pg_wchar.h"
 #include "miscadmin.h"
+#include "funcapi.h"
 #include "nodes/nodeFuncs.h"
 #include "optimizer/clauses.h"
 #include "optimizer/planner.h"
@@ -50,6 +51,9 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
+#include "utils/ruleutils.h"
+#include "executor/executor.h"
+#include "executor/spi.h"
 
 
 /* non-export function prototypes */
@@ -275,6 +279,160 @@ CheckIndexCompatible(Oid oldId,
 	return ret;
 }
 
+static void 
+UpdateIndex(Oid indexRelationId, Node* whereClause)
+{
+	Datum		values[Natts_pg_index];
+	bool		isnull[Natts_pg_index];
+	HeapTuple   oldTuple;
+	HeapTuple   newTuple;
+	Relation	pg_index;
+
+	pg_index = heap_open(IndexRelationId, RowExclusiveLock);
+	oldTuple = SearchSysCacheCopy1(INDEXRELID, ObjectIdGetDatum(indexRelationId));
+	if (!HeapTupleIsValid(oldTuple))
+		elog(ERROR, "cache lookup failed for index %u", indexRelationId);
+
+	heap_deform_tuple(oldTuple, RelationGetDescr(pg_index), values, isnull);
+	values[Anum_pg_index_indpred - 1] = CStringGetTextDatum(nodeToString(whereClause));
+	isnull[Anum_pg_index_indpred - 1] = false;
+	newTuple = heap_form_tuple(RelationGetDescr(pg_index), values, isnull);
	simple_heap_update(pg_index, &oldTuple->t_self, newTuple);
+	CatalogUpdateIndexes(pg_index, newTuple);
+	heap_freetuple(newTuple);
+	heap_freetuple(oldTuple);
+	heap_close(pg_index, NoLock);
+}
+
+void
+AlterIndex(Oid indexRelationId, IndexStmt *stmt)
+{
+	char* select;
+	Oid heapRelationId;
+	IndexUniqueCheck checkUnique;
+	Datum		values[INDEX_MAX_KEYS];
+	bool		isnull[INDEX_MAX_KEYS];
+	Relation heapRelation;
+	Relation indexRelation;
+SPIPlanPtr plan;
+Portal portal;
+	HeapTuple tuple;
+	TupleTableSlot *slot;
+	ItemPointer tupleid;
+	IndexInfo  *indexInfo;
+	EState *estate;
+	Oid	namespaceId;
+	List*   deparseCtx;
+	char*   oldIndexPredicate;
+	char*   newIndexPredicate;
+	char*   relationName;
+
+	Assert(stmt->whereClause);
+	CheckPredicate((Expr *) stmt->whereClause);
+
+	/* Open and lock the parent heap relation */
+	heapRelationId = IndexGetRelation(indexRelationId, false);
+	heapRelation = heap_open(heapRelationId, AccessShareLock);
+
+	/* Open the target index relation */
+	/*	indexRelation = index_open(indexRelationId, RowExclusiveLock); */
+	indexRelation = index_open(indexRelationId, ShareUpdateExclusiveLock);
+	/* indexRelation = index_open(indexRelationId, AccessShareLock); */
+	namespaceId = RelationGetNamespace(indexRelation);
+
+	indexInfo = BuildIndexInfo(indexRelation);
+	Assert(!indexInfo->ii_ExclusionOps);
+ 
+	/*
+	 * Generate the constraint and default execution states
+	 */
+	estate = CreateExecutorState();
+
+	checkUnique = indexRelation->rd_index->indisunique ? UNIQUE_CHECK_YES : UNIQUE_CHECK_NO;
+
+	slot = MakeSingleTupleTableSlot(RelationGetDescr(heapRelation));
+	
+	deparseCtx = deparse_context_for(RelationGetRelationName(heapRelation), heapRelationId);
+	relationName = quote_qualified_identifier(get_namespace_name(namespaceId),
+			  get_

Re: [HACKERS] Batch update of indexes

2016-02-03 Thread konstantin knizhnik

On Feb 4, 2016, at 2:00 AM, Jim Nasby wrote:

> 
> My suspicion is that it would be useful to pre-order the new data before 
> trying to apply it to the indexes.

Sorry, but ALTER INDEX is expected to work for all index types, not only B-Tree, 
and for some of them sorting may not be possible...
But for B-Tree, presorting the inserted data should certainly increase performance. 
I will think about it.



> -- 
> Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
> Experts in Analytics, Data Architecture and PostgreSQL
> Data in Trouble? Get it in Treble! http://BlueTreble.com
> 
> 





Re: [HACKERS] PATCH: index-only scans with partial indexes

2016-02-02 Thread Konstantin Knizhnik
I have applied this patch to our working branch, and over several 
weeks we ran various tests and benchmarks.

We have not noticed any problems or performance degradation.
And for some queries this patch causes a very significant increase in 
performance - ten times:
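
For reference, the plans below seem to assume roughly the following schema and partial index (a reconstruction; the original message does not show the definitions, only the constant 1454434742881892 from the query):

   create table t(pk bigint primary key, k1 integer);
   create index idx1 on t(k1) where pk < 1454434742881892;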


With this patch:

postgres=# explain analyze select count(*) from t where k1<100 and 
pk < 1454434742881892;

   QUERY PLAN

 Aggregate  (cost=29.65..29.66 rows=1 width=0) (actual 
time=0.108..0.108 rows=1 loops=1)
   ->  Index Only Scan using idx1 on t  (cost=0.43..27.49 rows=861 
width=0) (actual time=0.012..0.071 rows=963 loops=1)

 Index Cond: (k1 < 100)
 Heap Fetches: 0
 Planning time: 0.100 ms
 Execution time: 0.121 ms
(6 rows)


Without patch:

postgres=# explain analyze select count(*) from t where k1<100 and 
pk < 1454434742881892;

   QUERY PLAN

 Aggregate  (cost=2951.55..2951.56 rows=1 width=0) (actual 
time=1.070..1.070 rows=1 loops=1)
   ->  Bitmap Heap Scan on t  (cost=19.10..2949.40 rows=861 width=0) 
(actual time=0.161..0.997 rows=963 loops=1)
 Recheck Cond: ((k1 < 100) AND (pk < 
'1454434742881892'::bigint))

 Heap Blocks: exact=954
 ->  Bitmap Index Scan on idx1  (cost=0.00..18.88 rows=861 
width=0) (actual time=0.083..0.083 rows=963 loops=1)

   Index Cond: (k1 < 100)
 Planning time: 0.099 ms
 Execution time: 1.089 ms
(8 rows)




On 01.02.2016 01:11, Alvaro Herrera wrote:

Konstantin Knizhnik wrote:

I am very interested in this patch because it allows to use partial indexes to 
... speed up inserts.
I have implemented "ALTER INDEX ... WHERE ..." construction which allows to 
change predicate of partial index without necessity to fully rebuild it.
So it is not necessary to insert new records in index immediately (if new 
records do not match partial index conditions).
It can be done later in background (or at night). My experiments show that it 
allows to increase insert speed five times (for either partial indexes).
At the same time we do not loose RDBMS requirement that result of query should 
not depend on presence of indexes. And it is applicable to all indexes: B-Tree, 
GIN, GIST,...

But such optimization makes sense only of partial indexes can be used without 
extra overhead, first of all for index-only scans.
And it is impossible without this patch.

That sounds interesting.  So please review this patch and let us know
whether you like it, or whether you have any better ideas for any
particular hunk, or whether you think it should be rewritten from
scratch, or just state that it is perfect.  If you think it's useful,
then it's a good idea to vouch for it to be integrated; and one way of
doing that is making sure it meets the quality standards etc.  If you
don't say anything about the patch, we may never integrate it because we
might have doubts about whether it's worthy.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] PATCH: index-only scans with partial indexes

2016-01-31 Thread Konstantin Knizhnik

I am very interested in this patch because it allows partial indexes to be used to 
... speed up inserts.
I have implemented an "ALTER INDEX ... WHERE ..." construction which allows the 
predicate of a partial index to be changed without the necessity of fully rebuilding it.
So it is not necessary to insert new records into the index immediately (if the new 
records do not match the partial index condition).
It can be done later in the background (or at night). My experiments show that it 
allows insert speed to be increased five times (for eight partial indexes).
At the same time we do not lose the RDBMS requirement that the result of a query should 
not depend on the presence of indexes. And it is applicable to all indexes: B-Tree, 
GIN, GIST,...

But such an optimization makes sense only if partial indexes can be used without 
extra overhead, first of all for index-only scans.
And that is impossible without this patch.






On 01/31/2016 03:34 PM, Alvaro Herrera wrote:

Tomas Vondra wrote:


On 12/24/2015 04:05 AM, Michael Paquier wrote:

Tomas, are you still working on that? This thread is stalling for 3 weeks.

I haven't discovered anything interesting during the testing, so I guess the
"needs review" state is appropriate. Let's move the patch to the next
commitfest.

Not sure what to do here, since this patch got no feedback at all in
this CF.  The right thing to do, ISTM, is to just move it again to the
next CF.  But it'd be really useful if someone can have it a look and
verify at least whether it doesn't need a rebase (requiring a further
submission) so that other people can play with it.  Of course, if
Horiguchi-san or anyone has more review comments, that would be even
better.

Tomas said he'd do more testing, but we never got a report on whether
anything turned up.

(At this point I'm not sure if either Kyotaro or Tomas should be
considered the patch author ... maybe both?)




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Optimizer questions

2016-01-30 Thread Konstantin Knizhnik

Unfortunately these two statements are not equivalent: the second one can (in 
theory, but not for this particular data set) return more than 10 result 
records.
Such an optimization will be correct only if t2.i is declared as unique.
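
For example, a declared uniqueness constraint like the following (hypothetical here, since the test tables were created without one) would guarantee that the left join cannot multiply rows, making the rewrite safe:

   ALTER TABLE t2 ADD CONSTRAINT t2_i_key UNIQUE (i);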

But the most efficient plan for this query will be generated if there is an index 
on t1.v.
In this case no explicit sort is needed. The limit is still not pushed down, but that 
is not a problem because a nested loop join is used, which does not materialize its 
result
(it produces records on demand):
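
The plan below references an index named idxv; its definition is not shown in the original message, but it is presumably something along these lines:

   create index idxv on t1(v);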

# explain analyze select * from t1 left outer join t2 on t1.k=t2.k order by 
t1.v limit 10;
  QUERY PLAN
--
 Limit  (cost=0.58..4.38 rows=10 width=16) (actual time=0.069..0.157 rows=10 
loops=1)
   ->  Nested Loop Left Join  (cost=0.58..37926.63 rows=11 width=16) 
(actual time=0.067..0.154 rows=10 loops=1)
 ->  Index Scan using idxv on t1  (cost=0.29..3050.31 rows=11 
width=8) (actual time=0.046..0.053 rows=10 loops=1)
 ->  Index Scan using t2_pkey on t2  (cost=0.29..0.34 rows=1 width=8) 
(actual time=0.007..0.007 rows=1 loops=10)
   Index Cond: (t1.k = k)
 Planning time: 0.537 ms
 Execution time: 0.241 ms
(7 rows)


On 01/30/2016 01:01 AM, Alexander Korotkov wrote:

On Fri, Jan 8, 2016 at 11:58 AM, Konstantin Knizhnik <k.knizh...@postgrespro.ru 
<mailto:k.knizh...@postgrespro.ru>> wrote:

Attached please find improved version of the optimizer patch for LIMIT 
clause.
Now I try to apply this optimization only for non-trivial columns requiring 
evaluation.
May be it will be better to take in account cost of this columns evaluation 
but right now I just detect non-variable columns.


We may think about more general feature: LIMIT pushdown. In the Konstantin's 
patch planner push LIMIT before tlist calculation.
But there are other cases when calculating LIMIT first would be beneficial. For 
instance, we can do LIMIT before JOINs. That is possible only for LEFT JOIN 
which is not used in filter and ordering clauses. See the example below.

# create table t1 as (select i, random() v from generate_series(1,100) i);
SELECT 100

# create table t2 as (select i, random() v from generate_series(1,100) i);
SELECT 100

# explain analyze select * from t1 left join t2 on t1.i = t2.i order by t1.v 
limit 10;
  QUERY PLAN

 Limit  (cost=87421.64..87421.67 rows=10 width=24) (actual 
time=1486.276..1486.278 rows=10 loops=1)
   ->  Sort  (cost=87421.64..89921.64 rows=100 width=24) (actual 
time=1486.275..1486.275 rows=10 loops=1)
 Sort Key: t1.v
 Sort Method: top-N heapsort  Memory: 25kB
 ->  Hash Left Join  (cost=27906.00..65812.00 rows=100 width=24) 
(actual time=226.180..1366.238 rows=100
   Hash Cond: (t1.i = t2.i)
 ->  Seq Scan on t1  (cost=0.00..15406.00 rows=100 width=12) (actual 
time=0.010..77.041 rows=100 l
 ->  Hash  (cost=15406.00..15406.00 rows=100 width=12) (actual 
time=226.066..226.066 rows=100 loop
 Buckets: 131072  Batches: 1  Memory Usage: 46875kB
 ->  Seq Scan on t2  (cost=0.00..15406.00 rows=100 width=12) (actual 
time=0.007..89.002 rows=100
 Planning time: 0.123 ms
 Execution time: 1492.118 ms
(12 rows)

# explain analyze select * from (select * from t1 order by t1.v limit 10) t1 
left join t2 on t1.i = t2.i;
  QUERY PLAN

 Hash Right Join  (cost=37015.89..56171.99 rows=10 width=24) (actual 
time=198.478..301.278 rows=10 loops=1)
   Hash Cond: (t2.i = t1.i)
   ->  Seq Scan on t2  (cost=0.00..15406.00 rows=100 width=12) (actual 
time=0.005..74.303 rows=100 loops=1)
   ->  Hash  (cost=37015.77..37015.77 rows=10 width=12) (actual 
time=153.584..153.584 rows=10 loops=1)
 Buckets: 1024  Batches: 1  Memory Usage: 1kB
 ->  Limit  (cost=37015.64..37015.67 rows=10 width=12) (actual 
time=153.579..153.580 rows=10 loops=1)
 ->  Sort  (cost=37015.64..39515.64 rows=100 width=12) (actual 
time=153.578..153.578 rows=10 loops=1)
 Sort Key: t1.v
 Sort Method: top-N heapsort  Memory: 25kB
 ->  Seq Scan on t1  (cost=0.00..15406.00 rows=100 width=12) (actual 
time=0.012..78.828 rows=100
 Planning time: 0.132 ms
 Execution time: 301.308 ms
(12 rows)

In this example LIMIT pushdown makes query 5 times faster. It would be very 
nice if optimizer make this automatically.

--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com <http://www.postgrespro.com/>
The Russian Postgres Comp

Re: [HACKERS] Batch update of indexes

2016-01-26 Thread Konstantin Knizhnik

Hi hackers,

I have implemented an "ALTER INDEX ... WHERE ..." clause allowing the 
condition of a partial index to be changed.

Actually it allows us to append to an index without fully rebuilding it.
As I explained in the previous mails, partial indexes can be used to 
increase insert speed.

Right now I get the following results (with one insert stream):

Insert with 1 index (primary key, monotonically ascending):         324275 TPS
Insert with 9 indexes (primary key + 8 indexes with random keys):    52495 TPS
Insert with primary key and 8 concurrently updated partial indexes: 194458 TPS
Insert with primary key and 8 "frozen" partial indexes:             278446 TPS


So, as you can see insert with indexes is about 6 times slower than 
insert without indexes.

And partial indexes allows to eliminate  this gap.
When partial indexes are not affected (assuming that them will be 
reconstructed "at night"),

performance is almost the same, as without indexes.
And if "ALTER INDEX" is done concurrently with inserts, it certainly 
decrease insert speed,

but still it is 4 times faster than with normal indexes.
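
Using the example table from the quoted message below, the intended workflow is roughly the following (the advanced predicate is only illustrative):

   -- index is "frozen" at some boundary; rows with ts >= '20/01/2016' do not touch it
   create index idx1 on t(c1) where ts < '20/01/2016';
   -- fast inserts happen here...
   -- later ("at night") the boundary is advanced; only the newly covered rows are indexed
   ALTER INDEX idx1 WHERE ts < '21/01/2016';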

Such high TPS values were obtained using "insert from select" to bypass 
libpq overhead.
With libpq (when each insert is sent as an independent statement) the results 
are less impressive:


Insert with 1 index (primary key, monotonically ascending):          37892 TPS
Insert with 9 indexes (primary key + 8 indexes with random keys):    20231 TPS
Insert with primary key and 8 concurrently updated partial indexes:  26934 TPS
Insert with primary key and 8 "frozen" partial indexes:              28863 TPS


But partial indexes still allow the two-times difference to be almost 
eliminated...


These results can be reproduced using our public repository:
https://github.com/postgrespro/postgres_cluster

Most of the code related to support of "ALTER INDEX .. WHERE" is in the 
AlterIndex function in
postgres_cluster/src/backend/commands/indexcmds.c
I have also added the insbench utility for measuring insert performance, 
with which these results were obtained.
It is located in the postgres_cluster/src/bin/insbench directory.

Known issues:
1. I do not handle the case when the new condition for a partial index is more 
restrictive than the original one.
There is no way in Postgres to exclude records from an index (except 
VACUUM), so in this case the index has to be reconstructed from scratch.
2. Currently I am using SPI to locate the records which should be included 
in the index.
3. I am not completely sure that there are no 
synchronization/isolation problems in the AlterIndex function.


If this approach is considered to be interesting by the community, I will 
try to address these issues.



On 20.01.2016 12:28, Konstantin Knizhnik wrote:

Hi hackers,

I want to know the opinion of the community about possible ways of solving 
a quite common problem: increasing insert speed while still providing 
indexes for efficient execution of queries.


Many applications have to deal with a high input stream of data. Most of 
the time spent while inserting a record into the database is taken by updating 
the indexes. And without indexes we are not able to execute queries 
efficiently.
So in many cases it is desirable to have "batch or concurrent" index 
updates. And it is acceptable that an index is slightly behind the current 
state of the table.


One interesting approach of solving this problem is discussed in this 
article:


https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb

They are using materialized views to build indexes in the background.
Interesting idea, but copying the content of the whole table just to be 
able to build an index concurrently seems to be overkill.


I thought about more straightforward ways of solving this problem. It 
would be nice if we could preserve one of the main postulates of Postgres 
and other RDBMSes: indexes are just an optimization, and the result of a query 
should not depend on the presence of indexes.


The first idea is to use inheritance. I have investigated different ways 
of splitting a table into "archival" and "operational" parts, but all of 
them require physical copying of data from one table to another.


Another idea is to use partial indexes 
(http://www.postgresql.org/docs/current/static/indexes-partial.html).
Assume that we have a stream of input data where each record has an 
increasing timestamp:


  create table t(
 ts timestamp primary key,
 c1 real,
 c2 integer,
 c3 varchar,
 ...
 cN char(5)
  );

We want to provide the highest insert speed for "t" but still provide 
indexes on the c1..cN fields.

We can declare partial indexes:

  create index idx1 on t(c1) where ts < '20/01/2016';
  create index idx2 on t(c2) where ts < '20/01/2016';
  ...
  create index idxN on t(cN) where ts < '20/01/2016';

As far as these indexes do not cover the current date, they will not be 
affected

Re: [HACKERS] Batch update of indexes

2016-01-21 Thread Konstantin Knizhnik



On 21.01.2016 10:14, Simon Riggs wrote:
On 21 January 2016 at 06:41, konstantin knizhnik 
<k.knizh...@postgrespro.ru <mailto:k.knizh...@postgrespro.ru>> wrote:


Certainly for B-Tree we can organize insert buffer (or pending
list) as sorted array or also as a tree.
But in both case complexity of search in this buffer will be
O(log(N)), not O(1).


You are right; search within this buffer is O(log(N)). But we can test 
whether we need to search in the buffer at all with O(1).


Only if records are inserted in key-ascending order...
But usually we need to maintain more than one index, and for them it 
is not true.




--
Simon Riggs http://www.2ndQuadrant.com/ <http://www.2ndquadrant.com/>
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Batch update of indexes

2016-01-21 Thread Konstantin Knizhnik



On 21.01.2016 19:09, Anastasia Lubennikova wrote:
What I meant is more like a BRIN-like combination of an index scan and 
heap scan.
Maybe it could be called "deferred inserts" or "temporary read-only 
index"
Maybe it's similar with mysql insert buffer 
http://dev.mysql.com/doc/refman/5.7/en/innodb-insert-buffering.html

I think it'll be more clear with example. Please don't care about syntax.

CREATE TABLE tbl (c1 int);
CREATE INDEX idx on tbl(c1);

SET enable_deferred_insert(idx) = on;
At this moment, we save the last_indexed_item (its TID) somewhere in 
index metapage.


Since that moment, the data inserted into the table doesn't touch the 
index.
We perform some heavy insert and then go back to the normal index 
behavior.


SET enable_deferred_insert(idx) = off;
This command takes all the data between the last_indexed_item and the 
end of the table, and inserts it into the index at a time.


Of course there are new problems to deal with, but it's really useful 
for the use case to balance irregular heavy write load, isn't it?


BTW, could you explain, what is the reason to copy data into the 
pending list and then copy it again while flushing pending list into 
the index? Why not read this data directly from the table? I feel that 
I've missed something important here.


No, I do not think that inserted data should be placed in a pending list 
and then copied to the main table.
It should be stored directly in the main table, and the "pending list" is 
just some fast, transient index.


From my point of view there are two possibilities:
1. Preserve strict table-index consistency: query results should not 
depend on the presence of the index.
2. Support out-of-date or deferred indexes, which can be updated in 
the background.


The second approach is certainly more efficient and IMHO it is acceptable for 
most users.

But we would be violating one of the fundamental properties of RDBMSes...
So I am not sure which approach to choose.

The first case is also harder to implement, because we have to somehow merge 
two indexes during an index scan
and provide proper recovery of the main index in case of failure (assuming 
that the pending list is maintained in memory and is lost after a fault).



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Batch update of indexes

2016-01-20 Thread konstantin knizhnik

On Jan 21, 2016, at 5:14 AM, Simon Riggs wrote:

> On 20 January 2016 at 14:55, Konstantin Knizhnik <k.knizh...@postgrespro.ru> 
> wrote:
> Hi,
> 
> Hi, I glad to see that you interested in that too.
> I think this is a good feature and I think it will be very useful to have.
> I have already mentioned some related problems and possible improvements in 
> my presentation.
> http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts
> Last two slides concern to this thread. Briefly, I've suggested to think 
> about insertion buffer. Something very similar to it is already implemented 
> in BRIN. It does not index last data from heap, while the number of last 
> pages is less than pages_per_block.
> 
> Do you mean GIN-like usage of insertion buffer (here it is called "pending 
> list")?
> So that we have to combine search in the main tree and in the insert buffer?
> Actually this is what I want to avoided (because at least in case of GIN 
> pending list cause significant degrade of performance,
> while up-to-date state of full text index is rarely required).
> 
> Degrade in performance is because scan of pending list is O(N).
> 
> If we did the same thing for monotonic inserts into a btree, the performance 
> of ruling out any contents in the pending list would be O(1), so it is more 
> feasible than you say.

Sorry, didn't get it.
Certainly for B-Tree we can organize the insert buffer (or pending list) as a sorted 
array or also as a tree.
But in both cases the complexity of a search in this buffer will be O(log(N)), not 
O(1).
O(1) is possible only if we use a hash table for the pending list and are lucky 
with the hash function.
But in that case it will not be possible to support range queries and proper 
ordering.

In any case, the necessity to perform two searches instead of one and combine the 
results will significantly complicate the index implementation.
But definitely this solution is more convenient and transparent for users...


>  
> -- 
> Simon Riggshttp://www.2ndQuadrant.com/
> PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Batch update of indexes

2016-01-20 Thread Konstantin Knizhnik

Hi,


Hi, I glad to see that you interested in that too.
I think this is a good feature and I think it will be very useful to have.
I have already mentioned some related problems and possible 
improvements in my presentation.

http://www.slideshare.net/AnastasiaLubennikova/indexes-dont-mean-slow-inserts
Last two slides concern to this thread. Briefly, I've suggested to 
think about insertion buffer. Something very similar to it is already 
implemented in BRIN. It does not index last data from heap, while the 
number of last pages is less than pages_per_block.


Do you mean GIN-like usage of an insertion buffer (there it is called a 
"pending list")?

So that we have to combine a search in the main tree with a search in the insert buffer?
Actually this is what I want to avoid (because at least in the case of GIN 
the pending list causes significant performance degradation,

while an up-to-date state of the full text index is rarely required).


The next point, I've thought about is a bulk update. Problem is that 
update like "UPDATE mytable set a = a+1;" causes N searches from the 
root of B-tree. I looks very strange to me, and I'd like to fix it 
somehow. The obvious solution is to update all tuples on the page at a 
time, and keep the number of last updated page. But, maybe it's a bit 
off-thread here.


Bulk update is the second question (but a very important one).
First I just want to be able to append to the index concurrently, without blocking 
inserts.




One interesting approach of solving this problem is discussed in this 
article:


https://mark.zealey.org/2016/01/08/how-we-tweaked-postgres-upsert-performance-to-be-2-3-faster-than-mongodb

Them are using materialized views to build indexes in background.
Interesting idea, but copying content of the whole table just to be 
able to build index concurrently seems to be overkill.


This approach seems like a tricky crutch to me. And I agree that it 
requires a lot of extra work.


It will be very interesting to know how people are using materialized views.
Delayed building of indexes seems to be one of the popular use cases, 
although it requires a large overhead, first of all storage overhead.






Please notice that such alter table statement, changing condition for 
partial index, is not supported now.


Don't you think, that this feature could be used in a very wrong way? 
Do not take it as criticism, just a bit of thoughts.




Everything which can be misused will be misused :)
But I do not worry much about it...
If it can address real challenges, then it will be a good thing in any case.

Ideally we should be able to alter everything. A naive implementation of 
such an alter clause can just build a new index with a temporary name, then 
drop the old index and rename the new one.
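
A minimal sketch of that naive fallback, using only commands that already exist (index and predicate taken from earlier examples in this thread):

   CREATE INDEX CONCURRENTLY idx1_new ON t(c1) WHERE ts < '21/01/2016';
   DROP INDEX idx1;
   ALTER INDEX idx1_new RENAME TO idx1;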





There was the discussion of the patch for partial indexes.
http://postgresql.nabble.com/PATCH-index-only-scans-with-partial-indexes-td5857568.html
 Since I haven't watched it closely, It seems to be open still. I 
think it'll be interesting to you.




Such a small patch...
Why was it not accepted?
I do not see any problems with it...



--
Anastasia Lubennikova
Postgres Professional:http://www.postgrespro.com
The Russian Postgres Company


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




