Re: [HACKERS] Partition-wise aggregation/grouping

2017-11-13 Thread Konstantin Knizhnik



On 11.11.2017 23:29, Konstantin Knizhnik wrote:

On 10/27/2017 02:01 PM, Jeevan Chalke wrote:

Hi,

Attached new patch-set here. Changes include:

1. Added separate patch for costing Append node as discussed up-front in the
patch-set.
2. Since we now cost Append node, we don't need partition_wise_agg_cost_factor
GUC. So removed that. The remaining patch hence merged into main implementation
patch.
3. Updated rows in test-cases so that we will get partition-wise plans.

Thanks


I applied the partition-wise-agg-v6.tar.gz patch to the master and used the
shard.sh example from
https://www.postgresql.org/message-id/14577.1509723225%40localhost

The plan for count(*) is the following:

shard=# explain select count(*) from orders;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=100415.29..100415.30 rows=1 width=8)
   ->  Append  (cost=50207.63..100415.29 rows=2 width=8)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_0  (cost=101.00..50195.13 rows=5000 width=0)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_1  (cost=101.00..50195.13 rows=5000 width=0)



We really calculate a partial aggregate for each partition, but to do so we
still have to fetch all data from the remote host.
So for foreign partitions such a plan is absolutely inefficient.
Maybe it should be combined with some other patch?
For example, with the agg_pushdown_v4.tgz patch
https://www.postgresql.org/message-id/14577.1509723225%40localhost ?
But it does not apply on top of the partition-wise-agg-v6.tar.gz patch.
Also, postgres_fdw in 11dev is able to push down aggregates without the
agg_pushdown_v4.tgz patch.


In 0009-Teach-postgres_fdw-to-push-aggregates-for-child-rela.patch
there is the following check:

 /* Partial aggregates are not supported. */
+   if (extra->isPartial)
+   return;

If we just comment out this check, the produced plan will be the following:

shard=# explain select sum(product_id) from orders;
                           QUERY PLAN
----------------------------------------------------------------
 Finalize Aggregate  (cost=308.41..308.42 rows=1 width=8)
   ->  Append  (cost=144.18..308.41 rows=2 width=8)
         ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
               Relations: Aggregate on (public.orders_0 orders)
         ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
               Relations: Aggregate on (public.orders_1 orders)
(6 rows)

And this is actually the desired plan!
Obviously such an approach will not always work: the FDW really doesn't
support partial aggregates now.
But for the most frequently used aggregates (sum, min, max, count)
aggtype == aggtranstype, so there is no difference between partial and normal
aggregate calculation.
So instead of the (extra->isPartial) condition we can add a more complex check
which will traverse the pathtarget expressions and check whether they can be
evaluated this way. Or... extend the FDW API to support partial aggregation.


But even this last plan is not ideal: the remote nodes will be queried
sequentially, one after another.

There is the parallel append patch:
https://www.postgresql.org/message-id/CAJ3gD9ctEcrVUmpY6fq_JUB6WDKGXAGd70EY68jVFA4kxMbKeQ%40mail.gmail.com

but... the FDW doesn't support parallel scan, so parallel append can not be
applied in this case.
And we actually do not need parallel append with all its dynamic workers here.
We just need to start the commands at all remote servers and only after that
fetch the results (which can be done sequentially).


I am investigating the problem of efficient execution of OLAP queries on
sharded tables (tables with remote partitions).

After reading all these threads and the corresponding patches, it seems to me
that we already have most pieces of the puzzle; what we need is to put them in
the right places and maybe add the missing ones.

I wonder whether somebody is already busy with this, and whether I can somehow
help here?

Also, I am not quite sure about the best approach to parallel execution of a
distributed query at all nodes.
Should we make postgres_fdw parallel-safe and use parallel append? How
difficult would that be?
Or, in addition to parallel append, should we also have an "asynchronous
append" which would be able to initiate execution at all nodes?
It seems to be close to merge append, because it has to traverse all cursors
simultaneously.

The second approach looks easier to implement. But in the case of a sharded
table a distributed query may need to traverse both remote and local shards,
and this approach doesn't allow several local shards to be processed in
parallel.




I attach a small patch for postgres_fdw.c which allows concurrent execution
of aggregates by all remote servers (when they are accessed through
postgres_fdw).
I have added a "postgres_fdw.use_prefetch" GUC t

Re: [HACKERS] Partition-wise aggregation/grouping

2017-11-11 Thread Konstantin Knizhnik

On 10/27/2017 02:01 PM, Jeevan Chalke wrote:

Hi,

Attached new patch-set here. Changes include:

1. Added separate patch for costing Append node as discussed up-front in the
patch-set.
2. Since we now cost Append node, we don't need partition_wise_agg_cost_factor
GUC. So removed that. The remaining patch hence merged into main implementation
patch.
3. Updated rows in test-cases so that we will get partition-wise plans.

Thanks


I applied the partition-wise-agg-v6.tar.gz patch to the master and used the shard.sh
example from https://www.postgresql.org/message-id/14577.1509723225%40localhost
The plan for count(*) is the following:

shard=# explain select count(*) from orders;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=100415.29..100415.30 rows=1 width=8)
   ->  Append  (cost=50207.63..100415.29 rows=2 width=8)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_0  (cost=101.00..50195.13 rows=5000 width=0)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_1  (cost=101.00..50195.13 rows=5000 width=0)


We really calculate a partial aggregate for each partition, but to do so we still
have to fetch all data from the remote host.
So for foreign partitions such a plan is absolutely inefficient.
Maybe it should be combined with some other patch?
For example, with the agg_pushdown_v4.tgz patch
https://www.postgresql.org/message-id/14577.1509723225%40localhost ?
But it does not apply on top of the partition-wise-agg-v6.tar.gz patch.
Also, postgres_fdw in 11dev is able to push down aggregates without the
agg_pushdown_v4.tgz patch.

In 0009-Teach-postgres_fdw-to-push-aggregates-for-child-rela.patch
there is the following check:

 /* Partial aggregates are not supported. */
+   if (extra->isPartial)
+   return;

If we just comment out this check, the produced plan will be the following:

shard=# explain select sum(product_id) from orders;
                           QUERY PLAN
----------------------------------------------------------------
 Finalize Aggregate  (cost=308.41..308.42 rows=1 width=8)
   ->  Append  (cost=144.18..308.41 rows=2 width=8)
         ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
               Relations: Aggregate on (public.orders_0 orders)
         ->  Foreign Scan  (cost=144.18..154.20 rows=1 width=8)
               Relations: Aggregate on (public.orders_1 orders)
(6 rows)

And this is actually the desired plan!
Obviously such an approach will not always work: the FDW really doesn't support
partial aggregates now.
But for the most frequently used aggregates (sum, min, max, count)
aggtype == aggtranstype, so there is no difference between partial and normal
aggregate calculation.
So instead of the (extra->isPartial) condition we can add a more complex check
which will traverse the pathtarget expressions and check whether they can be
evaluated this way. Or... extend the FDW API to support partial aggregation.
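
For illustration only (just a catalog query, not part of the patch), one can
see which of the common aggregates satisfy this aggtype == aggtranstype
condition:

-- Show, per aggregate, whether the result type equals the transition state
-- type, i.e. whether a remote "partial" aggregate returns exactly the same
-- value as a normal one.
select p.oid::regprocedure            as aggregate,
       p.prorettype::regtype          as aggtype,
       a.aggtranstype::regtype        as aggtranstype,
       p.prorettype = a.aggtranstype  as same_representation
from pg_aggregate a
join pg_proc p on p.oid = a.aggfnoid
where p.proname in ('count', 'sum', 'min', 'max')
order by p.proname;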

But even this last plan is not ideal: the remote nodes will be queried
sequentially, one after another.
There is the parallel append patch:
https://www.postgresql.org/message-id/CAJ3gD9ctEcrVUmpY6fq_JUB6WDKGXAGd70EY68jVFA4kxMbKeQ%40mail.gmail.com
but... the FDW doesn't support parallel scan, so parallel append can not be
applied in this case.
And we actually do not need parallel append with all its dynamic workers here.
We just need to start the commands at all remote servers and only after that
fetch the results (which can be done sequentially).

I am investigating the problem of efficient execution of OLAP queries on sharded
tables (tables with remote partitions).
After reading all these threads and the corresponding patches, it seems to me
that we already have most pieces of the puzzle; what we need is to put them in
the right places and maybe add the missing ones.
I wonder whether somebody is already busy with this, and whether I can somehow help here?

Also, I am not quite sure about the best approach to parallel execution of a
distributed query at all nodes.
Should we make postgres_fdw parallel-safe and use parallel append? How
difficult would that be?
Or, in addition to parallel append, should we also have an "asynchronous append"
which would be able to initiate execution at all nodes?
It seems to be close to merge append, because it has to traverse all cursors
simultaneously.

The second approach looks easier to implement. But in the case of a sharded
table a distributed query may need to traverse both remote and local shards,
and this approach doesn't allow several local shards to be processed in
parallel.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Aggregates push-down to partitions

2017-11-10 Thread Konstantin Knizhnik



On 10.11.2017 12:15, Ashutosh Bapat wrote:

Maybe the problem you described is solved in this thread[1] through
introducing a Parallel Append node?

1.
https://www.postgresql.org/message-id/CAJ3gD9dy0K_E8r727heqXoBmWZ83HwLFwdcaSSmBQ1%2BS%2BvRuUQ%40mail.gmail.com

You may want to review [2] and [3] as well.

[2] https://www.postgresql.org/message-id/9666.1491295317@localhost
[3] 
https://www.postgresql.org/message-id/CAM2+6=V64_xhstVHie0Rz=kpeqnljmzt_e314p0jat_oj9m...@mail.gmail.com

Thank you very much for these references.
I applied the partition-wise-agg-v6 patches, and for partitioned tables it
works perfectly:


shard=# explain select count(*) from orders;
                                       QUERY PLAN
-----------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=100415.29..100415.30 rows=1 width=8)
   ->  Append  (cost=50207.63..100415.29 rows=2 width=8)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_0  (cost=101.00..50195.13 rows=5000 width=0)
         ->  Partial Aggregate  (cost=50207.63..50207.64 rows=1 width=8)
               ->  Foreign Scan on orders_1  (cost=101.00..50195.13 rows=5000 width=0)
(6 rows)

But I wonder why the same optimization is not applied to a normal
inherited table:


shard=# explain select count(*) from base;
                                 QUERY PLAN
----------------------------------------------------------------------------------
 Aggregate  (cost=44087.99..44088.00 rows=1 width=8)
   ->  Append  (cost=0.00..39079.46 rows=2003414 width=0)
         ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=0)
         ->  Seq Scan on derived1  (cost=0.00..14425.00 rows=100 width=0)
         ->  Seq Scan on derived2  (cost=0.00..14425.00 rows=100 width=0)
         ->  Foreign Scan on derived_fdw  (cost=100.00..212.39 rows=3413 width=0)
(6 rows)

Are there any fundamental problems with that?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





[HACKERS] Aggregates push-down to partitions

2017-11-09 Thread Konstantin Knizhnik

There is a huge thread concerning pushing down aggregates to the FDW:

https://www.postgresql.org/message-id/flat/CAFjFpRcnueviDpngJ3QSVvj7oyukr9NkSiCspqd4N%2BdCEdvYvg%40mail.gmail.com#cafjfprcnuevidpngj3qsvvj7oyukr9nksicspqd4n+dcedv...@mail.gmail.com

but as far as I understand, nothing is done yet for efficient calculation of
aggregates over a partitioned table.
In case of local partitions it is somehow compensated by the parallel query
plan:


postgres=# create table base(x integer);
CREATE TABLE
postgres=# create table derived1() inherits (base);
CREATE TABLE
postgres=# create table derived2() inherits (base);
CREATE TABLE
postgres=# insert into derived1  values (generate_series(1,100));
INSERT 0 100
postgres=# insert into derived2  values (generate_series(1,100));
INSERT 0 100
postgres=# explain select sum(x) from base;
                                          QUERY PLAN
-----------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=12176.63..12176.64 rows=1 width=8)
   ->  Gather  (cost=12176.59..12176.61 rows=8 width=8)
         Workers Planned: 8
         ->  Partial Aggregate  (cost=12175.59..12175.60 rows=1 width=8)
               ->  Append  (cost=0.00..11510.47 rows=266048 width=4)
                     ->  Parallel Seq Scan on base  (cost=0.00..0.00 rows=1 width=4)
                     ->  Parallel Seq Scan on derived1  (cost=0.00..5675.00 rows=125000 width=4)
                     ->  Parallel Seq Scan on derived2  (cost=0.00..5835.47 rows=141047 width=4)
(8 rows)

It is still far from the ideal plan, because each worker is working with all
partitions, instead of splitting the partitions between workers and calculating
partial aggregates for each partition.

But if we add a foreign table as a child of the parent table, then parallel
scan can not be used and we get the worst possible plan:


postgres=# create foreign table derived_fdw() inherits(base) server pg_fdw options (table_name 'derived1');
CREATE FOREIGN TABLE

postgres=# explain select sum(x) from base;
                                 QUERY PLAN
----------------------------------------------------------------------------------
 Aggregate  (cost=34055.07..34055.08 rows=1 width=8)
   ->  Append  (cost=0.00..29047.75 rows=2002926 width=4)
         ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=4)
         ->  Seq Scan on derived1  (cost=0.00..14425.00 rows=100 width=4)
         ->  Seq Scan on derived2  (cost=0.00..14425.00 rows=100 width=4)
         ->  Foreign Scan on derived_fdw  (cost=100.00..197.75 rows=2925 width=4)
(6 rows)

So we sequentially pull all the data to this node and compute the aggregates
locally.
The ideal plan would calculate partial aggregates at all nodes in parallel and
then combine the partial results.

It requires two changes:
1. Replace Aggregate->Append with
Finalize_Aggregate->Append->Partial_Aggregate.
2. Concurrent execution of Append. It can be done in two different ways: we can
try to use the existing parallel worker infrastructure and replace Append with
Gather. That seems to be the best approach for local partitioning. In case of
remote (FDW) partitions, it is enough to split starting the execution
(PQsendQuery in postgres_fdw) from getting the results. So it requires some
changes in the FDW protocol.



I wonder if somebody has already investigated this problem or is working in
this direction.
Maybe there are already some patches proposed?
I have searched the hackers archive, but didn't find anything relevant...
Are there any suggestions about the best approach to implement this feature?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] SQL procedures

2017-11-08 Thread Konstantin Knizhnik



On 08.11.2017 17:23, Merlin Moncure wrote:

On Tue, Oct 31, 2017 at 12:23 PM, Peter Eisentraut
<peter.eisentr...@2ndquadrant.com> wrote:

- Transaction control in procedure bodies

This feature is really key, since it enables via SQL lots of things
that are not possible without external coding, including:
*) very long running processes in a single routine
*) transaction isolation control inside the procedure (currently
client app has to declare this)
*) certain error handling cases that require client side support
*) simple in-database threading
*) simple construction of daemon scripts (yeah, you can use bgworker
for this, but pure sql daemon with a cron heartbeat hook is hard to
beat for simplicity)

I do wonder how transaction control could be added later.

The last time I (lightly) looked at this, I was starting to think that
working transaction control into the SPI interface was the wrong
approach; pl/pgsql would have to adopt a very different set of
behaviors if it was called in a function or a proc.  If you restricted
language choice to purely SQL, you could work around this problem; SPI
languages would be totally abstracted from those sets of
considerations and you could always call an arbitrary language
function if you needed to.  SQL has no flow control but I'm not too
concerned about that.

merlin


I am also very interested in the answer to this question: how are you going
to implement transaction control inside a procedure?
Right now PostgresPro EE supports autonomous transactions. They are supported
both for SQL and for the plpgsql/plpython APIs.
They are implemented by saving/restoring the transaction context, so unlike
most other ATX implementations, in pgpro an autonomous transaction is executed
by the same backend. But it is not so easy to do: in Postgres almost every
module has its own static variables which keep transaction-specific data.
So we had to provide a dozen suspend/resume functions:
SuspendSnapshot(), SuspendPredicate(), SuspendStorage(),
SuspendInvalidationInfo(), SuspendPgXact(), PgStatSuspend(),
TriggerSuspend(), SuspendSPI()... and properly handle local cache
invalidation. The patch consists of more than 5 thousand lines.


So my question is whether you are going to implement something similar or use
a completely different approach?
In the first case it would be good to somehow unite our efforts... For example,
we could publish our ATX patch for Postgres 10.
We have not done it yet, because there seem to be no chances to push this
patch into the community.









--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Secondary index access optimizations

2017-11-06 Thread Konstantin Knizhnik

On 11/06/2017 04:27 AM, Thomas Munro wrote:

On Fri, Sep 8, 2017 at 3:58 AM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

Updated version of the patch is attached to this mail.
Also I added support of date type to operator_predicate_proof to be able to
imply (logdate <= '2017-03-31') from (logdate < '2017-04-01') .

Hi Konstantin,

Is there any reason why you don't want to split this into two separate
proposals?  One for remove_restrictions_implied_by_constraints() and
one for the operator_predicate_proof() changes.

Your v3 patch breaks the new partition_join test (the recently
committed partition-wise join stuff), as far as I can tell in a good
way.  Can you please double check those changes and post an updated
patch?


Hi Thomas.

The primary idea of this patch was to provide more efficient plans for queries
on partitioned tables.
So remove_restrictions_implied_by_constraints() removes redundant predicate
checks.
But it doesn't work for standard Postgres 10 partitioning, because there the
constraints are set using intervals with an open high boundary, and the
original version of operator_predicate_proof() is not able to handle this case.

I have explained this problem in my previous e-mails in this thread.
This is why I have changed operator_predicate_proof() to handle this case
correctly.
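
To make it concrete, here is the kind of case I mean (a made-up illustration,
not taken from the regression tests):

create table log(logdate date, payload text) partition by range (logdate);
create table log_2017_03 partition of log
    for values from ('2017-03-01') to ('2017-04-01');
-- implicit partition constraint:
--   logdate >= '2017-03-01' and logdate < '2017-04-01'

explain select * from log where logdate <= '2017-03-31';
-- The qual is implied by the constraint (there is no date between 2017-03-31
-- and 2017-04-01), so the per-row check on log_2017_03 is redundant and
-- remove_restrictions_implied_by_constraints() can drop it from the plan,
-- provided operator_predicate_proof() can prove the implication for dates.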

If you think this patch should be split into two, OK: I can do it.
I just want to note that without patching operator_predicate_proof() it may
give no positive effect for standard partitioning, which I expect to be the
most popular use case where this optimization can have an effect.


Concerning broken partition_join test: it is "expected" failure: my patch 
removes from the plans redundant checks.
So the only required action is to update expected file with results.
Attached please find updated patch.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index 4339bbf..0931af1 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -626,12 +626,12 @@ EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NULL;-- Nu
Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE (("C 1" IS NULL))
 (3 rows)
 
-EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL;-- NullTest
- QUERY PLAN  
--
+EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL and c3 is not null;-- NullTest
+QUERY PLAN
+--
  Foreign Scan on public.ft1 t1
Output: c1, c2, c3, c4, c5, c6, c7, c8
-   Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE (("C 1" IS NOT NULL))
+   Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE ((c3 IS NOT NULL))
 (3 rows)
 
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE round(abs(c1), 0) = 1; -- FuncExpr
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index ddfec79..878bfc7 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -292,7 +292,7 @@ RESET enable_nestloop;
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE t1.c1 = 1; -- Var, OpExpr(b), Const
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE t1.c1 = 100 AND t1.c2 = 0; -- BoolExpr
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NULL;-- NullTest
-EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL;-- NullTest
+EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL and c3 is not null;-- NullTest
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE round(abs(c1), 0) = 1; -- FuncExpr
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 = -c1;  -- OpExpr(l)
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE 1 = c1!;   -- OpExpr(r)
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index a5c1b68..082d1cc 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -346,6 +346,7 @@ set_rel_size(PlannerInfo *root, RelOptInfo *rel,
 		switch (rel->rtekind)
 		{
 			case RTE_RELATION:
+				remove_restrictions_implied_by_constraints(root, rel, rte);
 				if (rte->relkind == RELKIND_FOREIGN_TABLE)
 				{
 

[HACKERS] Deadlock in ALTER SUBSCRIPTION REFRESH PUBLICATION

2017-10-24 Thread Konstantin Knizhnik
CRIPTION 
sub_3_1 REFRESH PUBLICATION",
context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8, 
completionTag=0x73a43c90 "") at src/hooks.c:913
#14 0x008a80a2 in ProcessUtility (pstmt=0x24dd998, queryString=0x24dc920 
"SET SESSION synchronous_commit TO local; ALTER SUBSCRIPTION sub_3_1 REFRESH 
PUBLICATION",
context=PROCESS_UTILITY_QUERY, params=0x0, queryEnv=0x0, dest=0x24ddcb8, 
completionTag=0x73a43c90 "") at utility.c:353
#15 0x008a7075 in PortalRunUtility (portal=0x246be60, pstmt=0x24dd998, 
isTopLevel=0 '\000', setHoldSnapshot=0 '\000', dest=0x24ddcb8, 
completionTag=0x73a43c90 "") at pquery.c:1178
#16 0x008a728d in PortalRunMulti (portal=0x246be60, isTopLevel=0 '\000', 
setHoldSnapshot=0 '\000', dest=0x24ddcb8, altdest=0x24ddcb8, completionTag=0x73a43c90 
"") at pquery.c:1324
#17 0x008a6757 in PortalRun (portal=0x246be60, count=9223372036854775807, 
isTopLevel=0 '\000', run_once=1 '\001', dest=0x24ddcb8, altdest=0x24ddcb8, 
completionTag=0x73a43c90 "")
at pquery.c:799
#18 0x008a0288 in exec_simple_query (query_string=0x24dc920 "SET SESSION 
synchronous_commit TO local; ALTER SUBSCRIPTION sub_3_1 REFRESH PUBLICATION") at 
postgres.c:1099
#19 0x008a4823 in PostgresMain (argc=1, argv=0x247c2d0, dbname=0x247c2a8 
"postgres", username=0x244f870 "knizhnik") at postgres.c:4090
#20 0x00801753 in BackendRun (port=0x240) at postmaster.c:4357
#21 0x00800e5f in BackendStartup (port=0x240) at postmaster.c:4029
#22 0x007fd398 in ServerLoop () at postmaster.c:1753
#23 0x007fc92f in PostmasterMain (argc=3, argv=0x244d6e0) at 
postmaster.c:1361
#24 0x00734f08 in main (argc=3, argv=0x244d6e0) at main.c:228



The reason for this deadlock seems to be clear: ALTER SUBSCRIPTION starts a
transaction at one node and tries to create a slot at the other node, which
waits for completion of all active transactions while building a snapshot.
Is there any way to avoid this deadlock?

The same deadlock can happen in a concurrent CREATE SUBSCRIPTION with implicit
slot creation. But there it can be solved by explicit slot creation prior to
CREATE SUBSCRIPTION.
But what can I do if I want to update publications and refresh subscriptions?

Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Slow synchronous logical replication

2017-10-12 Thread Konstantin Knizhnik



On 12.10.2017 04:23, Craig Ringer wrote:

On 12 October 2017 at 00:57, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:


The reason for such behavior is obvious: the wal sender has to decode the huge
transaction generated by the insert although it has no relation to this
publication.

It does. Though I wouldn't expect anywhere near the kind of drop you
report, and haven't observed it here.

Is the CREATE TABLE and INSERT done in the same transaction?


No, the table was created in a separate transaction.
Moreover, the same effect takes place if the table is created before the start
of replication.
The problem in this case seems to be caused by spilling the decoded
transaction to a file by ReorderBufferSerializeTXN.

Please look at two profiles:
http://garret.ru/lr1.svg corresponds to normal work of pgbench with
synchronous replication to one replica;
http://garret.ru/lr2.svg is the same with concurrent execution of the huge
insert statement.


And here is the output of pgbench (the insert is started at the fifth second):

progress: 1.0 s, 10020.9 tps, lat 0.791 ms stddev 0.232
progress: 2.0 s, 10184.1 tps, lat 0.786 ms stddev 0.192
progress: 3.0 s, 10058.8 tps, lat 0.795 ms stddev 0.301
progress: 4.0 s, 10230.3 tps, lat 0.782 ms stddev 0.194
progress: 5.0 s, 10335.0 tps, lat 0.774 ms stddev 0.192
progress: 6.0 s, 4535.7 tps, lat 1.591 ms stddev 9.370
progress: 7.0 s, 419.6 tps, lat 20.897 ms stddev 55.338
progress: 8.0 s, 105.1 tps, lat 56.140 ms stddev 76.309
progress: 9.0 s, 9.0 tps, lat 504.104 ms stddev 52.964
progress: 10.0 s, 14.0 tps, lat 797.535 ms stddev 156.082
progress: 11.0 s, 14.0 tps, lat 601.865 ms stddev 93.598
progress: 12.0 s, 11.0 tps, lat 658.276 ms stddev 138.503
progress: 13.0 s, 9.0 tps, lat 784.120 ms stddev 127.206
progress: 14.0 s, 7.0 tps, lat 870.944 ms stddev 156.377
progress: 15.0 s, 8.0 tps, lat .578 ms stddev 140.987
progress: 16.0 s, 7.0 tps, lat 1258.750 ms stddev 75.677
progress: 17.0 s, 6.0 tps, lat 991.023 ms stddev 229.058
progress: 18.0 s, 5.0 tps, lat 1063.986 ms stddev 269.361

It seems to be an effect of large transactions.
The presence of several channels of synchronous logical replication reduces
performance, but not by that much.

Below are results on another machine with pgbench scale 10.

Configuration                TPS
standalone                   15k
1 async logical replica      13k
1 sync logical replica       10k
3 async logical replicas     13k
3 sync logical replicas       8k





Only partly true. The output plugin can register a transaction origin
filter and use that to say it's entirely uninterested in a
transaction. But this only works based on filtering by origins. Not
tables.
Yes, I know about the origin filtering mechanism (and we are using it in
multimaster).
But I am speaking about the standard pgoutput.c output plugin: its
pgoutput_origin_filter always returns false.




I imagine we could call another hook in output plugins, "do you care
about this table", and use it to skip some more work for tuples that
particular decoding session isn't interested in. Skip adding them to
the reorder buffer, etc. No such hook currently exists, but it'd be an
interesting patch for Pg11 if you feel like working on it.


Unfortunately it is not quite clear how to make wal-sender smarter and let
him skip transaction not affecting its publication.

As noted, it already can do so by origin. Mostly. We cannot totally
skip over WAL, since we need to process various invalidations etc. See
ReorderBufferSkip.
The problem is that before the end of a transaction we do not know whether it
touches this publication or not.
So filtering by origin will not work in this case.

I am really not sure that it is possible to skip over WAL. But the particular
problem with invalidation records etc. can be solved by always processing such
records in the WAL sender.
I.e. if a backend is inserting an invalidation record, or some other record
which should always be processed by the WAL sender, it can always promote the
LSN of this record to the WAL sender.
So the WAL sender will skip only those WAL records which are safe to skip
(insert/update/delete records not affecting this publication).

I wonder if there can be some other problems with skipping part of a
transaction in the WAL sender.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Slow synchronous logical replication

2017-10-11 Thread Konstantin Knizhnik



On 11.10.2017 10:07, Craig Ringer wrote:

On 9 October 2017 at 15:37, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

Thank you for explanations.

On 08.10.2017 16:00, Craig Ringer wrote:

I think it'd be helpful if you provided reproduction instructions,
test programs, etc, making it very clear when things are / aren't
related to your changes.


It will not be so easy to provide a reproducing scenario, because it
actually involves many components (postgres_fdw, pg_pathman,
pg_shardman, LR, ...)

So simplify it to a test case that doesn't.

The simplest reproducing scenario is the following:
1. Start two Postgres instances: synchronous_commit=on, fsync=off.
2. Initialize the pgbench database at both instances: pgbench -i
3. Create a publication for the pgbench_accounts table at one node.
4. Create the corresponding subscription at the other node with the
copy_data=false parameter.
5. Add the subscription to synchronous_standby_names at the first node
(steps 3-5 are spelled out as SQL after this list).
6. Start pgbench -c 8 -N -T 100 -P 1 at the first node. On my system the
results are the following:

standalone postgres: 8600 TPS
asynchronous replication: 6600 TPS
synchronous replication:   5600 TPS
Quite good results.
7. Create some dummy table and perform a bulk insert into it:
create table dummy(x integer primary key);
insert into dummy values (generate_series(1,1000));

pgbench almost gets stuck: until the end of the insert, performance drops
almost to zero.
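
For reference, steps 3-5 expressed as SQL (host name and subscription name are
illustrative):

-- at the first (publisher) node:
create publication pub for table pgbench_accounts;

-- at the second (subscriber) node:
create subscription sub
       connection 'host=node1 port=5432 dbname=postgres'
       publication pub with (copy_data = false);

-- back at the first node (the subscription name is used as the standby name):
alter system set synchronous_standby_names = 'sub';
select pg_reload_conf();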


The reason for such behavior is obvious: the wal sender has to decode the huge
transaction generated by the insert although it has no relation to this
publication.
Filtering of the insert records of this transaction is done only inside the
output plug-in.
Unfortunately it is not quite clear how to make the wal sender smarter and let
it skip transactions not affecting its publication.
One of the possible solutions is to let the backend inform the wal sender
about the smallest LSN it should wait for (the backend knows which table is
affected by the current operation, hence which publications are interested in
this operation, and so it can point the wal sender to the proper LSN without
decoding a huge part of WAL).

But it seems to be not so easy to implement.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Columnar storage support

2017-10-10 Thread Konstantin Knizhnik
Unfortunately, C-Store doesn't allow you to take full advantage of a columnar
store: you are still not able to perform vector operations.

C-Store allows you to reduce the amount of data read from the disk because of:
1. fetching only the columns which are used in the query,
2. data compression.

This leads to some benefit in query execution speed for cold data (when it is
not loaded in the cache).
For warm data there is almost no difference (except for very huge tables which
can not fit in memory).


But the main advantage of the vertical data format - vector data processing -
is possible only with a specialized executor.

There is a prototype of a vector executor for C-Store:
https://github.com/citusdata/postgres_vectorization_test
It provides a 3-4x speedup on some queries, but it is really a prototype and a
research project, far from practical usage.


I have also developed two columnar store extensions for Postgres:
IMCS (In-Memory Columnar Store): https://github.com/knizhnik/imcs.git
VOPS (Vectorized Operations): https://github.com/postgrespro/vops.git

The first one is more oriented towards in-memory databases (although it
supports spilling data to disk) and requires the use of special functions to
manipulate columnar data.
In this case the columnar store is a copy of the main (horizontal) store
(normal Postgres tables).

VOPS is more recent work, allowing the use of more or less normal SQL (using a
foreign data wrapper and user-defined types/operators).
In VOPS, data is stored inside normal Postgres tables, but using vectors
(tiles) instead of scalars.


Both IMCS and VOPS provide a 10-100 times speed improvement on queries like Q1
in TPC-H (sequential scan with filtering and aggregation).
On queries involving joins there is almost no benefit compared with normal
Postgres.

There is also a columnar storage extension developed by Fujitsu:
https://www.postgresql.org/message-id/cajrrpgfac7wc9nk6ptty6yn-nn+hcy8xolah2doyhvg5d6h...@mail.gmail.com
But the published patch is only a first step in this direction, and it is
possible neither to use it in practice nor to perform experiments measuring
the possible performance improvement.



On 09.10.2017 23:06, Joshua D. Drake wrote:

On 10/09/2017 01:03 PM, legrand legrand wrote:
Is there a chance that pluggable storage permits the creation of a columnar
RDBMS, like MonetDB, in PostgreSQL?
Thanks in advance for the answer


The extension C-Store from Citus is probably what you are looking for.

jD













--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Slow synchronous logical replication

2017-10-09 Thread Konstantin Knizhnik

Thank you for the explanations.

On 08.10.2017 16:00, Craig Ringer wrote:

I think it'd be helpful if you provided reproduction instructions,
test programs, etc, making it very clear when things are / aren't
related to your changes.


It will not be so easy to provide a reproducing scenario, because it
actually involves many components (postgres_fdw, pg_pathman,
pg_shardman, LR, ...)
and requires a multi-node installation.
But let me try to explain what is going on.
We have implemented sharding - splitting data between several remote tables
using pg_pathman and postgres_fdw.
It means that an insert or update of the parent table causes inserts or
updates of some derived partitions, which are forwarded by postgres_fdw to the
corresponding nodes.
The number of shards is significantly larger than the number of nodes, i.e.
for 5 nodes we have 50 shards, which means that at each node we have 10 shards.
To provide fault tolerance, each shard is replicated using logical replication
to one or more nodes. Right now we consider only redundancy level 1 - each
shard has only one replica.

So from each node we establish 10 logical replication channels.

We want commits to wait until the data is actually stored at all replicas, so
we are using synchronous replication: we set the synchronous_commit option to
"on" and include all ten subscriptions in the synchronous_standby_names list.
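
Roughly, the relevant settings at each node look like this (subscription names
are illustrative; the FIRST N / ANY N syntax can additionally control how many
of them have to acknowledge a commit):

alter system set synchronous_commit = on;
alter system set synchronous_standby_names = 'sub_shard_1, sub_shard_2, sub_shard_3, sub_shard_4, sub_shard_5, sub_shard_6, sub_shard_7, sub_shard_8, sub_shard_9, sub_shard_10';
select pg_reload_conf();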


In this setup the commit latency is very large (about 100 msec, and most of
the time is actually spent in commit) and performance is very bad - pgbench
shows about 300 TPS for the optimal number of clients (about 10; for larger
numbers performance is almost the same). Without logical replication, with the
same setup, we get about 6000 TPS.


I have checked the syncrepl.c file, particularly the SyncRepGetSyncRecPtr
function. Each WAL sender independently calculates the minimal LSN among all
synchronous replicas and wakes up the backends waiting for this LSN. It means
that a transaction performing an update of data in one shard will actually
wait for confirmation from the replication channels of all shards.
If some shard is updated more rarely than others, or is not updated at all
(for example because the communication channel to that node is broken), then
all backends will be stuck.
Also, all backends are competing for the single SyncRepLock, which can also be
a contention point.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Slow synchronous logical replication

2017-10-07 Thread Konstantin Knizhnik

On 10/07/2017 10:42 PM, Andres Freund wrote:

Hi,

On 2017-10-07 22:39:09 +0300, konstantin knizhnik wrote:

In our sharded cluster project we are trying to use logical replication for
providing HA (maintaining redundant shard copies).
Using asynchronous logical replication does not make much sense in the context
of HA. This is why we try to use synchronous logical replication.
Unfortunately it shows very bad performance. With 50 shards and redundancy
level 1 (just one copy) the cluster is 20 times slower than without logical
replication.
With asynchronous replication it is "only" two times slower.

As far as I understand, the reason for such bad performance is that the
synchronous replication mechanism was originally developed for streaming
replication, where all replicas have the same content and LSNs. When it is
used for logical replication, it behaves very inefficiently. A commit has to
wait for confirmations from all receivers mentioned in the
"synchronous_standby_names" list. So we are waiting not only for our own
single logical replication standby, but for all other standbys as well. The
number of synchronous standbys is equal to the number of shards divided by the
number of nodes. To provide uniform distribution the number of shards should
be >> the number of nodes; for example, for 10 nodes we usually create 100
shards. As a result we get awful performance, and blocking of any replication
channel blocks all backends.

So my question is whether my understanding is correct and synchronous logical
replication can not be used efficiently in such a manner.
If so, the next question is how difficult it would be to make the synchronous
replication mechanism more efficient for logical replication, and whether
there are some plans to work in this direction.

This seems to be a question that is a) about a commercial project we
don't know much about b) hasn't received a lot of investigation.


Sorry if I was not clear.
The question was about the logical replication mechanism in the mainstream
version of Postgres.
I think that most people are using asynchronous logical replication, and
synchronous LR is something exotic, not well tested and investigated.
It will be great if I am wrong :)

Concerning our sharded cluster (pg_shardman): it is not a commercial product
yet, it is in the development phase.
We are going to open its sources when it is more or less stable.
But unlike multimaster, this sharded cluster is mostly built from existing
components: pg_pathman + postgres_fdw + logical replication.
So we are just trying to combine them all into some integrated system.
But currently the most obscure point is logical replication.

And the main goal of my e-mail was to learn the opinion of the authors and
users of LR on whether it is a good idea to use LR to provide fault tolerance
in a sharded cluster,
or whether some other approaches, for example sharding with redundancy or
using streaming replication, are preferable.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





[HACKERS] Slow synchronous logical replication

2017-10-07 Thread konstantin knizhnik
In our sharded cluster project we are trying to use logical replication for
providing HA (maintaining redundant shard copies).
Using asynchronous logical replication does not make much sense in the context
of HA. This is why we try to use synchronous logical replication.
Unfortunately it shows very bad performance. With 50 shards and redundancy
level 1 (just one copy) the cluster is 20 times slower than without logical
replication.
With asynchronous replication it is "only" two times slower.

As far as I understand, the reason for such bad performance is that the
synchronous replication mechanism was originally developed for streaming
replication, where all replicas have the same content and LSNs. When it is
used for logical replication, it behaves very inefficiently. A commit has to
wait for confirmations from all receivers mentioned in the
"synchronous_standby_names" list. So we are waiting not only for our own
single logical replication standby, but for all other standbys as well. The
number of synchronous standbys is equal to the number of shards divided by the
number of nodes. To provide uniform distribution the number of shards should
be >> the number of nodes; for example, for 10 nodes we usually create 100
shards (i.e. 100 / 10 = 10 synchronous standbys listed at each node). As a
result we get awful performance, and blocking of any replication channel
blocks all backends.

So my question is whether my understanding is correct and synchronous logical
replication can not be used efficiently in such a manner.
If so, the next question is how difficult it would be to make the synchronous
replication mechanism more efficient for logical replication, and whether
there are some plans to work in this direction.



Re: [HACKERS] Issue with logical replication: MyPgXact->xmin already is valid

2017-10-07 Thread Konstantin Knizhnik

On 10/07/2017 04:26 PM, Petr Jelinek wrote:


Hmm so you start transaction (you have to when running
CREATE_REPLICATION_SLOT with USE_SNAPSHOT parameter). And while the slot
is being created the config is reloaded. And since now you are in
transaction the tsearch hook for GUC processing tries to access catalogs
which sets the xmin for the transaction.


Actually this slot is implicitly created by LogicalRepSyncTableStart to perform 
initial data sync.



That's not good, but I can't really say I have idea about what to do
with it other than to set some kind of flag saying that logical decoding
snapshot is being built and using that to skip catalog access which does
not seem very pretty.


It is not quite clear from the comment why it is not possible to overwrite
MyPgXact->xmin or just use the existing value.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Issue with logical replication: MyPgXact->xmin already is valid

2017-10-06 Thread Konstantin Knizhnik



On 06.10.2017 15:29, Petr Jelinek wrote:

On 06/10/17 12:16, Konstantin Knizhnik wrote:

When creating logical replication slots we quite often get the following
error:

ERROR:  cannot build an initial slot snapshot when MyPgXact->xmin
already is valid

which cause restart of WAL sender.
The comment to this line doesn't clarify much:

 /* so we don't overwrite the existing value */
 if (TransactionIdIsValid(MyPgXact->xmin))
 elog(ERROR, "cannot build an initial slot snapshot when
MyPgXact->xmin already is valid");
I wonder if it is normal situation or something goes wrong?


Hi,

no it's not normal situation, it seems you are doing something that
assigns xid before you run the CREATE_REPLICATION_SLOT command on that
connection.


I am not doing anything special in this connection: it is the WAL sender
executing the CREATE_REPLICATION_SLOT replication command.
Please look at the stack trace in my original e-mail. It shows who is setting
MyPgXact->xmin and when.

It is GetSnapshotData, called because of reloading of configuration settings.
And the configuration settings are reloaded because our application is
updating "synchronous_standby_names".



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





[HACKERS] Issue with logical replication: MyPgXact->xmin already is valid

2017-10-06 Thread Konstantin Knizhnik
When creating logical replication slots we quite often get the following 
error:


ERROR:  cannot build an initial slot snapshot when MyPgXact->xmin 
already is valid


which causes a restart of the WAL sender.
The comment on this line doesn't clarify much:

/* so we don't overwrite the existing value */
if (TransactionIdIsValid(MyPgXact->xmin))
elog(ERROR, "cannot build an initial slot snapshot when 
MyPgXact->xmin already is valid");



I have checked that MyPgXact->xmin is set by GetSnapshotData:

#3  0x0086025e in GetSnapshotData (snapshot=0xf28040 
) at procarray.c:1714
#4  0x00a48523 in GetNonHistoricCatalogSnapshot (relid=2615) at 
snapmgr.c:479

#5  0x00a484d3 in GetCatalogSnapshot (relid=2615) at snapmgr.c:452
#6  0x004f15bf in systable_beginscan (heapRelation=0x256bdb0, 
indexId=2684, indexOK=1 '\001', snapshot=0x0, nkeys=1, 
key=0x7ffdf633c770) at genam.c:353
#7  0x009d676e in SearchCatCache (cache=0x25230a8, v1=39586984, 
v2=0, v3=0, v4=0) at catcache.c:1169
#8  0x009ec13d in SearchSysCache (cacheId=35, key1=39586984, 
key2=0, key3=0, key4=0) at syscache.c:1109
#9  0x009ec259 in GetSysCacheOid (cacheId=35, key1=39586984, 
key2=0, key3=0, key4=0) at syscache.c:1187
#10 0x00574b85 in get_namespace_oid (nspname=0x25c0ca8 
"pg_catalog", missing_ok=1 '\001') at namespace.c:3009
#11 0x00574886 in LookupExplicitNamespace (nspname=0x25c0ca8 
"pg_catalog", missing_ok=1 '\001') at namespace.c:2871
#12 0x0057437d in get_ts_config_oid (names=0x25c2438, 
missing_ok=1 '\001') at namespace.c:2653
#13 0x009f56ca in check_TSCurrentConfig (newval=0x7ffdf633cb90, 
extra=0x7ffdf633cba0, source=PGC_S_FILE) at ts_cache.c:600
#14 0x00a1bbdc in call_string_check_hook (conf=0xf26870 
<ConfigureNamesString+6992>, newval=0x7ffdf633cb90, 
extra=0x7ffdf633cba0, source=PGC_S_FILE, elevel=12) at guc.c:9912
#15 0x00a14420 in parse_and_validate_value (record=0xf26870 
<ConfigureNamesString+6992>, name=0x25c0620 
"default_text_search_config", value=0x25c0658 "pg_catalog.english", 
source=PGC_S_FILE, elevel=12,

newval=0x7ffdf633cb90, newextra=0x7ffdf633cba0) at guc.c:5840
#16 0x00a15543 in set_config_option (name=0x25c0620 
"default_text_search_config", value=0x25c0658 "pg_catalog.english", 
context=PGC_SIGHUP, source=PGC_S_FILE, action=GUC_ACTION_SET, 
changeVal=1 '\001', elevel=12,

is_reload=0 '\000') at guc.c:6442
#17 0x00a1eb5c in ProcessConfigFileInternal (context=PGC_SIGHUP, 
applySettings=1 '\001', elevel=13) at guc-file.l:439
#18 0x00a1e4ee in ProcessConfigFile (context=PGC_SIGHUP) at 
guc-file.l:155
#19 0x0082433c in WalSndWaitForWal (loc=25991904) at 
walsender.c:1309
#20 0x00823423 in logical_read_xlog_page (state=0x25a4f10, 
targetPagePtr=25985024, reqLen=6880, targetRecPtr=25991880, 
cur_page=0x25a6c60 "\227\320\005", pageTLI=0x25a57bc) at walsender.c:761
#21 0x00558c3d in ReadPageInternal (state=0x25a4f10, 
pageptr=25985024, reqLen=6880) at xlogreader.c:556
#22 0x00558405 in XLogReadRecord (state=0x25a4f10, 
RecPtr=25991880, errormsg=0x7ffdf633cea8) at xlogreader.c:255
#23 0x0080dda6 in DecodingContextFindStartpoint (ctx=0x25a4c50) 
at logical.c:450
#24 0x00823a0c in CreateReplicationSlot (cmd=0x24dc398) at 
walsender.c:934
#25 0x008247e4 in exec_replication_command (cmd_string=0x254e8f0 
"CREATE_REPLICATION_SLOT \"shardman_copy_t_10_3_4_17307_sync_17302\" 
TEMPORARY LOGICAL pgoutput USE_SNAPSHOT") at walsender.c:1515
#26 0x0088eccc in PostgresMain (argc=1, argv=0x24f0b28, 
dbname=0x24f0948 "postgres", username=0x24bf7f0 "knizhnik") at 
postgres.c:4086

#27 0x007ef9e2 in BackendRun (port=0x24dee00) at postmaster.c:4357
#28 0x007ef10c in BackendStartup (port=0x24dee00) at 
postmaster.c:4029

#29 0x007eb6cc in ServerLoop () at postmaster.c:1753
#30 0x007eacb8 in PostmasterMain (argc=3, argv=0x24bd660) at 
postmaster.c:1361

#31 0x00728593 in main (argc=3, argv=0x24bd660) at main.c:228


I wonder whether this is a normal situation or whether something is going wrong?





--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





[HACKERS] Issues with logical replication

2017-10-02 Thread Konstantin Knizhnik

I have faced two issues with logical replication.
Repro scenario:

1. Start two Postgres instances (I start both on the local machine).
2. Initialize the pgbench tables at both instances:
   pgbench -i -s 10 postgres
3. Create a publication for the pgbench_accounts table at one node:
   create publication pub for table pgbench_accounts;
4. Create a subscription at the other node:
   delete from pgbench_accounts;
   CREATE SUBSCRIPTION sub connection 'dbname=postgres host=localhost port=5432 sslmode=disable' publication pub;
   CREATE OR REPLACE FUNCTION replication_trigger_proc() RETURNS TRIGGER AS $$
   BEGIN
   --  RETURN NEW;
   END $$ LANGUAGE plpgsql;
   CREATE TRIGGER replication_trigger BEFORE INSERT OR UPDATE OR DELETE
   ON pgbench_accounts FOR EACH ROW EXECUTE PROCEDURE replication_trigger_proc();
   ALTER TABLE pgbench_accounts ENABLE ALWAYS TRIGGER replication_trigger;
5. Start pgbench at the primary node:
   pgbench -T 1000 -P 2 -c 10 postgres


Please notice the commented-out "RETURN NEW" statement. Invocation of this
function causes an error, and in the log we see repeated messages:


2017-10-02 16:47:46.764 MSK [32129] LOG:  logical replication table 
synchronization worker for subscription "sub", table "pgbench_accounts" 
has started
2017-10-02 16:47:46.771 MSK [32129] ERROR:  control reached end of 
trigger procedure without RETURN
2017-10-02 16:47:46.771 MSK [32129] CONTEXT:  PL/pgSQL function 
replication_trigger_proc()

COPY pgbench_accounts, line 1: "110 "
2017-10-02 16:47:46.772 MSK [28020] LOG:  worker process: logical 
replication worker for subscription 17264 sync 17251 (PID 32129) exited 
with exit code 1

...


After a few minutes of work the primary node (the one with the publication)
crashes with the following stack trace:


Program terminated with signal SIGABRT, Aborted.
#0  0x7f3608f8ec37 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56

56../nptl/sysdeps/unix/sysv/linux/raise.c: No such file or directory.
(gdb) bt
#0  0x7f3608f8ec37 in __GI_raise (sig=sig@entry=6) at 
../nptl/sysdeps/unix/sysv/linux/raise.c:56

#1  0x7f3608f92028 in __GI_abort () at abort.c:89
#2  0x009f5740 in ExceptionalCondition (conditionName=0xbf6b30 
"!(((xid) != ((TransactionId) 0)))",
errorType=0xbf69af "FailedAssertion", fileName=0xbf69a8 "lmgr.c", 
lineNumber=582) at assert.c:54
#3  0x0086ac1d in XactLockTableWait (xid=0, rel=0x0, ctid=0x0, 
oper=XLTW_None) at lmgr.c:582
#4  0x0081c9f7 in SnapBuildWaitSnapshot (running=0x2749198, 
cutoff=898498) at snapbuild.c:1400
#5  0x0081c7c7 in SnapBuildFindSnapshot (builder=0x2807910, 
lsn=798477760, running=0x2749198) at snapbuild.c:1311
#6  0x0081c339 in SnapBuildProcessRunningXacts 
(builder=0x2807910, lsn=798477760, running=0x2749198) at snapbuild.c:1106
#7  0x0080a1c7 in DecodeStandbyOp (ctx=0x27ef870, 
buf=0x7ffd301858d0) at decode.c:314
#8  0x00809ce1 in LogicalDecodingProcessRecord (ctx=0x27ef870, 
record=0x27efb30) at decode.c:117
#9  0x0080ddf9 in DecodingContextFindStartpoint (ctx=0x27ef870) 
at logical.c:458
#10 0x00823968 in CreateReplicationSlot (cmd=0x27483a8) at 
walsender.c:934

#11 0x008246ee in exec_replication_command (
cmd_string=0x27b9520 "CREATE_REPLICATION_SLOT 
\"sub_17264_sync_17251\" TEMPORARY LOGICAL pgoutput USE_SNAPSHOT") at 
walsender.c:1511
#12 0x0088eb44 in PostgresMain (argc=1, argv=0x275b738, 
dbname=0x275b578 "postgres", username=0x272b800 "knizhnik")

at postgres.c:4086
#13 0x007ef9a1 in BackendRun (port=0x27532a0) at postmaster.c:4357
#14 0x007ef0cb in BackendStartup (port=0x27532a0) at 
postmaster.c:4029

#15 0x007eb68b in ServerLoop () at postmaster.c:1753
#16 0x007eac77 in PostmasterMain (argc=3, argv=0x2729670) at 
postmaster.c:1361

#17 0x00728552 in main (argc=3, argv=0x2729670) at main.c:228


Now fix the trigger function:
CREATE OR REPLACE FUNCTION replication_trigger_proc() RETURNS TRIGGER AS $$
BEGIN
  RETURN NEW;
END $$ LANGUAGE plpgsql;

And manually perform two updates at the master inside one transaction:

postgres=# begin;
BEGIN
postgres=# update pgbench_accounts set abalance=abalance+1 where aid=26;
UPDATE 1
postgres=# update pgbench_accounts set abalance=abalance-1 where aid=26;
UPDATE 1
postgres=# commit;


and in the replica's log we see:
2017-10-02 18:40:26.094 MSK [2954] LOG:  logical replication apply 
worker for subscription "sub" has started

2017-10-02 18:40:26.101 MSK [2954] ERROR:  attempted to lock invisible tuple
2017-10-02 18:40:26.102 MSK [2882] LOG:  worker process: logical 
replication worker for subscription 16403 (PID 2954) exited with exit code 1


The error happens in trigger.c:
#3  0x0069bddb in GetTupleForTrigger (estate=0x2e36b50, 
epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tid=0x2dd08ac,
lockmode=LockTupleNoKeyExclusive, newSlot=0x7ffc4420ec40) at 
trigger.c:3103
#4  0x0069b259 in ExecBRUpdateTriggers 

[HACKERS] Issues with logical replication

2017-10-02 Thread Konstantin Knizhnik
ckTupleNoKeyExclusive, newSlot=0x7ffc4420ec40) at 
trigger.c:3103
#4  0x0069b259 in ExecBRUpdateTriggers (estate=0x2e36b50, 
epqstate=0x7ffc4420eda0, relinfo=0x2dcfe90, tupleid=0x2dd08ac,

fdw_trigtuple=0x0, slot=0x2dd0240) at trigger.c:2748
#5  0x006d2395 in ExecSimpleRelationUpdate (estate=0x2e36b50, 
epqstate=0x7ffc4420eda0, searchslot=0x2dd0358, slot=0x2dd0240)

at execReplication.c:461
#6  0x00820894 in apply_handle_update (s=0x7ffc442163b0) at 
worker.c:736

#7  0x00820d83 in apply_dispatch (s=0x7ffc442163b0) at worker.c:892
#8  0x0082121b in LogicalRepApplyLoop (last_received=2225693736) 
at worker.c:1095

#9  0x0082219d in ApplyWorkerMain (main_arg=0) at worker.c:1647
#10 0x007dd496 in StartBackgroundWorker () at bgworker.c:835
#11 0x007f0889 in do_start_bgworker (rw=0x2d75f50) at 
postmaster.c:5680

#12 0x007f0bc3 in maybe_start_bgworkers () at postmaster.c:5884
#13 0x007efbeb in sigusr1_handler (postgres_signal_arg=10) at 
postmaster.c:5073

#14 
#15 0x7fbe83a054a3 in __select_nocancel () at 
../sysdeps/unix/syscall-template.S:81

#16 0x007eb517 in ServerLoop () at postmaster.c:1717
#17 0x007eac48 in PostmasterMain (argc=3, argv=0x2d4e660) at 
postmaster.c:1361

#18 0x00728523 in main (argc=3, argv=0x2d4e660) at main.c:228

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] alter server for foreign table

2017-09-30 Thread konstantin knizhnik

On Sep 30, 2017, at 10:58 PM, Andrew Dunstan wrote:

> 
> 
> On 09/30/2017 05:14 AM, Derry Hamilton wrote:
>> Just to say, yes, this would be handy. I've been using a variant of
>> that hack on reporting servers, while migrating systems from
>> proprietary databases.  It behaves quite gracefully when there are
>> incompatible options, and it fixes up properly with DROPs as the first
>> options.
>> 
>> 
> 
> 
> I assume the proposal is to allow changing to a different server using
> the same FDW. I can see all sorts of odd things happening if we allow
> changing to a server of a different FDW.

Actually what I need is to handle moving a shard (partition) to another node.
I cannot use "alter server" to change the connection string, because this server 
is used for many shards located at this node.
And I do not want to drop and recreate the foreign table, because that seems to be 
very complicated. 

So I need to switch the foreign table to another server of the same FDW; a sketch 
of what I have in mind is below.
But in theory I can imagine a situation where a partition is moved to another 
database (from MySQL to Postgres, for example). In this case we need to change the 
FDW as well.
What can be wrong with changing the FDW? All checks that the FDW is consistent with 
the foreign table can be redone...
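
Just to illustrate what I mean (purely hypothetical syntax, not supported today; 
the table and server names are made up), I would like to be able to write 
something like:

-- hypothetical: move foreign table orders_0 to server shard2 (same FDW)
ALTER FOREIGN TABLE orders_0 SERVER shard2;

with the effect being essentially a change of pg_foreign_table.ftserver plus a 
revalidation of the table's options against the new server.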






Re: [HACKERS] pg_prepared_xact_status

2017-09-30 Thread konstantin knizhnik

On Sep 29, 2017, at 11:33 PM, Robert Haas wrote:

> On Fri, Sep 29, 2017 at 4:22 AM, Craig Ringer  wrote:
>> This sounds kind-of like 1/4 of a distributed transaction resolver, without
>> a way to make it reliable enough to build the other 3/4.
>> 
>> To make this practical I think you'd need a way to retain historic GIDs +
>> their outcomes, and a way to prune that information only once an application
>> knows all interested participants consider the transaction finalized.
>> 
>> I'd be all for a global xid status function if there were a good way to
>> manage resource retention. But it's fuzzy enough for txid_status, which
>> isn't really making any firm promises, just improving on the prior state of
>> "no f'ing idea what happened to that tx, sorry". 2PC consumers will want
>> absolute guarantees, not "dunno, sorry".
> 
> Very well said, and I agree.

txid_status() is also not always able to return the status of a transaction (if 
wraparound happens).
But it is still considered one of the key features of 10 (transaction 
traceability...).

> 
> I think the approach this patch takes is a non-starter for exactly the
> reasons you have stated.

Actually I do not propose pg_prepared_xact_status as a mechanism for constructing a 
GTM or some other complete 2PC infrastructure.
It is just a simple function, using the existing PostgreSQL WAL iteration utilities, 
that simplifies extraction of information about 2PC transactions.
The same thing can be done manually using pg_waldump, but that is very inconvenient.
So I do not see any trouble caused by adding this function, and it can really 
be helpful for a DBA in some cases.





[HACKERS] alter server for foreign table

2017-09-29 Thread Konstantin Knizhnik
According to the PostgreSQL documentation it is not possible to alter the server of 
a foreign table:


https://www.postgresql.org/docs/10/static/sql-alterforeigntable.html

But stackoverflow suggests the following hack of directly updating 
pg_foreign_table:

https://stackoverflow.com/questions/37388756/may-i-alter-server-for-foreign-table

I wonder how safe it is, and if it is so simple, why is it not supported in the 
ALTER FOREIGN TABLE statement? (A sketch of the hack is below.)
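
For reference, the hack discussed there boils down to something like this (a sketch 
only; the table and server names are made up, and I am not claiming it is safe):

-- switch foreign table orders_0 from its current server to shard2
UPDATE pg_foreign_table
   SET ftserver = (SELECT oid FROM pg_foreign_server WHERE srvname = 'shard2')
 WHERE ftrelid = 'orders_0'::regclass;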


Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] pg_prepared_xact_status

2017-09-29 Thread Konstantin Knizhnik



On 29.09.2017 11:27, Craig Ringer wrote:
On 29 September 2017 at 15:57, Konstantin Knizhnik 
<k.knizh...@postgrespro.ru <mailto:k.knizh...@postgrespro.ru>> wrote:


So you are saying that Postgresql 2PC mechanism is not complete
and user needs to maintain some extra information to make it work?


No, it provides what's needed for an implementation of what in XA 
terms is a local resource manager (LRM). What it does not provide is 
infrastructure to make postgres its self into a global transaction 
manager (GTM) for co-ordinating multiple LRMs.


It sounds like you're trying to build a GTM using PostgreSQL's 
existing LRM book-keeping as your authorative data store, right?


Not exactly. I am trying to add 2PC to our pg_shardman: a combination of 
pg_pathman + postgres_fdw + logical replication, which should provide HA 
and write scalability.
This architecture definitely does not assume the presence of a GTM. Most 
transactions are expected to be local (involving only one node), and the 
number of participants of a distributed transaction is expected to be much 
smaller than the total number of nodes (usually 2). So we need to perform 
2PC without a GTM.



The problems with 2PC arrive when coordinator node is not
available but is expected to be recovered in future.
In this case we may have not enough information to make a decision
whether to abort or commit prepared transaction.
But it is a different story. We need to use 3PC or some other
protocol to prevent such situation.


In that case the node sits and waits patiently for the GTM (or in more 
complex architectures, *a* valid voting quorum of GTMs) to be 
reachable again. Likely using a protocol like Raft, Paxos, 3PC etc to 
co-ordinate.


It can't do anything else, since if it unilaterally commits or rolls 
back it might later find out that the nodes on the other side of the 
network partition or whatever made the opposite decision and, boom!


Ok, I am not sure whether pg_prepared_xact_status can really be useful or not.
I agree with you that if we are implementing distributed transactions on 
top of Postgres, then we need some better mechanism to determine 
transaction state.
But a lot of people are using 2PC without a GTM or anything else. For 
example, many Java ORMs use 2PC for their transactions.
I think it is better to give the DBA or programmer some way to determine 
the status of such a transaction by GID (which is usually unique and 
known), as far as this information is available in the Postgres WAL.

In any case, I attached a slightly improved version of this function, which 
traverses the log not only since the last checkpoint, but also tries to iterate 
backward, inspecting previous WAL segments.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ae83291..fbf91f5 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -84,6 +84,7 @@
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "access/xlog_internal.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "access/xlogreader.h"
@@ -2408,3 +2409,106 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 
 	return;
 }
+
+Datum
+pg_prepared_xact_status(PG_FUNCTION_ARGS)
+{
+char const* gid = PG_GETARG_CSTRING(0);
+	XLogRecord *record;
+	XLogReaderState *xlogreader;
+	char	   *errormsg;
+	XLogRecPtr start_lsn;
+	XLogRecPtr lsn;
+	char const* xact_status = "unknown";
+	bool done = false;
+	TimeLineID timeline;
+	TransactionId xid = InvalidTransactionId;
+	XLogRecPtr wal_end = GetFlushRecPtr();
+
+	GetOldestRestartPoint(&start_lsn, &timeline);
+
+	xlogreader = XLogReaderAllocate(&read_local_xlog_page, NULL);
+	if (!xlogreader)
+		ereport(ERROR,
+(errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+	while (true)
+	{
+		lsn = start_lsn;
+		do
+		{
+			record = XLogReadRecord(xlogreader, lsn, &errormsg);
+			if (record == NULL)
+break;
+			lsn = InvalidXLogRecPtr; /* continue after the record */
+			if (XLogRecGetRmid(xlogreader) == RM_XACT_ID)
+			{
+uint32 info = XLogRecGetInfo(xlogreader);
+switch (info & XLOG_XACT_OPMASK)
+{
+case XLOG_XACT_PREPARE:
+	{
+		TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *)XLogRecGetData(xlogreader);
+		char* xact_gid = (char*)hdr + MAXALIGN(sizeof(TwoPhaseFileHeader));
+ 		if (strcmp(xact_gid, gid) == 0)
+		{
+			xid = hdr->xid;
+			xact_status = "prepared";
+		}
+		break;
+	}
+case XLOG_XACT_COMMIT_PREPARED:
+	{
+		xl_xact_commit *xlrec;
+		xl_xact_parsed_commit parsed;
+
+		xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);

Re: [HACKERS] Index expression syntax

2017-09-29 Thread Konstantin Knizhnik



On 29.09.2017 11:03, Marko Tiikkaja wrote:
On Fri, Sep 29, 2017 at 9:31 AM, Konstantin Knizhnik 
<k.knizh...@postgrespro.ru <mailto:k.knizh...@postgrespro.ru>> wrote:


I wonder why syntax error is produced in this case:

postgres=# create index metaindex on foo using
gin(to_tsvector('english', x)||to_tsvector('english',y));
ERROR:  syntax error at or near "||"
LINE 1: ...taindex on foo using gin(to_tsvector('english',
x)||to_tsvec...
^
[ .. ]

expression:
An expression based on one or more columns of the table. The
expression usually must be written with surrounding
parentheses, as shown in the syntax. However, the parentheses
can be omitted if the expression has the form of a function call.

So the documentation states that it is sometimes possible to avoid
parentheses, but it is unclear why I have to use double
parentheses...
I think that either the grammar should be fixed or the documentation
should be updated.


Your expression is clearly not a function call, it's a concatenation 
of two of them.  The documentation seems perfectly accurate to me?



Oh, sorry!
You are right. I just did not notice the extra parentheses required around 
expressions in the CREATE INDEX syntax.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] pg_prepared_xact_status

2017-09-29 Thread Konstantin Knizhnik



On 29.09.2017 06:02, Michael Paquier wrote:

On Fri, Sep 29, 2017 at 1:53 AM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

In Postgres 10 we have txid_status function which returns status of
transaction by XID.
I wonder if it will be also useful to have similar function for 2PC
transactions which can operate with GID?
pg_prepared_xacts view allows to get information about prepared transaction
which are not yet committed or aborted.
But if transaction is committed, then there is no way now to find status of
this transaction.

But you need to keep track of the transaction XID of each transaction
happening on the remote nodes which are part of a global 2PC
transaction, no?


Why? We have the GID, which allows us to identify the 2PC transaction at all 
participant nodes.



  If you have this data at hand using txid_status is
enough to guess if a prepared transaction has been marked as committed
or prepared. And it seems to me that tracking those XIDs is mandatory
anyway for other consistency checks.


It is certainly possible to maintain information about the XIDs involved in a 
2PC transaction, and it can really simplify recovery. But I wonder why it is 
mandatory? Keeping track of XIDs requires some persistent storage.
So you are saying that the PostgreSQL 2PC mechanism is not complete and the user 
needs to maintain some extra information to make it work?


Also, I think that it is not necessary to know the XIDs of all local 
transactions involved in 2PC. It is enough to know the XID of the coordinator's 
transaction.
It can be included in the GID (as I proposed at the end of my mail). In this 
case txid_status can be used at the coordinator to check the global status of 
the 2PC transaction.


The idea of the pg_prepared_xact_status function is that it allows getting the 
status of a 2PC transaction without any additional requirements on GIDs 
and without any other information about the participants of the 2PC transaction.






If crash happen during 2PC commit, then transaction can be in prepared state
at some nodes and committed/aborted at  other nodes.

Handling inconsistencies here is a tricky problem, particularly if a
given transaction is marked as both committed and aborted on many
nodes.

How can that be?
An abort of the transaction can happen only at the prepare stage.
In this case the coordinator should roll back the transaction everywhere.
There should be no committed transactions in this case.

The following situations are possible (a rough resolution sketch follows the list):
1. The transaction is prepared at some nodes and no information about it is 
available at other nodes. It means that the crash happened at the prepare stage 
and the transaction was not able to complete prepare at all nodes. It is safe to 
abort the transaction in this case.
2. The transaction is prepared at some nodes and aborted at other nodes. 
The same as 1 - we can safely abort the transaction everywhere.
3. The transaction is prepared at all nodes. It means that the coordinator 
crashed before sending the commit message. It is safe to commit the transaction 
everywhere.
4. The transaction is prepared at some nodes and committed at other nodes. 
The commit message was not delivered to or processed by the other nodes before 
the crash. It is safe to commit the transaction at all nodes.
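
For illustration, resolution at each surviving participant could then look roughly 
like this (a sketch only; it assumes the SQL-level wrapper of the proposed 
pg_prepared_xact_status() function takes the GID as text, and 'gtx-42' is just an 
example GID):

-- list in-doubt transactions on this node together with their status in the WAL
SELECT gid, pg_prepared_xact_status(gid) FROM pg_prepared_xacts;

-- cases 3 and 4: prepare reached all nodes, or some node already committed
COMMIT PREPARED 'gtx-42';

-- cases 1 and 2: prepare did not reach all nodes, or some node aborted it
ROLLBACK PREPARED 'gtx-42';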


The problems with 2PC arise when the coordinator node is not available but 
is expected to be recovered in the future.
In this case we may not have enough information to decide 
whether to abort or commit a prepared transaction.
But that is a different story. We need to use 3PC or some other protocol 
to prevent such a situation.



The only way that I could think of would be to perform PITR to
recover from the inconsistent states. So that's not an easy problem,
becoming even more tricky if more than one transaction is involved and
many transactions are inter-dependent across nodes.


3. Same GID can be reused multiple times. In this case
pg_prepared_xact_status function will return incorrect result, because it
will return information about first global transaction with such GID after
checkpoint and not the recent one.

Yeah, this argument alone is why I think that this is a dead-end approach.


Maybe. But I think that in most real systems unique GIDs are generated, 
because otherwise it is difficult to address concurrency and recovery 
issues.





There is actually alternative approach to recovery of 2PC transactions. We
can include coordinator identifier in GID (we can use GetSystemIdentifier()
to identify coordinator's node)
and XID of coordinator's transaction. In this case we can use txid_status()
to check status of transaction at coordinator. It eliminates need to scan
WAL to determine status of prepared transaction.

+GetOldestRestartPoint(&lsn, &timeline);
+
+xlogreader = XLogReaderAllocate(&read_local_xlog_page, NULL);
+if (!xlogreader)
So you scan a bunch of records for each GID? This is really costly. I
think that you would have an easier life by tracking the XID of each
transaction involved remotely. In Postgres-XL, this is not a problem
as XIDs are assigned globally and consistently. But you woul

[HACKERS] Index expression syntax

2017-09-29 Thread Konstantin Knizhnik

I wonder why syntax error is produced in this case:

postgres=# create index metaindex on foo using 
gin(to_tsvector('english', x)||to_tsvector('english',y));

ERROR:  syntax error at or near "||"
LINE 1: ...taindex on foo using gin(to_tsvector('english', x)||to_tsvec...
 ^
The error can be eliminated if extra surrounding parentheses are added:

postgres=# create index metaindex on foo using 
gin((to_tsvector('english', x)||to_tsvector('english',y)));

CREATE INDEX

The PostgreSQL documentation says:

CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] name ]
ON table_name [ USING method ]
( { column_name | ( expression ) } [ COLLATE collation ]
[ opclass ] [ ASC | DESC ] [ NULLS { FIRST | LAST } ] [, ...] )
[ WITH ( storage_parameter = value [, ... ] ) ]
[ TABLESPACE tablespace_name ]
[ WHERE predicate ]

expression:
   An expression based on one or more columns of the table. The
   expression usually must be written with surrounding parentheses, as
   shown in the syntax. However, the parentheses can be omitted if the
   expression has the form of a function call.


So the documentation states that it is sometimes possible to avoid 
parentheses, but it is unclear why I have to use double parentheses...
I think that either the grammar should be fixed or the documentation should 
be updated.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Surjective functional indexes

2017-09-28 Thread Konstantin Knizhnik

On 09/28/2017 10:10 PM, Robert Haas wrote:

On Wed, Sep 13, 2017 at 7:00 AM, Simon Riggs <si...@2ndquadrant.com> wrote:

If we do have an option it won't be using fancy mathematical
terminology at all, it would be described in terms of its function,
e.g. recheck_on_update

+1.


I have nothing against renaming the "projection" option to "recheck_on_update" or 
whatever else is suggested.
Just let me know the best version. From my point of view, "recheck_on_update" is 
too verbose and still not self-explanatory (to understand the meaning of this 
option it is necessary to understand how heap_update works). 
"projection"/"non-injective"/... are more declarative notions, describing a 
characteristic of the index, while "recheck_on_update" is a more procedural 
notion, describing the behavior of heap_update.





Yes, I'd rather not have an option at all, just some simple code with
useful effect, like we have in many other places.

I think the question we need to be able to answer is: What is the
probability that an update that would otherwise be non-HOT can be made
into a HOT update by performing a recheck to see whether the value has
changed?  It doesn't seem easy to figure that out from any of the
statistics we have available today or could easily get, because it
depends not only on the behavior of the expression which appears in
the index definition but also on the application behavior.  For
example, consider a JSON blob representing a bank account.
b->'balance' is likely to change most of the time, but
b->'account_holder_name' only rarely.  That's going to be hard for an
automated system to determine.

We should clearly check as many of the other criteria for a HOT update
as possible before performing a recheck of this type, so that it only
gets performed when it might help.  For example, if column a is
indexed and b->'foo' is indexed, there's no point in checking whether
b->'foo' has changed if we know that a has changed.  I don't know
whether it would be feasible to postpone deciding whether to do a
recheck until after we've figured out whether the page seems to
contain enough free space to allow a HOT update.

Turning non-HOT updates into HOT updates is really good, so it seems
likely that the rechecks will often be worthwhile.  If we avoid a HOT
update in 25% of cases, that's probably easily worth the CPU overhead
of a recheck assuming the function isn't something ridiculously
expensive to compute; the extra CPU cost will be repaid by reduced
bloat.  However, if we avoid a HOT update only one time in a million,
it's probably not worth the cost of recomputing the expression the
other 999,999 times.  I wonder where the crossover point is -- it
seems like something that could be figured out by benchmarking.

While I agree that it would be nice to have this be a completely
automatic determination, I am not sure that will be practical.  I
oppose overloading some other marker (like function_cost>1) for
this; that's too magical.


I almost agree with you.
Just a few remarks: indexes are rarely created on frequently changed attributes 
like b->'balance'.
So with a properly designed database schema we can expect that most updates are 
HOT updates, i.e. do not actually affect any index.
But certainly different attributes may have different probabilities of being 
updated.
Unfortunately, before performing the check we do not know which attribute of the 
JSON field (or of any other field used in the indexed expression) has changed.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





[HACKERS] pg_prepared_xact_status

2017-09-28 Thread Konstantin Knizhnik

Hi,

In Postgres 10 we have txid_status function which returns status of 
transaction by XID.
I wonder if it will be also useful to have similar function for 2PC 
transactions which can operate with GID?
The pg_prepared_xacts view allows getting information about prepared 
transactions which are not yet committed or aborted.
But once a transaction is committed, there is currently no way to find the 
status of this transaction.


If a crash happens during the 2PC commit, then the transaction can be in the 
prepared state at some nodes and committed/aborted at other nodes.
Using the pg_prepared_xacts view a DBA can find global transactions which 
were not completed.
But there is no way (except pg_waldump) to determine whether such a 
transaction needs to be committed or aborted at the rest of the nodes. 
The intended use of the proposed function is sketched below.
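
Something like this (a sketch; the SQL-level signature is an assumption, since only 
the C function is shown in the patch below, and the status strings come from its 
implementation):

SELECT pg_prepared_xact_status('my_gid');
-- returns 'prepared', 'committed', 'aborted' or 'unknown'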


Attached please find a small patch with the pg_prepared_xact_status function.
This function has the following obvious drawbacks:

1. It is not able to extract information about prepared transactions preceding 
the last checkpoint. That seems to be enough to perform recovery in case of 
failure, unless a checkpoint happened just before the failure or there is a 
large gap between prepare and commit.
The only workaround I see at this moment is to pass to this function an 
optional parameter with the start position in the WAL.

Any better solution?

2. On systems with a huge workload the interval between checkpoints may be 
very large. In this case we have to scan a large amount of WAL data to 
locate our transaction, which may take a significant amount of time.
We could traverse the WAL in a smarter way, starting from the last segment, 
assuming that the in-doubt transaction was prepared just before the crash.
But that significantly complicates the traversal logic.

3. The same GID can be reused multiple times. In this case the 
pg_prepared_xact_status function will return an incorrect result, because 
it will return information about the first global transaction with such a GID 
after the checkpoint and not the most recent one.



There is actually an alternative approach to recovery of 2PC transactions. 
We can include the coordinator identifier (we can use GetSystemIdentifier() 
to identify the coordinator's node) and the XID of the coordinator's 
transaction in the GID. In this case we can use txid_status() to check the 
status of the transaction at the coordinator, which eliminates the need to 
scan the WAL to determine the status of a prepared transaction. A sketch of 
this scheme is shown below.
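
A minimal sketch of this scheme (the GID format and names are assumptions, not 
part of the patch):

-- on the coordinator, when starting the distributed transaction:
SELECT format('gtx-%s-%s',
              (SELECT system_identifier FROM pg_control_system()),
              txid_current()) AS gid;
-- e.g. 'gtx-6434587342253288446-12345', then used in PREPARE TRANSACTION on all nodes

-- later, any node can ask the coordinator about the outcome:
SELECT txid_status(split_part('gtx-6434587342253288446-12345', '-', 3)::bigint);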


--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/transam/twophase.c b/src/backend/access/transam/twophase.c
index ae83291..be65ae7 100644
--- a/src/backend/access/transam/twophase.c
+++ b/src/backend/access/transam/twophase.c
@@ -84,6 +84,7 @@
 #include "access/twophase_rmgr.h"
 #include "access/xact.h"
 #include "access/xlog.h"
+#include "access/xlog_internal.h"
 #include "access/xloginsert.h"
 #include "access/xlogutils.h"
 #include "access/xlogreader.h"
@@ -2408,3 +2409,87 @@ PrepareRedoRemove(TransactionId xid, bool giveWarning)
 
 	return;
 }
+
+Datum
+pg_prepared_xact_status(PG_FUNCTION_ARGS)
+{
+char const* gid = PG_GETARG_CSTRING(0);
+	XLogRecord *record;
+	XLogReaderState *xlogreader;
+	char	   *errormsg;
+	XLogRecPtr lsn;
+	char const* xact_status = "unknown";
+	bool done = false;
+	TimeLineID timeline;
+	TransactionId xid = InvalidTransactionId;
+	XLogRecPtr wal_end = GetFlushRecPtr();
+
+	GetOldestRestartPoint(&lsn, &timeline);
+
+	xlogreader = XLogReaderAllocate(&read_local_xlog_page, NULL);
+	if (!xlogreader)
+		ereport(ERROR,
+(errcode(ERRCODE_OUT_OF_MEMORY),
+ errmsg("out of memory"),
+ errdetail("Failed while allocating a WAL reading processor.")));
+	do
+	{
+		record = XLogReadRecord(xlogreader, lsn, &errormsg);
+		if (record == NULL)
+			break;
+		lsn = InvalidXLogRecPtr; /* continue after the record */
+		if (XLogRecGetRmid(xlogreader) == RM_XACT_ID)
+		{
+			uint32 info = XLogRecGetInfo(xlogreader);
+			switch (info & XLOG_XACT_OPMASK)
+			{
+			case XLOG_XACT_PREPARE:
+{
+	TwoPhaseFileHeader *hdr = (TwoPhaseFileHeader *)XLogRecGetData(xlogreader);
+	char* xact_gid = (char*)hdr + MAXALIGN(sizeof(TwoPhaseFileHeader));
+	if (strcmp(xact_gid, gid) == 0)
+	{
+		xid = hdr->xid;
+		xact_status = "prepared";
+	}
+	break;
+}
+			case XLOG_XACT_COMMIT_PREPARED:
+{
+	xl_xact_commit *xlrec;
+	xl_xact_parsed_commit parsed;
+
+	xlrec = (xl_xact_commit *) XLogRecGetData(xlogreader);
+	ParseCommitRecord(info, xlrec, &parsed);
+	if (xid == parsed.twophase_xid)
+	{
+		Assert(TransactionIdIsValid(xid));
+		xact_status = "committed";
+		done = true;
+	}
+	break;
+}
+			case XLOG_XACT_ABORT_PREPARED:
+{
+	xl_xact_abort *xlrec;
+	xl_xact_parsed_abort parsed;
+
+	xlrec = (xl_xact_abort *) XLogRecGetData(xlogreader);
+	ParseAbortRecord(info, xlrec, &parsed);
+	if (xid == parsed.twophase

Re: [HACKERS] JIT compiling expressions/deform + inlining prototype v2.0

2017-09-19 Thread Konstantin Knizhnik



On 04.09.2017 23:52, Andres Freund wrote:


Hi. That piece of code isn't particularly clear (and has a bug in the
submitted version), I'm revising it.


...

Yea, I've changed that already, although it's currently added earlier,
because the alignment is needed before, to access the column correctly.
I've also made number of efficiency improvements, primarily to access
columns with an absolute offset if all preceding ones are fixed width
not null columns - that is quite noticeable performancewise.



Should I wait for a new version of your patch or continue reviewing this code?


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Surjective functional indexes

2017-09-15 Thread Konstantin Knizhnik



On 14.09.2017 18:53, Simon Riggs wrote:

It's not going to work, as already mentioned above. Those stats are at
table level and very little to do with this particular index.

But you've not commented on the design I mention that can work: index relcache.


Concerning your idea to check cost of index function: it certainly makes
sense.
The only problems: I do not understand now how to calculate this cost.
It can be easily calculated by optimizer when it is building query execution
plan.
But inside BuildIndexInfo I have just reference to Relation and have no idea
how
I can propagate here information about index expression cost from optimizer.

We could copy at create index, if we took that route. Or we can look
up the cost for the index expression and cache it.


Anyway, this is just jumping around because we still have a parameter
and the idea was to remove the parameter entirely by autotuning, which
I think is both useful and possible, just as HOT itself is autotuned.



Attached please find yet another version of the patch.
I had to rewrite it significantly, because my first attempts to add 
auto-tune were not correct.

The new patch does it in the correct way (I hope) and more efficiently.
I moved the auto-tune code from BuildIndexInfo, which is called many times, 
including from heap_update (so at least once per updated tuple), 
to RelationGetIndexAttrBitmap, which is called only when the cached 
RelationData is filled by the backend.
The problem with my original implementation of auto-tune was that when 
switching off the "projection" property of an index, it did not update the 
attribute masks calculated by RelationGetIndexAttrBitmap.

I have also added a check for the maximal cost of the indexed expression.
So now the decision whether to apply the projection index optimization 
(comparing old and new values of the indexed expression) is based on three 
sources (a standalone sketch of this decision logic follows the list):
 1. The collected HOT update statistic: we compare the number of HOT updates 
that were possible only because the projection index check showed that the 
index expression had not changed with the total number of updates affecting 
attributes used in projection indexes. If this ratio is smaller than some 
threshold (10%), then the index is considered non-projective.
 2. The calculated cost of the index expression: if it is higher than some 
threshold (1000), then the extra comparison of index expression values is 
expected to be too expensive.
 3. The "projection" index option explicitly set by the user. This setting 
overrides 1) and 2).
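
For clarity, here is a small standalone sketch of that decision logic (the names, 
the enum and the exact thresholds are illustrative only, not the code from the 
attached patch):

#include <stdbool.h>

#define PROJECTION_HIT_RATE_THRESHOLD 0.1    /* 10% from point 1 */
#define PROJECTION_MAX_EXPR_COST      1000.0 /* cost limit from point 2 */

typedef enum { PROJ_OFF, PROJ_ON, PROJ_AUTO } ProjectionOption;

/* Decide whether comparing old/new index expression values is worthwhile. */
static bool
use_projection_check(ProjectionOption option, /* explicit option, point 3 */
                     double expr_cost,        /* index expression cost, point 2 */
                     long hot_hits,           /* updates saved by the check */
                     long checked_updates)    /* updates where the check ran */
{
    if (option != PROJ_AUTO)
        return option == PROJ_ON;  /* explicit setting overrides autotuning */
    if (expr_cost > PROJECTION_MAX_EXPR_COST)
        return false;              /* the recheck itself is too expensive */
    if (checked_updates > 0 &&
        (double) hot_hits / checked_updates < PROJECTION_HIT_RATE_THRESHOLD)
        return false;              /* the recheck almost never pays off */
    return true;
}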




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 83ee7d3..52189ac 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -294,8 +294,33 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] 
 The optional WITH clause specifies storage
 parameters for the index.  Each index method has its own set of allowed
-storage parameters.  The B-tree, hash, GiST and SP-GiST index methods all
-accept this parameter:
+storage parameters. All indexes accept the following parameter:
+   
+
+   
+   
+projection
+
+ 
+   Functional index is based on on projection function: function which extract subset of its argument.
+   In mathematic such functions are called non-injective. For injective function if any attribute used in the indexed
+   expression is changed, then value of index expression is also changed. So to check that index is affected by the
+   update, it is enough to check the set of changed fields. By default this parameters is assigned true value and function is considered
+   as non-injective.
+   In this case change of any of indexed key doesn't mean that value of the function is changed. For example, for
+   the expression expression(bookinfo-'isbn') defined
+   for column of JSON type is changed only when ISBN is changed, which rarely happen. The same is true for most
+   functional indexes. For non-injective functions, Postgres compares values of indexed expression for old and updated tuple and updates
+   index only when function results are different. It allows to eliminate index update and use HOT update.
+   But there are extra evaluations of the functions. So if function is expensive or probability that change of indexed column will not effect
+   the function value is small, then marking index as projection may increase update speed.
+
+
+   
+   
+
+   
+ The B-tree, hash, GiST and SP-GiST index methods all accept this parameter:

 

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index ec10762..b73165f 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -130,6 +130,15 @@ static relopt_bool boolRelOpts[] =
 	},
 	{
 		{
+			"projection",
+			"Evaluate functional index expressi

Re: [HACKERS] Surjective functional indexes

2017-09-15 Thread Konstantin Knizhnik



On 14.09.2017 18:53, Simon Riggs wrote:



This works by looking at overall stats, and only looks at the overall
HOT %, so its too heavyweight and coarse.

I suggested storing stat info on the relcache and was expecting you
would look at how often the expression evaluates to new == old. If we
evaluate new against old many times, then if the success rate is low
we should stop attempting the comparison. (<10%?)

Another idea:
If we don't make a check when we should have done then we will get a
non-HOT update, so we waste time extra time difference between a HOT
and non-HOT update. If we check and fail we waste time take to perform
check. So the question is how expensive the check is against how
expensive a non-HOT update is. Could we simply say we don't bother to
check functions that have a cost higher than 1? So if the user
doesn't want to perform the check they can just increase the cost of
the function above the check threshold?


Attached pleased find one more patch which calculates hot update check hit
rate more precisely: I have to extended PgStat_StatTabEntry with two new
fields:
hot_update_hits and hot_update_misses.

It's not going to work, as already mentioned above. Those stats are at
table level and very little to do with this particular index.

But you've not commented on the design I mention that can work: index relcache.

Sorry, I do not completely agree with you.
Yes, certainly, whether a functional index is projective or not is a property 
of the index, not of the table.
But the decision whether a HOT update is applicable or not is made for the 
whole table - for all indexes.
If the value of just one indexed expression is changed, then we cannot 
use a HOT update and have to update all indexes.


Assume that we have a table with a "bookinfo" field of type JSONB,
and we create several functional indexes on this column: 
(bookinfo->'isbn'), (bookinfo->'title'), (bookinfo->'author'), 
(bookinfo->'rating'). (A minimal SQL illustration is given below.)
The probability that the indexed expression changes when the "bookinfo" field 
is updated may be different for each of these indexes.
But there is no sense at all in checking whether 'isbn' has changed or not 
if we have already detected that most updates change the 'rating' attribute, 
so comparing old and new values of (bookinfo->'rating') is just a waste of 
time. In that case we should not compare (bookinfo->'isbn') and the other 
indexed expressions either, because the possibility of a HOT update has 
already been rejected.
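
The SQL for this example could look like this (table and column names are just an 
illustration):

CREATE TABLE books (id serial PRIMARY KEY, bookinfo jsonb);
CREATE INDEX ON books ((bookinfo->'isbn'));
CREATE INDEX ON books ((bookinfo->'title'));
CREATE INDEX ON books ((bookinfo->'author'));
CREATE INDEX ON books ((bookinfo->'rating'));

-- touches only 'rating', but without the recheck any update of the jsonb
-- column makes all four indexes look potentially affected and disables HOT:
UPDATE books SET bookinfo = jsonb_set(bookinfo, '{rating}', '5') WHERE id = 1;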


So despite the fact that this information depends on a particular 
index, it affects the behavior of the whole table, and it is reasonable (and 
simpler) to collect it in the table's statistics.



Concerning your idea to check cost of index function: it certainly makes
sense.
The only problems: I do not understand now how to calculate this cost.
It can be easily calculated by optimizer when it is building query execution
plan.
But inside BuildIndexInfo I have just reference to Relation and have no idea
how
I can propagate here information about index expression cost from optimizer.

We could copy at create index, if we took that route. Or we can look
up the cost for the index expression and cache it.


Anyway, this is just jumping around because we still have a parameter
and the idea was to remove the parameter entirely by autotuning, which
I think is both useful and possible, just as HOT itself is autotuned.



A HOT update is in almost all cases preferable to a normal update, which has to 
update the indexes.
There can be some scenarios where HOT updates reduce the speed of some queries, 
but it is very difficult to predict such cases at the user level.

But usually the nature of an index is well known to the DBA or programmer. In 
almost all cases it is clear to the person creating a functional index whether 
it performs a projection or not, and whether comparing old/new expression values 
makes sense or is just a waste of time. We can guess it with autotuning, but such 
a decision may be wrong (simply because of the application's business logic). 
Postgres indexes already have a lot of options, and I think that the "projection" 
option (or whatever we name it) is also needed; a usage sketch is below.
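
With the attached patch the switch would look roughly like this (a sketch; the 
option name is still under discussion and the table/expressions are made up):

-- rely on the old/new expression comparison (the default with the patch):
CREATE INDEX ON books ((bookinfo->'isbn')) WITH (projection = on);

-- an essentially one-to-one expression: skip the useless recheck
CREATE INDEX ON books (lower(bookinfo->>'isbn')) WITH (projection = off);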



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Surjective functional indexes

2017-09-14 Thread Konstantin Knizhnik



On 14.09.2017 13:19, Simon Riggs wrote:

On 14 September 2017 at 10:42, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:


On 13.09.2017 14:00, Simon Riggs wrote:

On 13 September 2017 at 11:30, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:


The only reason of all this discussion about terms is that I need to
choose
name for correspondent index option.
Simon think that we do not need this option at all. In this case we
should
not worry about right term.
  From my point of view, "projection" is quite clear notion and not only
for
mathematics. It is also widely used in IT and especially in DBMSes.

If we do have an option it won't be using fancy mathematical
terminology at all, it would be described in terms of its function,
e.g. recheck_on_update

Yes, I'd rather not have an option at all, just some simple code with
useful effect, like we have in many other places.


Attached please find new version of projection functional index optimization
patch.
I have implemented very simple autotune strategy: now I use table statistic
to compare total number of updates with number of hot updates.
If fraction of hot updates is relatively small, then there is no sense to
spend time performing extra evaluation of index expression and comparing its
old and new values.
Right now the formula is the following:

#define MIN_UPDATES_THRESHOLD 10
#define HOT_RATIO_THRESHOLD   2

 if (stat->tuples_updated > MIN_UPDATES_THRESHOLD
 && stat->tuples_updated >
stat->tuples_hot_updated*HOT_RATIO_THRESHOLD)
 {
 /* If percent of hot updates is small, then disable projection
index function
  * optimization to eliminate overhead of extra index expression
evaluations.
  */
 ii->ii_Projection = false;
 }

This threshold values are pulled out of a hat: I am not sure if this
heuristic is right.
I will be please to get feedback if such approach to autotune is promising.

Hmm, not really, but thanks for trying.

This works by looking at overall stats, and only looks at the overall
HOT %, so its too heavyweight and coarse.

I suggested storing stat info on the relcache and was expecting you
would look at how often the expression evaluates to new == old. If we
evaluate new against old many times, then if the success rate is low
we should stop attempting the comparison. (<10%?)

Another idea:
If we don't make a check when we should have done then we will get a
non-HOT update, so we waste time extra time difference between a HOT
and non-HOT update. If we check and fail we waste time take to perform
check. So the question is how expensive the check is against how
expensive a non-HOT update is. Could we simply say we don't bother to
check functions that have a cost higher than 1? So if the user
doesn't want to perform the check they can just increase the cost of
the function above the check threshold?

Attached please find one more patch, which calculates the hot update check 
hit rate more precisely: I had to extend PgStat_StatTabEntry with two 
new fields: hot_update_hits and hot_update_misses.

Concerning your idea to check the cost of the index function: it certainly makes 
sense.
The only problem: I do not understand right now how to calculate this cost.
It can easily be calculated by the optimizer when it is building the query 
execution plan.
But inside BuildIndexInfo I just have a reference to the Relation and have no 
idea how to propagate information about the index expression cost from the 
optimizer to this place.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 83ee7d3..52189ac 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -294,8 +294,33 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] 
 The optional WITH clause specifies storage
 parameters for the index.  Each index method has its own set of allowed
-storage parameters.  The B-tree, hash, GiST and SP-GiST index methods all
-accept this parameter:
+storage parameters. All indexes accept the following parameter:
+   
+
+   
+   
+projection
+
+ 
+   Functional index is based on on projection function: function which extract subset of its argument.
+   In mathematic such functions are called non-injective. For injective function if any attribute used in the indexed
+   expression is changed, then value of index expression is also changed. So to check that index is affected by the
+   update, it is enough to check the set of changed fields. By default this parameters is assigned true value and function is considered
+   as non-injective.
+   In this case change of any of indexed key doesn't mean that value of the function is changed. For example, for
+   the expression expression(bookinfo-'isbn') defined
+

Re: [HACKERS] Surjective functional indexes

2017-09-14 Thread Konstantin Knizhnik



On 13.09.2017 14:00, Simon Riggs wrote:

On 13 September 2017 at 11:30, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:


The only reason of all this discussion about terms is that I need to choose
name for correspondent index option.
Simon think that we do not need this option at all. In this case we should
not worry about right term.
 From my point of view, "projection" is quite clear notion and not only for
mathematics. It is also widely used in IT and especially in DBMSes.

If we do have an option it won't be using fancy mathematical
terminology at all, it would be described in terms of its function,
e.g. recheck_on_update

Yes, I'd rather not have an option at all, just some simple code with
useful effect, like we have in many other places.

Attached please find a new version of the projection functional index 
optimization patch.
I have implemented a very simple autotune strategy: now I use the table 
statistics to compare the total number of updates with the number of HOT updates.
If the fraction of HOT updates is relatively small, then there is no sense 
in spending time performing the extra evaluation of the index expression and 
comparing its old and new values.

Right now the formula is the following:

#define MIN_UPDATES_THRESHOLD 10
#define HOT_RATIO_THRESHOLD   2

    if (stat->tuples_updated > MIN_UPDATES_THRESHOLD
        && stat->tuples_updated > stat->tuples_hot_updated * HOT_RATIO_THRESHOLD)
    {
        /*
         * If the percentage of HOT updates is small, then disable the projection
         * index function optimization to eliminate the overhead of extra index
         * expression evaluations.
         */
        ii->ii_Projection = false;
    }

These threshold values are pulled out of a hat: I am not sure if this 
heuristic is right.
I will be pleased to get feedback on whether such an approach to autotuning is 
promising.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 83ee7d3..52189ac 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -294,8 +294,33 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] 
 The optional WITH clause specifies storage
 parameters for the index.  Each index method has its own set of allowed
-storage parameters.  The B-tree, hash, GiST and SP-GiST index methods all
-accept this parameter:
+storage parameters. All indexes accept the following parameter:
+   
+
+   
+   
+projection
+
+ 
+   Functional index is based on on projection function: function which extract subset of its argument.
+   In mathematic such functions are called non-injective. For injective function if any attribute used in the indexed
+   expression is changed, then value of index expression is also changed. So to check that index is affected by the
+   update, it is enough to check the set of changed fields. By default this parameters is assigned true value and function is considered
+   as non-injective.
+   In this case change of any of indexed key doesn't mean that value of the function is changed. For example, for
+   the expression expression(bookinfo-'isbn') defined
+   for column of JSON type is changed only when ISBN is changed, which rarely happen. The same is true for most
+   functional indexes. For non-injective functions, Postgres compares values of indexed expression for old and updated tuple and updates
+   index only when function results are different. It allows to eliminate index update and use HOT update.
+   But there are extra evaluations of the functions. So if function is expensive or probability that change of indexed column will not effect
+   the function value is small, then marking index as projection may increase update speed.
+
+
+   
+   
+
+   
+ The B-tree, hash, GiST and SP-GiST index methods all accept this parameter:

 

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index ec10762..b73165f 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -130,6 +130,15 @@ static relopt_bool boolRelOpts[] =
 	},
 	{
 		{
+			"projection",
+			"Evaluate functional index expression on update to check if its values is changed",
+			RELOPT_KIND_INDEX,
+			AccessExclusiveLock
+		},
+		true
+	},
+	{
+		{
 			"security_barrier",
 			"View acts as a row security barrier",
 			RELOPT_KIND_VIEW,
@@ -1301,7 +1310,7 @@ fillRelOptions(void *rdopts, Size basesize,
 break;
 			}
 		}
-		if (validate && !found)
+		if (validate && !found && options[i].gen->kinds != RELOPT_KIND_INDEX)
 			elog(ERROR, "reloption \"%s\" not found in parse table",
  options[i].gen->name);
 	}
diff --git a/src/

Re: [HACKERS] Surjective functional indexes

2017-09-13 Thread Konstantin Knizhnik



On 13.09.2017 14:00, Simon Riggs wrote:

On 13 September 2017 at 11:30, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:


The only reason of all this discussion about terms is that I need to choose
name for correspondent index option.
Simon think that we do not need this option at all. In this case we should
not worry about right term.
 From my point of view, "projection" is quite clear notion and not only for
mathematics. It is also widely used in IT and especially in DBMSes.

If we do have an option it won't be using fancy mathematical
terminology at all, it would be described in terms of its function,
e.g. recheck_on_update

Yes, I'd rather not have an option at all, just some simple code with
useful effect, like we have in many other places.


Yehhh,
After more thinking I found out that my idea to use table/index 
statistics (in particular the number of distinct values) to determine 
projection functions was wrong.
Consider a column bookinfo of jsonb type and the index expression 
(bookinfo->'ISBN').
Both can be considered unique. But it is an obvious example of a 
projection function, whose value is not changed if we update other 
information related to the book.

So this approach doesn't work. It looks like the only thing we can do to 
autotune is to collect our own statistic: how frequently a change of the 
attribute(s) does not affect the result of the function.
By default we can consider the function a projection and perform the 
comparison of old/new function results.
If after some number of comparisons the fraction of hits (where the value of 
the function is not changed) is smaller than some threshold (0.5?, 0.9?, ...), 
then we can mark the index as non-projective
and eliminate these checks in the future. But it will require extending the 
index statistics. Do we really need/want it?


Despite the possibility of implementing autotuning, I still think that we 
should have a manual switch, no matter how it is named.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Surjective functional indexes

2017-09-13 Thread Konstantin Knizhnik



On 13.09.2017 13:14, Christoph Berg wrote:

Re: Konstantin Knizhnik 2017-09-13 
<2393c4b3-2ec4-dc68-4ea9-670597b56...@postgrespro.ru>


On 13.09.2017 10:51, Christoph Berg wrote:

Re: Konstantin Knizhnik 2017-09-01 
<f530ede0-1bf6-879c-c362-34325514f...@postgrespro.ru>

+   Functional index is based on on projection function: function which 
extract subset of its argument.
+   In mathematic such functions are called non-injective. For injective 
function if any attribute used in the indexed
+   expression is changed, then value of index expression is also changed.

This is Just Wrong. I still think what you are doing here doesn't have
anything to do with the function being injective or not.

Sorry, can you please explain what is wrong?

I don't get why you are reasoning about "projection" ->
"non-injective" -> "injective". Can't you try to explain what this
functionality is about without abusing math terms that just mean
something else in the rest of the world?


I tried to explain it in my previous e-mail. In most cases (it is just 
my feeling, maybe it is wrong), functional indexes are built for complex 
types like JSON, arrays, structs, ...,
and the index expression extracts some component of this compound value. It 
means that even if the underlying column changes, there is a good chance 
that the value of the index function is not changed. So there is no need to 
update the index and we can use HOT, which can increase performance several 
times.


The only reason for all this discussion about terms is that I need to 
choose a name for the corresponding index option.
Simon thinks that we do not need this option at all. In that case we 
should not worry about the right term.
From my point of view, "projection" is a quite clear notion, and not only 
in mathematics. It is also widely used in IT and especially in DBMSes.


--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Surjective functional indexes

2017-09-13 Thread Konstantin Knizhnik



On 13.09.2017 10:51, Christoph Berg wrote:

Re: Konstantin Knizhnik 2017-09-01 
<f530ede0-1bf6-879c-c362-34325514f...@postgrespro.ru>

+   Functional index is based on on projection function: function which 
extract subset of its argument.
+   In mathematic such functions are called non-injective. For injective 
function if any attribute used in the indexed
+   expression is changed, then value of index expression is also changed.

This is Just Wrong. I still think what you are doing here doesn't have
anything to do with the function being injective or not.


Sorry, can you please explain what is wrong?
The problem I am trying to solve comes from a particular use case: a 
functional index on part of a JSON column.
Usually such an index is built on persistent attributes which are rarely 
changed, like ISBN...
Right now any update of the JSON column disables HOT updates, even if the 
update doesn't really affect the index.
So instead of disabling HOT just based on the mask of modified attributes, I 
suggest comparing the old and new values of the index expression.

Such behavior can significantly (several times) increase performance, 
but only for "projection" functions.
There was a long discussion in this thread about the right notion for such 
functions (surjective, non-injective, projection),
but I think the criterion is quite obvious.

Simon proposes to eliminate the "projection" property and use autotuning to 
determine the optimal behavior.
I still think that such an option will be useful, but we can really use 
statistics to compare the number of unique values for the index function and 
for its argument(s).
If they are similar, then most likely the function is injective, so it 
produces different results for different argument values.
Then there is no sense in spending extra CPU time calculating the old and new 
values of the function.

This is what I am going to implement now.

So I would be pleased if you could explain your concerns and 
suggestions (if you have any) more precisely.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Surjective functional indexes

2017-09-12 Thread Konstantin Knizhnik



On 12.09.2017 19:28, Simon Riggs wrote:

On 1 September 2017 at 09:47, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

On 01.09.2017 09:25, Simon Riggs wrote:

On 1 September 2017 at 05:40, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:

On Fri, Jun 9, 2017 at 8:08 PM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

Attached please find rebased version of the patch.
Now "projection" attribute is used instead of surjective/injective.

Hi Konstantin,

This still applies but it doesn't compile after commits 2cd70845 and
c6293249.  You need to change this:

Form_pg_attribute att = RelationGetDescr(indexDesc)->attrs[i];

... to this:

Form_pg_attribute att = TupleDescAttr(RelationGetDescr(indexDesc),
i);

Thanks!

Does the patch work fully with that change? If so, I will review.


Attached please find rebased version of the patch.
Yes, I checked that it works after this fix.
Thank you in advance for review.

Thanks for the patch. Overall looks sound and I consider that we are
working towards commit for this.

The idea is that we default "projection = on", and can turn it off in
case the test is expensive. Why bother to have the option? (No docs at
all then!) Why not just evaluate the test and autotune whether to make
the test again in the future? That way we can avoid having an option
completely. I am imagining collecting values on the relcache entry for
the index.


Autotuning is definitely a good thing. But I do not think that excludes 
having an explicit parameter for manual tuning.
For some functional indexes the DBA or programmer knows for sure that the index 
does not perform a projection,
for example if it translates or changes the encoding of the original key. It 
seems to me that we should make it possible to
declare such an index as non-projective and not rely on autotuning.

Also I have some doubts concerning using autotuning in this case. First of 
all, it is very hard to estimate the complexity of the test.
How can we measure it? Calculate the average execution time? It can vary between 
different systems and greatly depends on system load...
Somehow calculate the cost of the indexed expression? That may also not always 
produce the expected result.

Moreover, in some cases the test may be inexpensive but still useless, if the 
index expression specifies a one-to-one mapping (for example a function 
reversing the key).
Autotuning will never be able to reliably determine whether an indexed 
expression is a projection or not.

It seems more precise to compare the statistics for the source column and the 
index expression.
If they are similar, then most likely the index expression is not a 
projection...

I will think more about it; a rough sketch of such a comparison is below.
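
Something along these lines is what I have in mind (a sketch only; it assumes the 
expression index statistics are visible in pg_stats under the index name with 
attname 'expr', and the table/index names are made up):

SELECT c.n_distinct AS column_ndistinct,
       e.n_distinct AS expr_ndistinct
FROM pg_stats c, pg_stats e
WHERE c.tablename = 'books'          AND c.attname = 'bookinfo'
  AND e.tablename = 'books_isbn_idx' AND e.attname = 'expr';
-- similar n_distinct values would suggest an injective (non-projection) expression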


To implement autotuning we would need to instrument the execution. We
could then display the collected value via EXPLAIN, so we could just
then use EXPLAIN in your tests rather than implementing a special
debug mode just for testing. We could also pass that information thru
to stats as well.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Cached plans and statement generalization

2017-09-12 Thread Konstantin Knizhnik



On 11.09.2017 12:24, Konstantin Knizhnik wrote:

Attached please find rebased version of the patch.
There are the updated performance results (pgbench -s 100 -c 1):

protocol (-M)         read-write   read-only (-S)
simple                      3327            19325
extended                    2256            16908
prepared                    6145            39326
simple+autoprepare          4950            34841




One more patch, passing all regression tests with autoprepare_threshold=1.
I still do not think that it should be switched on by default...
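
For reference, usage would look roughly like this (a sketch; the GUC name 
autoprepare_threshold comes from the patch, but my reading of its semantics - the 
number of executions of the same literal statement before it is generalized and 
cached - is an assumption):

SET autoprepare_threshold = 5;
-- after five executions this literal query is generalized to
-- 'SELECT abalance FROM pgbench_accounts WHERE aid = $1' and its plan is reused
SELECT abalance FROM pgbench_accounts WHERE aid = 26;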

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index e3eb0c5..17f3dfd 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -3688,6 +3688,454 @@ raw_expression_tree_walker(Node *node,
 }
 
 /*
+ * raw_expression_tree_mutator --- transform raw parse tree.
+ *
+ * This function implements a slightly different approach to tree updates than expression_tree_mutator().
+ * The callback is given a pointer to the pointer to the current node and can update that field instead of returning a reference to a new node.
+ * This makes it possible to remember changes and easily revert them without an extra traversal of the tree.
+ *
+ * This function does not need the QTW_DONT_COPY_QUERY flag: it never implicitly copies tree nodes, doing in-place updates.
+ *
+ * Like raw_expression_tree_walker, there is no special rule about query
+ * boundaries: we descend to everything that's possibly interesting.
+ *
+ * Currently, the node type coverage here extends only to DML statements
+ * (SELECT/INSERT/UPDATE/DELETE) and nodes that can appear in them, because
+ * this is used mainly during analysis of CTEs, and only DML statements can
+ * appear in CTEs. If some other node is visited, iteration is immediately stopped and true is returned.
+ */
+bool
+raw_expression_tree_mutator(Node *node,
+			bool (*mutator) (),
+			void *context)
+{
+	ListCell   *temp;
+
+	/*
+	 * The walker has already visited the current node, and so we need only
+	 * recurse into any sub-nodes it has.
+	 */
+	if (node == NULL)
+		return false;
+
+	/* Guard against stack overflow due to overly complex expressions */
+	check_stack_depth();
+
+	switch (nodeTag(node))
+	{
+		case T_SetToDefault:
+		case T_CurrentOfExpr:
+		case T_Integer:
+		case T_Float:
+		case T_String:
+		case T_BitString:
+		case T_Null:
+		case T_ParamRef:
+		case T_A_Const:
+		case T_A_Star:
+			/* primitive node types with no subnodes */
+			break;
+		case T_Alias:
+			/* we assume the colnames list isn't interesting */
+			break;
+		case T_RangeVar:
+			return mutator(&((RangeVar *) node)->alias, context);
+		case T_GroupingFunc:
+			return mutator(&((GroupingFunc *) node)->args, context);
+		case T_SubLink:
+			{
+SubLink	   *sublink = (SubLink *) node;
+
+if (mutator(&sublink->testexpr, context))
+	return true;
+/* we assume the operName is not interesting */
+if (mutator(&sublink->subselect, context))
+	return true;
+			}
+			break;
+		case T_CaseExpr:
+			{
+CaseExpr   *caseexpr = (CaseExpr *) node;
+
+if (mutator(&caseexpr->arg, context))
+	return true;
+/* we assume the mutator doesn't care about CaseWhens, either */
+foreach(temp, caseexpr->args)
+{
+	CaseWhen   *when = (CaseWhen *) lfirst(temp);
+
+	Assert(IsA(when, CaseWhen));
+	if (mutator(&when->expr, context))
+		return true;
+	if (mutator(&when->result, context))
+		return true;
+}
+if (mutator(&caseexpr->defresult, context))
+	return true;
+			}
+			break;
+		case T_RowExpr:
+			/* Assume colnames isn't interesting */
+			return mutator(&((RowExpr *) node)->args, context);
+		case T_CoalesceExpr:
+			return mutator(&((CoalesceExpr *) node)->args, context);
+		case T_MinMaxExpr:
+			return mutator(&((MinMaxExpr *) node)->args, context);
+		case T_XmlExpr:
+			{
+XmlExpr	   *xexpr = (XmlExpr *) node;
+
+if (mutator(&xexpr->named_args, context))
+	return true;
+/* we assume the mutator doesn't care about arg_names */
+if (mutator(&xexpr->args, context))
+	return true;
+			}
+			break;
+		case T_NullTest:
+			return mutator(&((NullTest *) node)->arg, context);
+		case T_BooleanTest:
+			return mutator(&((BooleanTest *) node)->arg, context);
+		case T_JoinExpr:
+			{
+JoinExpr   *join = (JoinExpr *) node;
+
+if (mutator(&join->larg, context))
+	return true;
+if (mutator(&join->rarg, context))
+	return true;
+if (mutator(&join->quals, context))
+	return true;
+if (mutator(&join->alias, context))
+	return true;
+/* using list is deemed uninteresting */
+			}
+			break;
+		case T_IntoClause:
+			{
+IntoClause *into = (IntoClause *) node;
+
+if (mutator(&into->rel, context))
+	return true;
+/* colNames, options are deemed uninteresting */
+/* viewQuery should be null in raw parsetree, but check it */
+if (mutato

Re: [HACKERS] Cached plans and statement generalization

2017-09-11 Thread Konstantin Knizhnik



On 09.09.2017 06:35, Thomas Munro wrote:

On Fri, May 26, 2017 at 3:54 AM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

Attached please find rebased version of the autoprepare patch based on Tom's
proposal (perform analyze for tree with constant literals and then replace
them with parameters).
Also I submitted this patch for the Autumn commitfest.

The patch didn't survive the Summer bitrotfest.  Could you please rebase it?


Attached please find rebased version of the patch.
Here are the updated performance results (pgbench -s 100 -c 1):

protocol (-M)      | read-write | read-only (-S)
-------------------+------------+----------------
simple             |       3327 |          19325
extended           |       2256 |          16908
prepared           |       6145 |          39326
simple+autoprepare |       4950 |          34841



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index e3eb0c5..17f3dfd 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -3688,6 +3688,454 @@ raw_expression_tree_walker(Node *node,
 }
 
 /*
+ * raw_expression_tree_mutator --- transform raw parse tree.
+ *
+ * This function implements a slightly different approach to tree updates than expression_tree_mutator().
+ * The callback is given a pointer to the pointer to the current node and can update this field instead of returning a reference to a new node.
+ * This makes it possible to remember changes and easily revert them without an extra traversal of the tree.
+ *
+ * This function does not need the QTW_DONT_COPY_QUERY flag: it never implicitly copies tree nodes; it updates them in place.
+ *
+ * Like raw_expression_tree_walker, there is no special rule about query
+ * boundaries: we descend to everything that's possibly interesting.
+ *
+ * Currently, the node type coverage here extends only to DML statements
+ * (SELECT/INSERT/UPDATE/DELETE) and nodes that can appear in them, because
+ * this is used mainly during analysis of CTEs, and only DML statements can
+ * appear in CTEs. If some other node is visited, iteration is immediately stopped and true is returned.
+ */
+bool
+raw_expression_tree_mutator(Node *node,
+			bool (*mutator) (),
+			void *context)
+{
+	ListCell   *temp;
+
+	/*
+	 * The walker has already visited the current node, and so we need only
+	 * recurse into any sub-nodes it has.
+	 */
+	if (node == NULL)
+		return false;
+
+	/* Guard against stack overflow due to overly complex expressions */
+	check_stack_depth();
+
+	switch (nodeTag(node))
+	{
+		case T_SetToDefault:
+		case T_CurrentOfExpr:
+		case T_Integer:
+		case T_Float:
+		case T_String:
+		case T_BitString:
+		case T_Null:
+		case T_ParamRef:
+		case T_A_Const:
+		case T_A_Star:
+			/* primitive node types with no subnodes */
+			break;
+		case T_Alias:
+			/* we assume the colnames list isn't interesting */
+			break;
+		case T_RangeVar:
+			return mutator(&((RangeVar *) node)->alias, context);
+		case T_GroupingFunc:
+			return mutator(&((GroupingFunc *) node)->args, context);
+		case T_SubLink:
+			{
+SubLink	   *sublink = (SubLink *) node;
+
+if (mutator(&sublink->testexpr, context))
+	return true;
+/* we assume the operName is not interesting */
+if (mutator(&sublink->subselect, context))
+	return true;
+			}
+			break;
+		case T_CaseExpr:
+			{
+CaseExpr   *caseexpr = (CaseExpr *) node;
+
+if (mutator(&caseexpr->arg, context))
+	return true;
+/* we assume the mutator doesn't care about CaseWhens, either */
+foreach(temp, caseexpr->args)
+{
+	CaseWhen   *when = (CaseWhen *) lfirst(temp);
+
+	Assert(IsA(when, CaseWhen));
+	if (mutator(&when->expr, context))
+		return true;
+	if (mutator(&when->result, context))
+		return true;
+}
+if (mutator(&caseexpr->defresult, context))
+	return true;
+			}
+			break;
+		case T_RowExpr:
+			/* Assume colnames isn't interesting */
+			return mutator(&((RowExpr *) node)->args, context);
+		case T_CoalesceExpr:
+			return mutator(&((CoalesceExpr *) node)->args, context);
+		case T_MinMaxExpr:
+			return mutator(&((MinMaxExpr *) node)->args, context);
+		case T_XmlExpr:
+			{
+XmlExpr	   *xexpr = (XmlExpr *) node;
+
+if (mutator(&xexpr->named_args, context))
+	return true;
+/* we assume the mutator doesn't care about arg_names */
+if (mutator(&xexpr->args, context))
+	return true;
+			}
+			break;
+		case T_NullTest:
+			return mutator(&((NullTest *) node)->arg, context);
+		case T_BooleanTest:
+			return mutator(&((BooleanTest *) node)->arg, context);
+		case T_JoinExpr:
+			{
+JoinExpr   *join = (JoinExpr *) node;
+
+if (mutator(&join->larg, context))
+	return true;
+if (mutator(&join->rarg, context))
+	return true;
+if (mutator(&join->quals, context))
+	return true;
+if (mutator(&join->alias, context))
+	return true;
+/* using list is deemed uninteresting */
+			}
+			break;
+		

Re: [HACKERS] Secondary index access optimizations

2017-09-07 Thread Konstantin Knizhnik



On 07.09.2017 13:00, Thomas Munro wrote:

On Sun, Sep 3, 2017 at 4:34 AM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

Thank you for review.
I attached new version of the patch with
remove_restrictions_implied_by_constraints() function.
Concerning failed tests - this is actually result of this optimization:
extra filter conditions are removed from query plans.
Sorry, I have not included updated version of expected test output files to
the patch.
Now I did it.

A regression test under contrib/postgres_fdw now fails:

- Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T
1" WHERE (("C 1" IS NOT NULL))
+ Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1"

(("C 1" IS NOT NULL)) is indeed redundant in that case, because column
"C 1" was declared to be NOT NULL.  But:

1.  Do we want to go this far?  Note that this is not involving
inheritance and constraint exclusion.  I don't immediately see any
reason why not, but I'm not sure.

2.  If yes, then this postgres_fdw test should be changed, because I
think it was trying to demonstrate that IS NOT NULL expressions are
sent to remote databases -- it would need to be changed so that it
tries that with a column that is actually nullable.

I do not see any reason why we should disable this optimization in the case of FDW.

And disabling it would require some extra effort...

So I have updated test for postgres_fdw, replacing query
 SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL;
with
 SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL and c3 is not null;

Now it checks two things:
1. That the NOT NULL check is passed to the foreign server for a nullable column (c3).
2. That the NOT NULL check is excluded from the query execution plan when it can be omitted because the column is not nullable.


An updated version of the patch is attached to this mail.
I also added support for the date type to operator_predicate_proof(), to be able to imply (logdate <= '2017-03-31') from (logdate < '2017-04-01').
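
A quick sanity check of that implication for the date type (dates are discrete, so the two predicates agree on every possible value):

SELECT d,
       d <  DATE '2017-04-01' AS lt_apr_1,
       d <= DATE '2017-03-31' AS le_mar_31
  FROM (VALUES (DATE '2017-03-30'),
               (DATE '2017-03-31'),
               (DATE '2017-04-01')) AS t(d);
-- both flags are true for 2017-03-30 and 2017-03-31 and false for 2017-04-01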


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/postgres_fdw/expected/postgres_fdw.out b/contrib/postgres_fdw/expected/postgres_fdw.out
index c19b331..a9cce14 100644
--- a/contrib/postgres_fdw/expected/postgres_fdw.out
+++ b/contrib/postgres_fdw/expected/postgres_fdw.out
@@ -626,12 +626,12 @@ EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NULL;-- Nu
Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE (("C 1" IS NULL))
 (3 rows)
 
-EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL;-- NullTest
- QUERY PLAN  
--
+EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL and c3 is not null;-- NullTest
+QUERY PLAN
+--
  Foreign Scan on public.ft1 t1
Output: c1, c2, c3, c4, c5, c6, c7, c8
-   Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE (("C 1" IS NOT NULL))
+   Remote SQL: SELECT "C 1", c2, c3, c4, c5, c6, c7, c8 FROM "S 1"."T 1" WHERE ((c3 IS NOT NULL))
 (3 rows)
 
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE round(abs(c1), 0) = 1; -- FuncExpr
diff --git a/contrib/postgres_fdw/sql/postgres_fdw.sql b/contrib/postgres_fdw/sql/postgres_fdw.sql
index 5f65d9d..2d816db 100644
--- a/contrib/postgres_fdw/sql/postgres_fdw.sql
+++ b/contrib/postgres_fdw/sql/postgres_fdw.sql
@@ -292,7 +292,7 @@ RESET enable_nestloop;
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE t1.c1 = 1; -- Var, OpExpr(b), Const
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE t1.c1 = 100 AND t1.c2 = 0; -- BoolExpr
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NULL;-- NullTest
-EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL;-- NullTest
+EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 IS NOT NULL and c3 is not null;-- NullTest
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE round(abs(c1), 0) = 1; -- FuncExpr
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE c1 = -c1;  -- OpExpr(l)
 EXPLAIN (VERBOSE, COSTS OFF) SELECT * FROM ft1 t1 WHERE 1 = c1!;   -- OpExpr(r)
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 2d7e1d8..5de67ce 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -344,6 +344,7 @@ s

Re: [HACKERS] JIT compiling expressions/deform + inlining prototype v2.0

2017-09-05 Thread Konstantin Knizhnik



On 04.09.2017 23:52, Andres Freund wrote:


Yea, I've changed that already, although it's currently added earlier,
because the alignment is needed before, to access the column correctly.
I've also made number of efficiency improvements, primarily to access
columns with an absolute offset if all preceding ones are fixed width
not null columns - that is quite noticeable performancewise.


Unfortunately, in most real tables columns are nullable.
I wonder if we can perform some optimization in this case (assuming that in typical cases a column contains either mostly non-null values or mostly null values).


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Secondary index access optimizations

2017-09-05 Thread Konstantin Knizhnik



On 05.09.2017 04:02, Amit Langote wrote:

Like Thomas, I'm not so sure about the whole predtest.c patch.  The core
logic in operator_predicate_proof() should be able to conclude that, say,
k < 21 is implied by k <= 20, which you are trying to address with some
special case code.  If there is still problem you think need to be fixed
here, a better place to look at would be somewhere around get_btree_test_op().


Frankly speaking, I also do not like this part of my patch.
I will be pleased if you or somebody else can propose a better solution.
I do not understand how get_btree_test_op() can help here.

Yes, k < 21 is implied by k <= 20. That is based on generic properties of the < and <= operators.
But I need to prove something different: having the table partition constraint (k < 21), I want to remove the predicate (k <= 20) from the query.
In other words, operator_predicate_proof() should be able to conclude that (k <= 20) is implied by (k < 21).
But this is true only for integer types, not for floating point types. And the Postgres operator definition doesn't provide a way to determine that a user defined type is an integer-like type, i.e. one whose values make such a conclusion valid.
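
A trivial counterexample shows why the proof step is safe only for discrete types: a value like 20.5 satisfies (k < 21) but not (k <= 20), while no integer lies strictly between 20 and 21.

SELECT 20.5 < 21 AS satisfies_lt_21, 20.5 <= 20 AS satisfies_le_20;  -- true, false
SELECT 20   < 21 AS satisfies_lt_21, 20   <= 20 AS satisfies_le_20;  -- true, true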


Why do I think it is important? Certainly, it is possible to rewrite the query as (k < 21), and then no changes in operator_predicate_proof() are needed.
Assume the most natural use case: I have some positive integer key and I want to range partition the table by this key, for example with interval 10000.
Currently the standard PostgreSQL partitioning mechanism requires specifying intervals with an open high boundary.
So if I want the first partition to contain the interval [1,10000], the second - [10001,20001],... I have to create the partitions this way:


create table bt (k integer, v integer) partition by range (k);
create table dt1 partition of bt for values from (1) to (10001);
create table dt2 partition of bt for values from (10001) to (20001);
...

If I want to write a query inspecting data of a particular partition, then most likely I will use the BETWEEN operator:

SELECT * FROM bt WHERE k BETWEEN 1 and 10000;

But right now operator_predicate_proof() is not able to conclude that the predicate (k BETWEEN 1 and 10000), transformed to (k >= 1 AND k <= 10000), is equivalent to (k >= 1 AND k < 10001), which is used as the partition constraint.

Another very popular use case (mentioned, for example, in the PostgreSQL partitioning documentation: https://www.postgresql.org/docs/10/static/ddl-partitioning.html) is using a date as the partition key:

CREATE TABLE measurement (
    city_id     int not null,
    logdate     date not null,
    peaktemp    int,
    unitsales   int
) PARTITION BY RANGE (logdate);


CREATE TABLE measurement_y2006m03 PARTITION OF measurement
FOR VALUES FROM ('2006-03-01') TO ('2006-04-01')


Assume that I now want to get the measurements for March.

There are three ways to write this query:

select * from measurement where extract(month from logdate) = 3;
select * from measurement where logdate between '2006-03-01' AND '2006-03-31';
select * from measurement where logdate >= '2006-03-01' AND logdate < '2006-04-01';

Right now an optimal query plan will be constructed only for the last query.
Unfortunately, my patch does not yet cover the date type.
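
The difference can be seen by comparing the plans of the second and the third form against the measurement table above (a sketch; plan output omitted because it depends on the patch): the half-open form matches the partition constraint literally, while the BETWEEN form needs exactly the kind of <= / < implication discussed here.

-- needs the (logdate <= '2006-03-31') implied-by (logdate < '2006-04-01') proof
EXPLAIN SELECT * FROM measurement
 WHERE logdate BETWEEN '2006-03-01' AND '2006-03-31';

-- matches the partition constraint (logdate >= '2006-03-01' AND logdate < '2006-04-01') exactly
EXPLAIN SELECT * FROM measurement
 WHERE logdate >= '2006-03-01' AND logdate < '2006-04-01';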



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] JIT compiling expressions/deform + inlining prototype v2.0

2017-09-04 Thread Konstantin Knizhnik



On 01.09.2017 09:41, Andres Freund wrote:

Hi,

I previously had an early prototype of JITing [1] expression evaluation
and tuple deforming.  I've since then worked a lot on this.

Here's an initial, not really pretty but functional, submission. This
supports all types of expressions, and tuples, and allows, albeit with
some drawbacks, inlining of builtin functions.  Between the version at
[1] and this I'd done some work in c++, because that allowed to
experiment more with llvm, but I've now translated everything back.
Some features I'd to re-implement due to limitations of C API.


I've whacked this around quite heavily today, this likely has some new
bugs, sorry for that :(


Can you please clarify the following fragment calculating attributes 
alignment:



/* compute what following columns are aligned to */
+if (att->attlen < 0)
+{
+/* can't guarantee any alignment after varlen field */
+attcuralign = -1;
+}
+else if (att->attnotnull && attcuralign >= 0)
+{
+Assert(att->attlen > 0);
+attcuralign += att->attlen;
+}
+else if (att->attnotnull)
+{
+/*
+ * After a NOT NULL fixed-width column, alignment is
+ * guaranteed to be the minimum of the forced alignment and
+ * length.  XXX
+ */
+attcuralign = alignto + att->attlen;
+Assert(attcuralign > 0);
+}
+else
+{
+//elog(LOG, "attnotnullreset: %d", attnum);
+attcuralign = -1;
+}


I wonder why in this branch (att->attnotnull && attcuralign >= 0) we are not adding "alignto", and why the comment in the following branch, else if (att->attnotnull), seems unrelated to that branch, because in that case attcuralign is expected to be less than zero, which means that the previous attribute is a varlen field.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Secondary index access optimizations

2017-09-04 Thread Konstantin Knizhnik



On 04.09.2017 12:59, Amit Langote wrote:

Hi Konstantin,

On 2017/09/04 18:19, Konstantin Knizhnik wrote:

On 04.09.2017 05:38, Amit Langote wrote:

On 2017/09/02 12:44, Thomas Munro wrote:

On Wed, Aug 16, 2017 at 9:23 PM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

postgres=# explain select * from bt where k between 1 and 20000 and v = 100;
QUERY PLAN
--
   Append  (cost=0.29..15.63 rows=2 width=8)
 ->  Index Scan using dti1 on dt1  (cost=0.29..8.30 rows=1 width=8)
   Index Cond: (v = 100)
 ->  Index Scan using dti2 on dt2  (cost=0.29..7.33 rows=1 width=8)
   Index Cond: (v = 100)
   Filter: (k <= 20000)
(6 rows)

+1

This seems like a good feature to me: filtering stuff that is
obviously true is a waste of CPU cycles and may even require people to
add redundant stuff to indexes.  I was pondering something related to
this over in the partition-wise join thread (join quals that are
implied by partition constraints and should be discarded).

It'd be interesting to get Amit Langote's feedback, so I CC'd him.
I'd be surprised if he and others haven't got a plan or a patch for
this down the back of the sofa.

I agree that that's a good optimization in the cases it's correct.  Given
that check_index_predicates() already applies the same optimization when
considering using a partial index, it might make sense to try to do the
same even earlier for the table itself using its CHECK / NOT NULL
constraints as predicates (I said *earlier* because
relation_excluded_by_constrains happens for a relation before we look at
its indexes).  Also, at the end of relation_excluded_by_constraints() may
not be such a bad place to do this.

By the way, I read in check_index_predicates() that we should not apply
this optimization if the relation in question is a target of UPDATE /
DELETE / SELECT FOR UPDATE.

Please correct me if I am wrong, but it seems to me that in the case of table constraints it is not necessary to specially handle the update case.
As far as I understand, we need to leave the predicate in the plan in the case of partial indexes because, due to the "read committed" isolation policy, we may need to recheck that the tuple still satisfies the update condition (the tuple can be changed by some other committed transaction while we are waiting for it, and may not satisfy this condition any more).
But no transaction can change a tuple in such a way that it violates table constraints, right? So we do not need to recheck it.

Actually, I don't really know why check_index_predicates() skips this
optimization in the target relation case, just wanted to point out that
that's so.

Thinking a bit from what you wrote, maybe we need not worry about
EvalPlanQual in the context of your proposed optimization based on the
table's constraints.


Concerning your suggestion to merge the check_index_predicates() and remove_restrictions_implied_by_constraints() functions: maybe it can be done, but frankly speaking I do not see much sense in it - there are too many differences between these functions and too little code to reuse.

Maybe, you meant to address Thomas here. :)  Reading his comment again, I
too am a bit concerned about destructively modifying the input rel's
baserestrictinfo.  There should at least be a comment that that's being done.
But I have taken Thomas's comment into account and extracted the code updating the relation's baserestrictinfo from relation_excluded_by_constraints() into a remove_restrictions_implied_by_constraints() function. It is included in the new version of the patch.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Secondary index access optimizations

2017-09-04 Thread Konstantin Knizhnik



On 04.09.2017 05:38, Amit Langote wrote:

On 2017/09/02 12:44, Thomas Munro wrote:

On Wed, Aug 16, 2017 at 9:23 PM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

postgres=# explain select * from bt where k between 1 and 20000 and v = 100;
   QUERY PLAN
--
  Append  (cost=0.29..15.63 rows=2 width=8)
->  Index Scan using dti1 on dt1  (cost=0.29..8.30 rows=1 width=8)
  Index Cond: (v = 100)
->  Index Scan using dti2 on dt2  (cost=0.29..7.33 rows=1 width=8)
  Index Cond: (v = 100)
  Filter: (k <= 20000)
(6 rows)

+1

This seems like a good feature to me: filtering stuff that is
obviously true is a waste of CPU cycles and may even require people to
add redundant stuff to indexes.  I was pondering something related to
this over in the partition-wise join thread (join quals that are
implied by partition constraints and should be discarded).

It'd be interesting to get Amit Langote's feedback, so I CC'd him.
I'd be surprised if he and others haven't got a plan or a patch for
this down the back of the sofa.

I agree that that's a good optimization in the cases it's correct.  Given
that check_index_predicates() already applies the same optimization when
considering using a partial index, it might make sense to try to do the
same even earlier for the table itself using its CHECK / NOT NULL
constraints as predicates (I said *earlier* because
relation_excluded_by_constrains happens for a relation before we look at
its indexes).  Also, at the end of relation_excluded_by_constraints() may
not be such a bad place to do this.

By the way, I read in check_index_predicates() that we should not apply
this optimization if the relation in question is a target of UPDATE /
DELETE / SELECT FOR UPDATE.
Please correct me if I am wrong, but it seems to me that in the case of table constraints it is not necessary to specially handle the update case.
As far as I understand, we need to leave the predicate in the plan in the case of partial indexes because, due to the "read committed" isolation policy, we may need to recheck that the tuple still satisfies the update condition (the tuple can be changed by some other committed transaction while we are waiting for it, and may not satisfy this condition any more).
But no transaction can change a tuple in such a way that it violates table constraints, right? So we do not need to recheck it.


Concerning your suggestion to merge the check_index_predicates() and remove_restrictions_implied_by_constraints() functions: maybe it can be done, but frankly speaking I do not see much sense in it - there are too many differences between these functions and too little code to reuse.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] JIT & function naming

2017-09-03 Thread Konstantin Knizhnik

On 09/03/2017 02:59 AM, Andres Freund wrote:

Hi,

On 2017-08-31 23:41:31 -0700, Andres Freund wrote:

I previously had an early prototype of JITing [1] expression evaluation
and tuple deforming.  I've since then worked a lot on this.

Here's an initial, not really pretty but functional, submission.

One of the things I'm not really happy about yet is the naming of the
generated functions. Those primarily matter when doing profiling, where
the function name will show up when the profiler supports JIT stuff
(e.g. with a patch I proposed to LLVM that emits perf compatible output,
there's also existing LLVM support for a profiler by intel and
oprofile).

Currently there's essentially a per EState counter and the generated
functions get named deform$n and evalexpr$n. That allows for profiling
of a single query, because different compiled expressions are
disambiguated. It even allows to run the same query over and over, still
giving meaningful results.  But it breaks down when running multiple
queries while profiling - evalexpr0 can mean something entirely
different for different queries.

The best idea I have so far would be to name queries like
evalexpr_$fingerprint_$n, but for that we'd need fingerprinting support
outside of pg_stat_statement, which seems painful-ish.

Perhaps somebody has a better idea?


As far as I understand, we do not need a precise fingerprint.
So maybe just calculate some lightweight fingerprint?
For example, take the query text (es_sourceText from EState), replace all non-alphanumeric characters and spaces with '_', and take the first N (16?) characters of the result?
It seems to me that in most cases it will help to identify the query...
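
Just to show the idea (the real implementation would of course be done in C on es_sourceText; this SQL is only an illustration with a made-up query text):

SELECT left(regexp_replace('select count(*) from orders where price > 10',
                           '[^a-zA-Z0-9]', '_', 'g'),
            16) AS fingerprint;
-- yields something like: select_count____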


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Secondary index access optimizations

2017-09-02 Thread Konstantin Knizhnik

On 09/02/2017 06:44 AM, Thomas Munro wrote:

On Wed, Aug 16, 2017 at 9:23 PM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

postgres=# explain select * from bt where k between 1 and 20000 and v = 100;
   QUERY PLAN
--
  Append  (cost=0.29..15.63 rows=2 width=8)
->  Index Scan using dti1 on dt1  (cost=0.29..8.30 rows=1 width=8)
  Index Cond: (v = 100)
->  Index Scan using dti2 on dt2  (cost=0.29..7.33 rows=1 width=8)
  Index Cond: (v = 100)
  Filter: (k <= 20000)
(6 rows)

+1

This seems like a good feature to me: filtering stuff that is
obviously true is a waste of CPU cycles and may even require people to
add redundant stuff to indexes.  I was pondering something related to
this over in the partition-wise join thread (join quals that are
implied by partition constraints and should be discarded).

It'd be interesting to get Amit Langote's feedback, so I CC'd him.
I'd be surprised if he and others haven't got a plan or a patch for
this down the back of the sofa.

I might be missing some higher level architectural problems with the
patch, but for what it's worth here's some feedback after a first read
through:

--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1441,6 +1441,20 @@ relation_excluded_by_constraints(PlannerInfo *root,
  if (predicate_refuted_by(safe_constraints, rel->baserestrictinfo, false))
  return true;

+/*
+ * Remove from restrictions list items implied by table constraints
+ */
+safe_restrictions = NULL;
+foreach(lc, rel->baserestrictinfo)
+{
+RestrictInfo *rinfo = (RestrictInfo *) lfirst(lc);

I think the new way to write this is "RestrictInfo *rinfo =
lfirst_node(RestrictInfo, lc)".

+if (!predicate_implied_by(list_make1(rinfo->clause),
safe_constraints, false)) {
+safe_restrictions = lappend(safe_restrictions, rinfo);
+}
+}
+rel->baserestrictinfo = safe_restrictions;

It feels wrong to modify rel->baserestrictinfo in
relation_excluded_by_constraints().  I think there should probably be
a function with a name that more clearly indicates that it mutates its
input, and you should call that separately after
relation_excluded_by_constraints().  Something like
remove_restrictions_implied_by_constraints()?


It is because operator_predicate_proof is not able to understand that k < 20001 and k <= 20000 are equivalent for integer type.

[...]

  /*
   * operator_predicate_proof
  if (clause_const->constisnull)
  return false;

+if (!refute_it
+&& ((pred_op == Int4LessOrEqualOperator && clause_op ==
Int4LessOperator)
+|| (pred_op == Int8LessOrEqualOperator && clause_op ==
Int8LessOperator)
+|| (pred_op == Int2LessOrEqualOperator && clause_op ==
Int2LessOperator))
+&& pred_const->constbyval && clause_const->constbyval
+&& pred_const->constvalue + 1 == clause_const->constvalue)
+{
+return true;
+}
+

I'm less sure about this part.  It seems like a slippery slope.

A couple of regression test failures:

  inherit  ... FAILED
  rowsecurity  ... FAILED

  2 of 179 tests failed.


I didn't try to understand the rowsecurity one, but at first glance
all the differences reported in the inherit test are in fact cases
where your patch is doing the right thing and removing redundant
filters from scans.  Nice!


Thank you for the review.
I attached a new version of the patch with a remove_restrictions_implied_by_constraints() function.
Concerning the failed tests - this is actually the result of this optimization: extra filter conditions are removed from query plans.
Sorry, I had not included the updated versions of the expected test output files in the patch. Now I have done it.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index 2d7e1d8..5de67ce 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -344,6 +344,7 @@ set_rel_size(PlannerInfo *root, RelOptInfo *rel,
 		switch (rel->rtekind)
 		{
 			case RTE_RELATION:
+remove_restrictions_implied_by_constraints(root, rel, rte);
 if (rte->relkind == RELKIND_FOREIGN_TABLE)
 {
 	/* Foreign table */
@@ -1047,6 +1048,7 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
 			set_dummy_rel_pathlist(childrel);
 			continue;
 		}
+		remove_restrictions_implied_by_constraints(root, childrel, childRTE);
 
 		/*
 		 * CE failed, so finish copying/modifying targetlist and join quals.
diff --git a/src/backend/optimizer/util/plancat.c b/src/backend/op

Re: [HACKERS] Surjective functional indexes

2017-09-01 Thread Konstantin Knizhnik


On 01.09.2017 09:25, Simon Riggs wrote:

On 1 September 2017 at 05:40, Thomas Munro
<thomas.mu...@enterprisedb.com> wrote:

On Fri, Jun 9, 2017 at 8:08 PM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

Attached please find rebased version of the patch.
Now "projection" attribute is used instead of surjective/injective.

Hi Konstantin,

This still applies but it doesn't compile after commits 2cd70845 and
c6293249.  You need to change this:

   Form_pg_attribute att = RelationGetDescr(indexDesc)->attrs[i];

... to this:

   Form_pg_attribute att = TupleDescAttr(RelationGetDescr(indexDesc), i);

Thanks!

Does the patch work fully with that change? If so, I will review.


Attached please find rebased version of the patch.
Yes, I checked that it works after this fix.
Thank you in advance for review.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 83ee7d3..52189ac 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -294,8 +294,33 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] 
 The optional WITH clause specifies storage
 parameters for the index.  Each index method has its own set of allowed
-storage parameters.  The B-tree, hash, GiST and SP-GiST index methods all
-accept this parameter:
+storage parameters. All indexes accept the following parameter:
+   
+
+   
+   
+projection
+
+ 
+   A functional index is based on a projection function: a function which extracts a subset of its argument.
+   In mathematics such functions are called non-injective. For an injective function, if any attribute used in the indexed
+   expression is changed, then the value of the index expression is also changed. So to check whether the index is affected by the
+   update, it is enough to check the set of changed fields. By default this parameter is assigned the value true and the function is considered
+   non-injective.
+   In this case a change of any indexed key does not mean that the value of the function is changed. For example,
+   the expression (bookinfo->'isbn') defined
+   for a column of JSON type changes only when the ISBN changes, which rarely happens. The same is true for most
+   functional indexes. For non-injective functions, Postgres compares the values of the indexed expression for the old and updated tuple and updates
+   the index only when the function results are different. This allows eliminating the index update and using a HOT update.
+   But there are extra evaluations of the function. So if the function is expensive, or the probability that a change of the indexed column will not affect
+   the function value is small, then marking the index as a projection may increase update speed.
+
+
+   
+   
+
+   
+ The B-tree, hash, GiST and SP-GiST index methods all accept this parameter:

 

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index ec10762..b73165f 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -130,6 +130,15 @@ static relopt_bool boolRelOpts[] =
 	},
 	{
 		{
+			"projection",
+			"Evaluate functional index expression on update to check if its values is changed",
+			RELOPT_KIND_INDEX,
+			AccessExclusiveLock
+		},
+		true
+	},
+	{
+		{
 			"security_barrier",
 			"View acts as a row security barrier",
 			RELOPT_KIND_VIEW,
@@ -1301,7 +1310,7 @@ fillRelOptions(void *rdopts, Size basesize,
 break;
 			}
 		}
-		if (validate && !found)
+		if (validate && !found && options[i].gen->kinds != RELOPT_KIND_INDEX)
 			elog(ERROR, "reloption \"%s\" not found in parse table",
  options[i].gen->name);
 	}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e29c5ad..05e372f 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -56,6 +56,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/index.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "port/atomics.h"
@@ -74,7 +75,9 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
-
+#include "utils/memutils.h"
+#include "nodes/execnodes.h"
+#include "executor/executor.h"
 
 /* GUC variable */
 bool		synchronize_seqscans = true;
@@ -126,6 +129,7 @@ static bool ConditionalMultiXactIdWait(MultiXactId multi, MultiXactStatus status
 static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
 static HeapTuple ExtractReplicaIdentity(Relation rel, HeapTuple tup, bool key_modified,
 	

Re: [HACKERS] [PATCH] Push limit to sort through a subquery

2017-08-23 Thread Konstantin Knizhnik



On 22.08.2017 17:27, Konstantin Knizhnik wrote:



On 18.08.2017 04:33, Robert Haas wrote:


It seems like a somewhat ad-hoc approach; it supposes that we can 
take any query produced by deparseSelectStmtForRel() and stick a 
LIMIT clause onto the very end and all will be well.  Maybe that's 
not a problematic assumption, not sure.  The grammar happens to allow 
both FOR UPDATE LIMIT n and LIMIT n FOR UPDATE even though only the 
latter syntax is documented.


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



I am not absolutely sure that it is possible to append a "LIMIT n" clause to any query which can be constructed by postgres_fdw for a foreign scan.
But I also do not know of an example where it is not possible. As you have mentioned, "FOR UPDATE LIMIT n" is currently recognized by Postgres.




I have inspected the deparseSelectStmtForRel function and now I am sure that appending LIMIT to the SQL statement generated by this function will not cause any problems.

It can produce only the following subset of SELECT:

SELECT <target list> FROM <table references> [ GROUP BY ... [ HAVING ... ] ] [ ORDER BY ... ] [ FOR UPDATE ... ];

The only suspicious clause is FOR UPDATE, but I have checked that "FOR UPDATE ... LIMIT n" is really accepted by the Postgres parser.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

2017-08-22 Thread Konstantin Knizhnik



On 18.08.2017 04:33, Robert Haas wrote:


It seems like a somewhat ad-hoc approach; it supposes that we can take 
any query produced by deparseSelectStmtForRel() and stick a LIMIT 
clause onto the very end and all will be well.  Maybe that's not a 
problematic assumption, not sure.  The grammar happens to allow both 
FOR UPDATE LIMIT n and LIMIT n FOR UPDATE even though only the latter 
syntax is documented.


--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



I am not absolutely sure that it is possible to append a "LIMIT n" clause to any query which can be constructed by postgres_fdw for a foreign scan.
But I also do not know of an example where it is not possible. As you have mentioned, "FOR UPDATE LIMIT n" is currently recognized by Postgres.


Can you suggest how to implement limit push-down for FDWs in a better way?
Move deparseSelectStmtForRel() from postgresGetForeignPlan to postgresIterateForeignScan?
That seems problematic because much of the information required by deparseSelectStmtForRel is not available in postgresIterateForeignScan.
In principle, it is possible to somehow propagate it there, but from my point of view that is not the right approach...

IMHO there is some contradiction in the Postgres optimizer in that static information about the limit is not taken into account at the planning stage and is actually used only during query execution, when the pass_down_bound() function is called to propagate knowledge about the limit down through the plan nodes. Certainly I understand that this gives more flexibility: we can use information from previous steps of query execution which was not available at the planning stage.

But pushing down the limit at the planning stage requires too many changes. And the proposed patch is very small and non-invasive. And in principle it can be used not only by postgres_fdw, but also in other FDW implementations to push down information about LIMIT.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] [PATCH] Push limit to sort through a subquery

2017-08-17 Thread Konstantin Knizhnik



On 29.04.2017 00:13, Douglas Doole wrote:


If you add this to the commitfest app, more people might look at
it when the next commitfest opens.


I have added it. https://commitfest.postgresql.org/14/1119/

Also, it might help if you can provide a query/ies with numbers
where this optimization shows improvement.


I can't provide the real queries where we encountered the problem 
because they are internal. However I showed a simplified version of 
the queries in my first post.


On our queries, the change made quite a difference - execution time 
dropped from 31.4 seconds to 7.2 seconds. Explain analyze also shows 
that memory use dropped significantly and we didn't have to spill the 
sort to disk


From:

-> Sort (cost=989.95..1013.27 rows=9326 width=30) 
(node_startup_time/loop=31328.891, node_total_time/loop: 31329.756 
rows=2001 loops=1) Buffers: temp read=772 written=11201 lsm_bufmgr 
hits=3392 Sort Key: *** Sort Method: external merge Sort Space Used: 
89592 Sort Space Type: Disk


To:

-> Sort (cost=989.95..1013.27 rows=9326 width=30) 
(node_startup_time/loop=7123.275, node_total_time/loop: 7123.504 
rows=2001 loops=1) Buffers: lsm_bufmgr hits=3387 Sort Key: *** Sort 
Method: top-N heapsort Sort Space Used: 3256 Sort Space Type: Memory


Attached please find yet another small patch which pushes down LIMIT to ForeignScan.
I should note that currently the Postgres optimizer uses "Merge Append" and fetches only the required number of tuples from the remote nodes.
So even without LIMIT push-down, postgres_fdw will not pull the whole table from the remote host.
postgres_fdw uses a cursor for fetching data from the remote server. The default fetch size is 100, so even without a limit the remote query will fetch no more than 100 rows at a time from the remote site.
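
For completeness, that batch size is configurable through postgres_fdw's fetch_size option at the server or table level (the values below are only an example):

ALTER SERVER shard1 OPTIONS (ADD fetch_size '100');
ALTER FOREIGN TABLE t_fdw1 OPTIONS (ADD fetch_size '1000');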


Assume the following example:

postgres=# create extension postgres_fdw;
postgres=# create server shard1  FOREIGN DATA WRAPPER postgres_fdw 
options(dbname 'postgres', host 'localhost', port '5432');
postgres=# create server shard2  FOREIGN DATA WRAPPER postgres_fdw 
options(dbname 'postgres', host 'localhost', port '5432');
postgres=# CREATE USER MAPPING for $user SERVER shard1 options (user 
'$user');
postgres=# CREATE USER MAPPING for $user SERVER shard2 options (user '$user');

postgres=# CREATE TABLE t(u integer primary key, v integer);
postgres=# CREATE TABLE t1(u integer primary key, v integer);
postgres=# CREATE TABLE t2(u integer primary key, v integer);
postgres=# insert into t1 values (generate_series(1,100000), random()*10);
postgres=# insert into t2 values (generate_series(1,100000), random()*10);
postgres=# CREATE FOREIGN TABLE t_fdw1() inherits (t) server shard1 
options(table_name 't1');
postgres=# CREATE FOREIGN TABLE t_fdw2() inherits (t) server shard2 
options(table_name 't2');



postgres=# explain analyze select * from t order by u limit 1;
  QUERY PLAN
---
 Limit  (cost=200.15..200.20 rows=1 width=8) (actual time=2.010..2.010 
rows=1 loops=1)
   ->  Merge Append  (cost=200.15..449.39 rows=5121 width=8) (actual 
time=2.009..2.009 rows=1 loops=1)

 Sort Key: t.u
 ->  Index Scan using t_pkey on t  (cost=0.12..8.14 rows=1 
width=8) (actual time=0.005..0.005 rows=0 loops=1)
         ->  Foreign Scan on t_fdw2  (cost=100.00..193.92 rows=2560 width=8) (actual time=1.074..1.074 rows=1 loops=1)
         ->  Foreign Scan on t_fdw1  (cost=100.00..193.92 rows=2560 width=8) (actual time=0.928..0.928 rows=1 loops=1)

 Planning time: 0.769 ms
 Execution time: 6.837 ms
(8 rows)

As you can see foreign scan fetches only one row from each remote node.

But still pushing down limit can have positive effect on performance, 
especially if SORT can be replaced with TOP-N.

I got the following results (time in seconds):

Query                              | original | limit push down
-----------------------------------+----------+----------------
select * from t order by u limit 1 |    2.276 |           1.777
select * from t order by v limit 1 |      100 |              42


There is an index on "u", so fetching the records with the smallest "u" values can be done without sorting, so the times are similar.
But in the case of sorting by "v", pushing down the limit allows using TOP-1 instead of a global sort, and it reduces query execution time by more than 2 times.


--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 080cb0a..e3847ce 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -2949,7 +2949,8 @@ create_cursor(ForeignScanState *node)
 	initStringInfo(&buf);
 	appendStringInfo(&buf, "DECLARE c%u CURSOR FOR\n%s",
 					 fsstate->cursor_number, fsstate->query);
-
+	if (node->limit > 0)
+		appendStringInfo(&buf, " LIMIT %l

Re: [HACKERS] Secondary index access optimizations

2017-08-16 Thread Konstantin Knizhnik

On 14.08.2017 19:33, Konstantin Knizhnik wrote:



On 14.08.2017 12:37, Konstantin Knizhnik wrote:

Hi hackers,

I am trying to compare different ways of optimizing work with huge 
append-only tables in PostgreSQL where primary key is something like 
timestamp and queries are usually accessing most recent data using 
some secondary keys. Size of secondary index is one of the most 
critical factors limiting  insert/search performance. As far as data 
is inserted in timestamp ascending order, access to primary key is 
well localized and accessed tables are present in memory. But if we 
create secondary key for the whole table, then access to it will 
require random reads from the disk and significantly decrease 
performance.


There are two well known solutions to the problem:
1. Table partitioning
2. Partial indexes

These are the approaches I want to compare. First of all I want to check whether the optimizer is able to generate an efficient query execution plan covering different time intervals.

Unfortunately, in both cases the generated plan is not optimal.

1. Table partitioning:

create table base (k integer primary key, v integer);
create table part1 (check (k between 1 and 10000)) inherits (base);
create table part2 (check (k between 10001 and 20000)) inherits (base);
create index pi1 on part1(v);
create index pi2 on part2(v);
insert into part1 values (generate_series(1,10000), random());
insert into part2 values (generate_series(10001,20000), random());
explain select * from base where k between 1 and 20000 and v = 100;
                              QUERY PLAN
-----------------------------------------------------------------------
 Append  (cost=0.00..15.65 rows=3 width=8)
   ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=8)
         Filter: ((k >= 1) AND (k <= 20000) AND (v = 100))
   ->  Index Scan using pi1 on part1  (cost=0.29..8.31 rows=1 width=8)
         Index Cond: (v = 100)
         Filter: ((k >= 1) AND (k <= 20000))
   ->  Index Scan using pi2 on part2  (cost=0.29..7.34 rows=1 width=8)
         Index Cond: (v = 100)
         Filter: ((k >= 1) AND (k <= 20000))

Questions:
- Is there some way to avoid the sequential scan of the parent table? Yes, it is empty and so the sequential scan will not take much time, but ... it still requires some additional actions and so increases query execution time.
- Why does the index scan of the partition indexes include the filter condition if the check constraint guarantees that all records of this partition match the search predicate?



2. Partial indexes:

create table t (k integer primary key, v integer);
insert into t values (generate_series(1,20000),random());
create index i1 on t(v) where k between 1 and 10000;
create index i2 on t(v) where k between 10001 and 20000;
postgres=# explain select * from t where k between 1 and 10000 and v = 100;

 QUERY PLAN

 Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)
   Index Cond: (v = 100)
(2 rows)


Here we get perfect plan. Let's try to extend search interval:


postgres=# explain select * from t where k between 1 and 20000 and v = 100;

QUERY PLAN
--
 Index Scan using t_pkey on t  (cost=0.29..760.43 rows=1 width=8)
   Index Cond: ((k >= 1) AND (k <= 20000))
   Filter: (v = 100)
(3 rows)

Unfortunately in this case Postgres is not able to apply partial 
indexes.

And this is what I expected to get:

postgres=# explain select * from t where k between 1 and 10000 and v = 100 union all select * from t where k between 10001 and 20000 and v = 100;

  QUERY PLAN
--
 Append  (cost=0.29..14.58 rows=2 width=8)
   ->  Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)
 Index Cond: (v = 100)
   ->  Index Scan using i2 on t t_1  (cost=0.29..7.28 rows=1 width=8)
 Index Cond: (v = 100)


I wonder if there are some fundamental problems in supporting these two things in the optimizer:
1. Remove the search condition on the primary key if it is fully satisfied by the derived table's check constraint.
2. Append index scans of several partial indexes if the specified interval is covered by their conditions.

I wonder if someone is familiar with this part of the optimizer and can easily fix it.
Otherwise I am going to spend some time on solving these problems (if the community thinks that such optimizations will be useful).




Replying to myself: the following small patch removes redundant checks 
from index scans for derived tables:



diff --git a/src/backend/optimizer/util/plancat.c 
b/src/backend/optimizer/util/plancat.c

index 939045d..1f7c9cf 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1441,6 +1441,20 @@ relation_excluded_by_constraints(PlannerInfo 
*root,
if (predicate_refuted_by(s

Re: [HACKERS] Secondary index access optimizations

2017-08-14 Thread Konstantin Knizhnik



On 14.08.2017 12:37, Konstantin Knizhnik wrote:

Hi hackers,

I am trying to compare different ways of optimizing work with huge 
append-only tables in PostgreSQL where primary key is something like 
timestamp and queries are usually accessing most recent data using 
some secondary keys. Size of secondary index is one of the most 
critical factors limiting  insert/search performance. As far as data 
is inserted in timestamp ascending order, access to primary key is 
well localized and accessed tables are present in memory. But if we 
create secondary key for the whole table, then access to it will 
require random reads from the disk and significantly decrease 
performance.


There are two well known solutions to the problem:
1. Table partitioning
2. Partial indexes

These are the approaches I want to compare. First of all I want to check whether the optimizer is able to generate an efficient query execution plan covering different time intervals.

Unfortunately, in both cases the generated plan is not optimal.

1. Table partitioning:

create table base (k integer primary key, v integer);
create table part1 (check (k between 1 and 10000)) inherits (base);
create table part2 (check (k between 10001 and 20000)) inherits (base);
create index pi1 on part1(v);
create index pi2 on part2(v);
insert into part1 values (generate_series(1,10000), random());
insert into part2 values (generate_series(10001,20000), random());
explain select * from base where k between 1 and 20000 and v = 100;
                              QUERY PLAN
-----------------------------------------------------------------------
 Append  (cost=0.00..15.65 rows=3 width=8)
   ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=8)
         Filter: ((k >= 1) AND (k <= 20000) AND (v = 100))
   ->  Index Scan using pi1 on part1  (cost=0.29..8.31 rows=1 width=8)
         Index Cond: (v = 100)
         Filter: ((k >= 1) AND (k <= 20000))
   ->  Index Scan using pi2 on part2  (cost=0.29..7.34 rows=1 width=8)
         Index Cond: (v = 100)
         Filter: ((k >= 1) AND (k <= 20000))

Questions:
- Is there some way to avoid the sequential scan of the parent table? Yes, it is empty and so the sequential scan will not take much time, but ... it still requires some additional actions and so increases query execution time.
- Why does the index scan of the partition indexes include the filter condition if the check constraint guarantees that all records of this partition match the search predicate?



2. Partial indexes:

create table t (k integer primary key, v integer);
insert into t values (generate_series(1,20000),random());
create index i1 on t(v) where k between 1 and 10000;
create index i2 on t(v) where k between 10001 and 20000;
postgres=# explain select * from t where k between 1 and 10000 and v = 100;

 QUERY PLAN

 Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)
   Index Cond: (v = 100)
(2 rows)


Here we get perfect plan. Let's try to extend search interval:


postgres=# explain select * from t where k between 1 and 20000 and v = 100;

QUERY PLAN
--
 Index Scan using t_pkey on t  (cost=0.29..760.43 rows=1 width=8)
   Index Cond: ((k >= 1) AND (k <= 20000))
   Filter: (v = 100)
(3 rows)

Unfortunately in this case Postgres is not able to apply partial indexes.
And this is what I expected to get:

postgres=# explain select * from t where k between 1 and 10000 and v = 100 union all select * from t where k between 10001 and 20000 and v = 100;

  QUERY PLAN
--
 Append  (cost=0.29..14.58 rows=2 width=8)
   ->  Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)
 Index Cond: (v = 100)
   ->  Index Scan using i2 on t t_1  (cost=0.29..7.28 rows=1 width=8)
 Index Cond: (v = 100)


I wonder if there are some fundamental problems in supporting these two things in the optimizer:
1. Remove the search condition on the primary key if it is fully satisfied by the derived table's check constraint.
2. Append index scans of several partial indexes if the specified interval is covered by their conditions.

I wonder if someone is familiar with this part of the optimizer and can easily fix it.
Otherwise I am going to spend some time on solving these problems (if the community thinks that such optimizations will be useful).




Replying to myself: the following small patch removes redundant checks 
from index scans for derived tables:



diff --git a/src/backend/optimizer/util/plancat.c 
b/src/backend/optimizer/util/plancat.c

index 939045d..1f7c9cf 100644
--- a/src/backend/optimizer/util/plancat.c
+++ b/src/backend/optimizer/util/plancat.c
@@ -1441,6 +1441,20 @@ relation_excluded_by_constraints(PlannerInfo *root,
if (predicate_refuted_by(safe_constraints, 
rel->baserestrict

[HACKERS] Secondary index access optimizations

2017-08-14 Thread Konstantin Knizhnik

Hi hackers,

I am trying to compare different ways of optimizing work with huge 
append-only tables in PostgreSQL where primary key is something like 
timestamp and queries are usually accessing most recent data using some 
secondary keys. Size of secondary index is one of the most critical 
factors limiting  insert/search performance. As far as data is inserted 
in timestamp ascending order, access to primary key is well localized 
and accessed tables are present in memory. But if we create secondary 
key for the whole table, then access to it will require random reads 
from the disk and significantly decrease performance.


There are two well known solutions to the problem:
1. Table partitioning
2. Partial indexes

These are the approaches I want to compare. First of all I want to check whether the optimizer is able to generate an efficient query execution plan covering different time intervals.

Unfortunately, in both cases the generated plan is not optimal.

1. Table partitioning:

create table base (k integer primary key, v integer);
create table part1 (check (k between 1 and 10000)) inherits (base);
create table part2 (check (k between 10001 and 20000)) inherits (base);
create index pi1 on part1(v);
create index pi2 on part2(v);
insert into part1 values (generate_series(1,10000), random());
insert into part2 values (generate_series(10001,20000), random());
explain select * from base where k between 1 and 20000 and v = 100;
                              QUERY PLAN
-----------------------------------------------------------------------
 Append  (cost=0.00..15.65 rows=3 width=8)
   ->  Seq Scan on base  (cost=0.00..0.00 rows=1 width=8)
         Filter: ((k >= 1) AND (k <= 20000) AND (v = 100))
   ->  Index Scan using pi1 on part1  (cost=0.29..8.31 rows=1 width=8)
         Index Cond: (v = 100)
         Filter: ((k >= 1) AND (k <= 20000))
   ->  Index Scan using pi2 on part2  (cost=0.29..7.34 rows=1 width=8)
         Index Cond: (v = 100)
         Filter: ((k >= 1) AND (k <= 20000))

Questions:
- Is there some way to avoid the sequential scan of the parent table? Yes, it is empty and so the sequential scan will not take much time, but ... it still requires some additional actions and so increases query execution time.
- Why does the index scan of the partition indexes include the filter condition if the check constraint guarantees that all records of this partition match the search predicate?



2. Partial indexes:

create table t (k integer primary key, v integer);
insert into t values (generate_series(1,20000),random());
create index i1 on t(v) where k between 1 and 10000;
create index i2 on t(v) where k between 10001 and 20000;
postgres=# explain select * from t where k between 1 and 10000 and v = 100;
 QUERY PLAN

 Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)
   Index Cond: (v = 100)
(2 rows)


Here we get perfect plan. Let's try to extend search interval:


postgres=# explain select * from t where k between 1 and 20000 and v = 100;
QUERY PLAN
--
 Index Scan using t_pkey on t  (cost=0.29..760.43 rows=1 width=8)
   Index Cond: ((k >= 1) AND (k <= 20000))
   Filter: (v = 100)
(3 rows)

Unfortunately in this case Postgres is not able to apply partial indexes.
And this is what I expected to get:

postgres=# explain select * from t where k between 1 and 10000 and v = 100 union all select * from t where k between 10001 and 20000 and v = 100;

  QUERY PLAN
--
 Append  (cost=0.29..14.58 rows=2 width=8)
   ->  Index Scan using i1 on t  (cost=0.29..7.28 rows=1 width=8)
 Index Cond: (v = 100)
   ->  Index Scan using i2 on t t_1  (cost=0.29..7.28 rows=1 width=8)
 Index Cond: (v = 100)


I wonder if there are some fundamental problems in supporting these two things in the optimizer:
1. Remove the search condition on the primary key if it is fully satisfied by the derived table's check constraint.
2. Append index scans of several partial indexes if the specified interval is covered by their conditions.

I wonder if someone is familiar with this part of the optimizer and can easily fix it.
Otherwise I am going to spend some time on solving these problems (if the community thinks that such optimizations will be useful).


--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Error : undefined symbol : LWLockAssign in 9.6.3

2017-08-08 Thread Konstantin Knizhnik

On 08/09/2017 07:07 AM, 송기훈 wrote:

Hi.
I'm trying to use imcs module with 9.6 and got this error message. LWLockAssign 
function has been deleted from 9.6. I can't use this module anymore from 9.6.

What I want to ask you something is that your team decides not to support imcs 
module anymore or doesn't concern about imcs module or are there any ways to 
run postgresql in memory only?


Hi,
I am the author of the IMCS module and provide support for it.
Please contact me directly.
I have committed a patch to the https://github.com/knizhnik/imcs.git repository
which allows IMCS to be used with 9.6.3 and later Postgres versions.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] ASOF join

2017-06-21 Thread Konstantin Knizhnik



On 21.06.2017 11:00, Thomas Munro wrote:

Hmm.  Yeah, I see the notational problem.  It's hard to come up with a
new syntax that has SQL nature.  What if... we didn't use a new syntax
at all, but recognised existing queries that are executable with this
strategy?  Queries like this:

WITH ticks(time, price) AS
(VALUES ('2017-07-20 12:00:00'::timestamptz, 100.00),
('2017-07-21 11:00:00'::timestamptz, 150.00)),
  times(time) AS
(VALUES ('2017-07-19 12:00:00'::timestamptz),
('2017-07-20 12:00:00'::timestamptz),
('2017-07-21 12:00:00'::timestamptz),
('2017-07-22 12:00:00'::timestamptz))

SELECT times.time, previous_tick.price
   FROM times
   LEFT JOIN LATERAL (SELECT * FROM ticks
   WHERE ticks.time <= times.time
   ORDER BY ticks.time DESC LIMIT 1) previous_tick ON true
  ORDER BY times.time;

          time          | price
------------------------+--------
  2017-07-19 12:00:00+12 |
  2017-07-20 12:00:00+12 | 100.00
  2017-07-21 12:00:00+12 | 150.00
  2017-07-22 12:00:00+12 | 150.00
(4 rows)

I haven't used LATERAL much myself but I've noticed that it's often
used to express this type of thing.  "Get me the latest ... as of time
...".

It'd a bit like the way we recognise EXISTS (...) as a semi-join and
execute it with a join operator instead of having a SEMI JOIN syntax.
On the other hand it's a bit more long winded, extreme and probably
quite niche.
Thank you for this idea. I agree that it is the best way of implementing 
ASOF join - just as an optimization of a standard SQL query.
But do you think it would still be a good idea to extend the SQL syntax 
with an ASOF JOIN ... USING ... clause? It would significantly simplify 
writing queries like the one above
and IMHO doesn't introduce any confusion with standard SQL syntax. My 
primary motivation for suggesting ASOF join for Postgres was not just building 
a more efficient plan (using a merge join instead of a nested loop) but also 
simplifying the writing of such queries. Or do you think that nobody will be 
interested in non-standard SQL extensions?
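
For example, under the proposed syntax your query above might be written 
roughly like this (a hedged illustration of the intended notation only, not 
output of the current patch):

    SELECT times.time, ticks.price
      FROM times ASOF JOIN ticks USING (time)
     ORDER BY times.time;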


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] ASOF join

2017-06-19 Thread Konstantin Knizhnik



On 16.06.2017 19:07, David Fetter wrote:

On Fri, Jun 16, 2017 at 11:51:34AM +1200, Thomas Munro wrote:

On Fri, Jun 16, 2017 at 4:20 AM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

I wonder if there were some discussion/attempts to add ASOF join to Postgres
(sorry, may be there is better term for it, I am refereeing KDB definition:
http://code.kx.com/wiki/Reference/aj ).

Interesting idea.  Also in Pandas:

http://pandas.pydata.org/pandas-docs/version/0.19.0/generated/pandas.merge_asof.html#pandas.merge_asof


I attached a simple patch adding ASOF join to Postgres. Right now it 
supports only outer join and requires a USING clause (consequently it is 
not possible to join two tables whose join keys have different names). 
Maybe it is also possible to support an ON clause with a condition written 
like o.k1 = i.k1 AND o.k2 = i.k2 AND ... AND o.kN >= i.kN.
But such notation can be confusing, because the join result includes only 
one matching inner record with kN smaller than or equal to the kN of the 
outer record, and not all such records.

As alternative we can add specia

If people find such a construct really useful, I will continue working on it.




I suppose you could write a function that pulls tuples out of a bunch
of cursors and zips them together like this, as a kind of hand-coded
special merge join "except that we match on nearest key rather than
equal keys" (as they put it).

I've written code like this before in a trading context, where we
called that 'previous tick interpolation', and in a scientific context
where other kinds of interpolation were called for (so not really
matching a tuple but synthesising one if no exact match).  If you view
the former case as a kind of degenerate case of interpolation then it
doesn't feel like a "join" as we know it, but clearly it is.  I had
never considered before that such things might belong inside the
database as a kind of join operator.

If you turn your head sideways, it's very similar to the range merge
join Jeff Davis proposed.  https://commitfest.postgresql.org/14/1106/


Maybe, but I do not understand how to limit the result to contain exactly 
one (the last) inner tuple for each outer tuple.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/contrib/postgres_fdw/deparse.c b/contrib/postgres_fdw/deparse.c
index 482a3dd..f7a8f38 100644
--- a/contrib/postgres_fdw/deparse.c
+++ b/contrib/postgres_fdw/deparse.c
@@ -1324,6 +1324,9 @@ get_jointype_name(JoinType jointype)
 		case JOIN_FULL:
 			return "FULL";
 
+		case JOIN_ASOF:
+			return "ASOF";
+
 		default:
 			/* Shouldn't come here, but protect from buggy code. */
 			elog(ERROR, "unsupported join type %d", jointype);
diff --git a/contrib/postgres_fdw/postgres_fdw.c b/contrib/postgres_fdw/postgres_fdw.c
index 080cb0a..54cf6c1 100644
--- a/contrib/postgres_fdw/postgres_fdw.c
+++ b/contrib/postgres_fdw/postgres_fdw.c
@@ -4073,7 +4073,7 @@ foreign_join_ok(PlannerInfo *root, RelOptInfo *joinrel, JoinType jointype,
 	 * Constructing queries representing SEMI and ANTI joins is hard, hence
 	 * not considered right now.
 	 */
-	if (jointype != JOIN_INNER && jointype != JOIN_LEFT &&
+	if (jointype != JOIN_INNER && jointype != JOIN_LEFT && jointype != JOIN_ASOF && 
 		jointype != JOIN_RIGHT && jointype != JOIN_FULL)
 		return false;
 
@@ -4211,6 +4211,7 @@ foreign_join_ok(PlannerInfo *root, RelOptInfo *joinrel, JoinType jointype,
 			break;
 
 		case JOIN_LEFT:
+		case JOIN_ASOF:
 			fpinfo->joinclauses = list_concat(fpinfo->joinclauses,
 		  list_copy(fpinfo_i->remote_conds));
 			fpinfo->remote_conds = list_concat(fpinfo->remote_conds,
diff --git a/doc/src/sgml/ref/select.sgml b/doc/src/sgml/ref/select.sgml
index 211e4c3..fd3be8c 100644
--- a/doc/src/sgml/ref/select.sgml
+++ b/doc/src/sgml/ref/select.sgml
@@ -514,6 +514,9 @@ TABLE [ ONLY ] table_name [ * ]
  
   CROSS JOIN
  
+ 
+  ASOF [ OUTER ] JOIN
+ 
 
 
 For the INNER and OUTER join types, a
@@ -523,7 +526,9 @@ TABLE [ ONLY ] table_name [ * ]
 USING (join_column [, ...]).
 See below for the meaning.  For CROSS JOIN,
-none of these clauses can appear.
+none of these clauses can appear. For ASOF join type, a
+join condition must be USING (join_column [, ...]).

 

@@ -571,6 +576,32 @@ TABLE [ ONLY ] table_name [ * ]
 on the right), plus one row for each unmatched right-hand row
 (extended with nulls on the left).

+
+   ASOF OUTER JOIN is similar to LEFT OUTER JOIN but it accepts only
+USING (join_column_1 [, ...], join_column_N) clause
+where last joined column join_column_N is expected to be timestamp 
+(but actually can have any comparable type) and outer tuple is matched with only one inner tuple wi

Re: [HACKERS] WIP: Data at rest encryption

2017-06-16 Thread Konstantin Knizhnik



On 16.06.2017 03:08, Bruce Momjian wrote:


Yeah, I guess we will just have to wait to see it since other people are
excited about it.  My concern is code complexity and usability
challenges, vs punting the problem to the operating system, though
admittedly there are some cases where that is not possible.



Let me also share my opinion about encryption and compression support at 
the database level.
PostgresPro Enterprise supports both. I gave a presentation about 
it at PgConf 2016 in Tallinn.
I was a little bit surprised that there were more questions about 
encryption than about compression.
But right now we have several customers who are using compression and 
none of them uses encryption (just because they do not need
to protect their databases). But I am absolutely sure that there are many 
Postgres users who first of all need to protect their data.


Encryption is much easier to implement than compression, because it does 
not change the page size. So I do not see any "complexity and flexibility 
challenges" here.
Just for reference, I have attached our own encryption patch to this mail. I 
do not want to propose it as an alternative to Aasma's patch: it is less 
flexible and doesn't support encryption of WAL, just encryption of 
relation data. Also, it doesn't allow custom encryption libraries: the AES 
implementation is embedded. The encryption key is taken from an environment 
variable. At the Tallinn conference I was informed about a possible security 
issue with passing the key through an environment variable: it is possible to 
inspect the server's environment variables using a plpython/plperl stored 
procedure.
This is why we unset this environment variable after reading it. I am not an 
expert in security, but I do not know of other issues with such a solution.


Concerning the question whether to implement compression/encryption at the 
database level or rely on the OS, my opinion is that there are many 
scenarios where it is not possible or not desirable to use OS-level 
encryption/protection. This first of all includes cloud installations and 
embedded applications. I do not want to repeat arguments already 
mentioned in this thread.
But the fact is that there are many people who really need 
compression/encryption support and they cannot or do not want to 
delegate these aspects to the OS. Almost all DBMSes support 
compression and encryption, so the lack of these features in Postgres 
definitely cannot be considered an advantage of Postgres.


The Postgres buffer manager interface significantly simplifies integration 
of encryption and compression. There is actually a single path through 
which data is fetched from and stored to the disk.
The most obvious and natural solution is to decompress/decrypt data when 
it is read from the disk into the page pool and compress/encrypt it when it 
is written back. Taking into account that memory is cheap now and many 
databases can completely fit in memory, storing pages in the buffer 
cache in plain (decompressed/decrypted) format allows us to minimize the 
overhead of compression/encryption and its influence on performance. For 
read-only queries working with cached data, performance will be exactly 
the same as without encryption/compression.
Write speed for encrypted pages will certainly be slightly worse, but 
encryption speed is still much higher than disk I/O speed.


So I do not think that it is really necessary to support encryption of 
only particular tables, storing "non-secret" data in plain format 
without encryption. It would not cause a noticeable improvement in 
performance, but it may complicate the implementation and increase the 
possibility of leaking secret data.


I do not think that the pluggable storage API is the right approach to 
integrate compression and especially encryption. It is better to plug 
encryption in between the buffer manager and the storage device,
allowing it to be used with any storage implementation. Also, it is not 
clear to me whether encryption of WAL can be provided using the pluggable 
storage API.


The last discussed question is whether it is necessary to encrypt 
temporary data (BufFile). In our solution we encrypt only the main fork of 
non-system relations and do not encrypt temporary relations. It may cause 
some secret data to be stored on disk in non-encrypted 
format. But accessing this data is not trivial. You cannot just 
copy/steal the disk, open the database and do "select * from SecretTable": 
you will have to extract the data from the raw files yourself. So it looks 
like it is better to allow the user to choose whether to encrypt temporary 
data or not.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/storage/file/Makefile b/src/backend/storage/file/Makefile
index d2198f2..9492662 100644
--- a/src/backend/storage/file/Makefile
+++ b/src/backend/storage/file/Makefile
@@ -12,6 +12,6 @@ subdir = src/backend/storage/file
 top_builddir = ../../../..
 include $(top_builddir)/src/Makefile.global
 
-OBJS = fd.o buff

[HACKERS] ASOF join

2017-06-15 Thread Konstantin Knizhnik
I wonder if there have been any discussions/attempts to add ASOF join to 
Postgres (sorry, maybe there is a better term for it; I am referring to the 
KDB definition: http://code.kx.com/wiki/Reference/aj ).
Such a kind of join can be useful when we need to associate two 
timeseries. It is quite popular in trading:


//join the eq price and size for the op trade time
a::aj[`underlyingSym`time;select time, underlyingSym, sym, putorcall, 
ex, cond, price, seqnum, contracts, contractsTraded from t;eqtrades];


...and not only there. Below is one example of how people manually code an 
ASOF join today:


select
count(*),
count(*)
filter (where timedelta_prev < -30),
count(*)
filter (where ride_prev = ride_next),
... -- many different aggregates
from
(
select
p.provider_id,
p.ts,
(
select extract(epoch from t.ts - p.ts)
from providers_positions t
where p.provider_id = t.provider_id and t.ts < p.ts and 
t.source = 'gps'

order by t.ts desc
limit 1
) as timedelta_prev,
(
select extract(epoch from t.ts - p.ts)
from providers_positions t
where p.provider_id = t.provider_id and t.ts > p.ts and 
t.source = 'gps'

order by t.ts
limit 1
) as timedelta,
(
select ride_id
from providers_positions t
where p.provider_id = t.provider_id and t.ts < p.ts and 
t.source = 'gps'

order by t.ts desc
limit 1
) as ride_prev,
(
select ride_id
from providers_positions t
where p.provider_id = t.provider_id and t.ts > p.ts and 
t.source = 'gps'

order by t.ts
limit 1
) as ride_next
from (
 select
 provider_id,
 ts,
 event_name
 from
 lvl2_681_parsed p
 ) p
where
p.event_name = 'GPS signal restored'
   -- offset 0
) z;

Without OFFSET 0 this query generates an awful execution plan with 
hundreds (!) of subplans corresponding to the subqueries.
The number of subplans (most of them are the same) is equal to the number of 
occurrences of the timedelta, timedelta_prev, ... columns in the target 
aggregates.
OFFSET 0 reduces the number of subplans to 4. And I expect that using a 
LATERAL join can reduce it to two, even without the "OFFSET 0" trick.
But in any case it is a very complex and unnatural way of expressing 
this not-so-complex query.

With an ASOF join it can be written much more simply.

Also, Postgres implements this query using a nested loop with index 
scan, which is definitely not the most efficient strategy.
The best way to implement an ASOF join is to use something like a merge join. 
Usually there are indexes on both timeseries, so what we need is to 
merge two ordered sets using ASOF join rules.
It would require minimal changes in SQL syntax; just adding an ASOF keyword 
seems to be enough:


   select  * from Trades ASOF JOIN EqTrades USING (underlyingSym,time);

It seems to me that adding ASOF joins should not require a huge amount of 
work and can be done with a minimal number of changes in the executor and 
optimizer.

But maybe there are some problems/challenges which I do not realize now?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Surjective functional indexes

2017-06-09 Thread Konstantin Knizhnik

Attached please find a rebased version of the patch.
Now the "projection" attribute is used instead of surjective/injective; a 
short usage sketch is below.
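
For example, a possible usage sketch (the table and indexed expression are 
hypothetical; it assumes the reloption is passed through the usual WITH 
clause of CREATE INDEX, like other index storage parameters):

    -- default (projection = true): re-evaluate the indexed expression on update
    create index doc_name_idx on doc ((info->>'name'));
    -- disable the re-evaluation and fall back to the affected-columns check
    create index doc_name_idx2 on doc ((info->>'name')) with (projection = false);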

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/doc/src/sgml/ref/create_index.sgml b/doc/src/sgml/ref/create_index.sgml
index 83ee7d3..b221c18 100644
--- a/doc/src/sgml/ref/create_index.sgml
+++ b/doc/src/sgml/ref/create_index.sgml
@@ -294,8 +294,33 @@ CREATE [ UNIQUE ] INDEX [ CONCURRENTLY ] [ [ IF NOT EXISTS ] 
 The optional WITH clause specifies storage
 parameters for the index.  Each index method has its own set of allowed
-storage parameters.  The B-tree, hash, GiST and SP-GiST index methods all
-accept this parameter:
+storage parameters. All indexes accept the following parameter:
+   
+
+   
+   
+projection
+
+ 
+   Functional index is based on on projection function: function which extract subset of its argument. 
+   In mathematic such functions are called non-injective. For injective function if any attribute used in the indexed 
+   expression is changed, then value of index expression is also changed. So to check that index is affected by the 
+   update, it is enough to check the set of changed fields. By default this parameters is assigned true value and function is considered 
+   as non-injective.
+   In this case change of any of indexed key doesn't mean that value of the function is changed. For example, for 
+   the expression expression(bookinfo-'isbn') defined
+   for column of JSON type is changed only when ISBN is changed, which rarely happen. The same is true for most
+   functional indexes. For non-injective functions, Postgres compares values of indexed expression for old and updated tuple and updates
+   index only when function results are different. It allows to eliminate index update and use HOT update.
+   But there are extra evaluations of the functions. So if function is expensive or probability that change of indexed column will not effect 
+   the function value is small, then marking index as projection may increase update speed.
+
+
+   
+   
+
+   
+ The B-tree, hash, GiST and SP-GiST index methods all accept this parameter:

 

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 6d1f22f..509c647 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -130,6 +130,15 @@ static relopt_bool boolRelOpts[] =
 	},
 	{
 		{
+			"projection",
+			"Evaluate functional index expression on update to check if its values is changed",
+			RELOPT_KIND_INDEX,
+			AccessExclusiveLock
+		},
+		true
+	},
+	{
+		{
 			"security_barrier",
 			"View acts as a row security barrier",
 			RELOPT_KIND_VIEW,
@@ -1301,7 +1310,7 @@ fillRelOptions(void *rdopts, Size basesize,
 break;
 			}
 		}
-		if (validate && !found)
+		if (validate && !found && options[i].gen->kinds != RELOPT_KIND_INDEX)
 			elog(ERROR, "reloption \"%s\" not found in parse table",
  options[i].gen->name);
 	}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e890e08..2be99ab 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -56,6 +56,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/index.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -73,7 +74,9 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
-
+#include "utils/memutils.h"
+#include "nodes/execnodes.h"
+#include "executor/executor.h"
 
 /* GUC variable */
 bool		synchronize_seqscans = true;
@@ -124,6 +127,7 @@ static bool ConditionalMultiXactIdWait(MultiXactId multi, MultiXactStatus status
 static XLogRecPtr log_heap_new_cid(Relation relation, HeapTuple tup);
 static HeapTuple ExtractReplicaIdentity(Relation rel, HeapTuple tup, bool key_modified,
 	   bool *copy);
+static bool ProjectionIsNotChanged(Relation relation, HeapTuple oldtup, HeapTuple newtup);
 
 
 /*
@@ -3533,8 +3537,6 @@ heap_update(Relation relation, ItemPointer otid, HeapTuple newtup,
 	key_attrs = RelationGetIndexAttrBitmap(relation, INDEX_ATTR_BITMAP_KEY);
 	id_attrs = RelationGetIndexAttrBitmap(relation,
 		  INDEX_ATTR_BITMAP_IDENTITY_KEY);
-
-
 	block = ItemPointerGetBlockNumber(otid);
 	buffer = ReadBuffer(relation, block);
 	page = BufferGetPage(buffer);
@@ -4161,8 +4163,12 @@ l2:
 		 * changed. If the page was already full, we may have skipped checking
 		 * for index columns. If so, HOT update is possible.
 		 */
-		if (hot_attrs_checked && !bms_overlap(modifie

Re: [HACKERS] Proposal : For Auto-Prewarm.

2017-05-30 Thread Konstantin Knizhnik

On 27.10.2016 14:39, Mithun Cy wrote:

# pg_autoprewarm.

This a PostgreSQL contrib module which automatically dump all of the 
blocknums
present in buffer pool at the time of server shutdown(smart and fast 
mode only,
to be enhanced to dump at regular interval.) and load these blocks 
when server restarts.


Design:
--
We have created a BG Worker Auto Pre-warmer which during shutdown 
dumps all the

blocknum in buffer pool in sorted order.
Format of each entry is 
<DatabaseId,TableSpaceId,RelationId,Forknum,BlockNum>.
Auto Pre-warmer is started as soon as the postmaster is started we do 
not wait
for recovery to finish and database to reach a consistent state. If 
there is a

"dump_file" to load we start loading each block entry to buffer pool until
there is a free buffer. This way we do not replace any new blocks 
which was
loaded either by recovery process or querying clients. Then it waits 
until it receives

SIGTERM to dump the block information in buffer pool.

HOW TO USE:
---
Build and add the pg_autoprewarm to shared_preload_libraries. Auto 
Pre-warmer
process automatically do dumping of buffer pool's block info and load 
them when

restarted.

TO DO:
--
Add functionality to dump based on timer at regular interval.
And some cleanups.


I wonder if you have considered parallel prewarming of a table?
Right now, with either pg_prewarm or pg_autoprewarm, preloading of a 
table's data is performed by one backend.
It certainly makes sense if there is just one HDD and we want to 
minimize the impact of pg_prewarm on normal DBMS activity.
But sometimes we need to load data into memory as soon as possible. And 
modern systems have a larger number of CPU cores, and

RAID devices make it possible to efficiently load data in parallel.

I have asked this question in the context of my CFS (compressed file system) 
for Postgres. The customer's complaint was that there are 64 cores in 
his system, but when
he is building an index, decompression of heap data is performed by only 
one core. This is why I thought about prewarm... (parallel index 
construction is a separate story...)


pg_prewarm makes it possible to specify a range of blocks, so, in 
principle, it is possible to manually preload a table in parallel by 
spawning pg_prewarm
with different subranges in several backends (see the sketch below). But it 
is definitely not a user-friendly approach.
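
Just to illustrate (a hedged sketch; the table name is hypothetical and the 
block boundaries would have to be derived from pg_class.relpages):

    -- session 1
    select pg_prewarm('big_table', 'buffer', 'main', 0, 499999);
    -- session 2, started concurrently
    select pg_prewarm('big_table', 'buffer', 'main', 500000, 999999);
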
And as far as I understand, pg_autoprewarm has all the necessary 
infrastructure to do a parallel load. We just need to spawn more than one 
background worker and specify a separate block range for each worker.

Do you think that such functionality (parallel autoprewarm) could be 
useful and easily added?


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Surjective functional indexes

2017-05-30 Thread Konstantin Knizhnik



On 29.05.2017 20:21, Christoph Berg wrote:


I think the term you were looking for is "projection".

https://en.wikipedia.org/wiki/Projection_(set_theory)


I have already renamed the parameter from "surjective" to "injective".
But I am OK with doing one more renaming to "projection" if it is 
considered a better alternative.
From my point of view, "projection" seems to be clearer for people 
without a mathematical background,
but IMHO this term is overloaded in the DBMS context. The irony is that in 
Wikipedia "projection" is explained using the term "surjection" :)




Christoph


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Surjective functional indexes

2017-05-29 Thread Konstantin Knizhnik

On 05/27/2017 09:50 PM, Peter Eisentraut wrote:

On 5/25/17 12:30, Konstantin Knizhnik wrote:

Functions like (info->>'name') are named "surjective" ni mathematics.

A surjective function is one where each value in the output type can be
obtained by some input value.  That's not what you are after here.  The
behavior you are describing is a not-injective function.

I think you are right that in practice most functions are not injective.
  But I think there is still quite some difference between a function
like the one you showed that selects a component from a composite data
structure and, for example, round(), where in practice any update is
likely to change the result of the function.


Thank you, I will rename the "surjective" parameter to "injective" with 
"false" as the default value.
Concerning "round" and other similar functions - obviously there are use 
cases when such functions are used for
functional indexes. This is why I want to allow the user to make a choice, 
and this is the reason for introducing this parameter.
The question is the default value of this parameter: should we by default 
preserve the original Postgres behavior
(check only the affected set of keys), or should we pay the extra cost of 
calculating the value of the function (even if we manage to store the
calculated value of the index expression for the new tuple, we still have to 
calculate it for the old tuple, so the function will be calculated
at least twice as many times)?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





[HACKERS] Logical replication & corrupted pages recovery

2017-05-26 Thread Konstantin Knizhnik
Several PgPro customers, who are evaluating our multimaster, are 
interested in the possibility of recovering corrupted pages from other 
cluster nodes.
This task seems to be more general and is not multimaster-specific. This 
is why I want to discuss it here.


With physical (streaming) replication the content of the master and replica 
databases is identical, so it is quite easy to restore a corrupted page 
from the replica by just copying the corresponding file or part of the file. 
With logical replication the content of database pages on disk may be 
different even though the data is logically identical.
If some heap page is corrupted, then there is no simple and 
efficient way to determine which records were located on this page.

Clustered indexes could help, but that is a long story...

So my question is whether there is currently some efficient way to 
synchronize two tables?
If not, are there any plans to provide such functionality in logical 
replication in the future?


Right now, the only approach which comes to my mind is to extract all 
primary keys at the two nodes, exchange them between the nodes, find the 
missing tuples by comparing the two ordered sets of keys and request them 
from the other node. It is based on the assumption that Postgres just skips 
records from the corrupted pages.
The drawback of this approach is that it will be very slow and cause 
large network traffic for huge tables.
Maybe it is possible to somehow optimize it by checking ranges of 
primary key values
(if the number of records in a range is the same at both nodes, then the 
ranges can be considered identical and not compared), for example as 
sketched below.
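
Just to illustrate the idea (a hedged sketch; the table and column names are 
hypothetical and the range width is arbitrary):

    -- run on both nodes and compare the per-range counts;
    -- only ranges with differing counts need a key-by-key comparison
    select id / 10000 as key_range, count(*)
    from some_replicated_table
    group by 1
    order by 1;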


Also, this approach requires suspending the cluster during table 
synchronization (or at least locking this table).
Synchronizing a table in the presence of active updates of this 
table seems to be a much more challenging task.


If somebody has already thought about this problem and has some plan or 
maybe even a ready solution for it, please share your thoughts.

Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Surjective functional indexes

2017-05-25 Thread Konstantin Knizhnik



On 25.05.2017 19:37, Tom Lane wrote:

Konstantin Knizhnik <k.knizh...@postgrespro.ru> writes:

My proposal is to check value of function for functional indexes instead
of just comparing set of effected attributes.
Obviously, for some complex functions it may  have negative effect on
update speed.
This is why I have added "surjective" option to index.

This seems overcomplicated.  We would have to compute the function
value at some point anyway.  Can't we refactor to do that earlier?

regards, tom lane



The check for affected indexes / applicability of HOT update and the update 
of the indexes themselves are done in two completely different parts of the 
code.
If we find out that the values of the indexed expressions have not changed, 
then we can use a HOT update and the indexes do not need to be updated 
(so the calculated value of the function is not needed). And this is 
expected to be the most frequent case.


Certainly, if the value of an indexed expression has changed, then we could 
avoid the redundant calculation of the function by storing the result of the 
calculation somewhere.
But it would greatly complicate the whole logic of updating indexes. Please 
notice that if we have several functional indexes and only one of them has 
actually changed,
then in any case we cannot use HOT and have to update all indexes. So we do 
not need to evaluate the values of all indexed expressions; we just need to 
find the first
changed one. So we would somehow have to keep track of which expression 
values have been calculated and which have not.


One more argument: originally, Postgres evaluates an index expression only 
once (when inserting the new version of the tuple into the index).
Now (with this patch) Postgres has to evaluate the expression three times in 
the worst case: calculate the value of the expression for the old and new 
tuples to make a decision about the HOT update,
and then evaluate it once again when performing the index update itself. 
Even if I managed to store the calculated value of the expression somewhere, 
we would still have to perform
twice as many evaluations as before. This is why, for expensive functions 
or for functions defined over frequently updated attributes (in the case of 
JSON), such a policy should be disabled.
For non-expensive functions the extra overhead is negligible. Also, there 
is no overhead at all if the indexed expression has not actually changed, 
and this is expected to be the most frequent case.


At least in the particular example of the YCSB benchmark, our first try 
was just to disable the index update by commenting out the corresponding 
check of the updated fields mask.
Obviously there are no extra function calculations in this case. Then I 
implemented this patch, and performance is almost the same.
This is why I think that simplicity and modularity of the code are more 
important here than the elimination of redundant function calculations.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





[HACKERS] Surjective functional indexes

2017-05-25 Thread Konstantin Knizhnik
Right now Postgres determines whether an update operation touches an index 
or not based only on the set of affected columns.
But in the case of functional indexes such a policy quite frequently leads 
to unnecessary index updates.
For example, functional indexes are widely used for indexing JSON data: 
info->>'name'.


JSON data may contain multiple attributes and only a few of them may be 
affected by an update.
Moreover, such an index is usually built for immutable attributes (like 
"id", "isbn", "name", ...).


Functions like (info->>'name') are called "surjective" in mathematics.
I have a strong feeling that most functional indexes are based on 
surjective functions.
For such indexes the current PostgreSQL index update policy is very 
inefficient. It causes HOT updates to be disabled

and so leads to a significant degradation of performance.

Without this patch Postgres is slower than Mongo on the YCSB benchmark with 
a (50% update, 50% select) workload.

And after applying this patch Postgres beats Mongo on all workloads.

My proposal is to check the value of the function for functional indexes 
instead of just comparing the set of affected attributes.
Obviously, for some complex functions this may have a negative effect on 
update speed.
This is why I have added a "surjective" option to indexes. By default it is 
switched on for all functional indexes (based on my assumption
that most functions used in functional indexes are surjective). But it is 
possible to explicitly disable it and make the decision whether the index
needs to be updated or not based only on the set of affected attributes, as 
sketched below.
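
Just to illustrate the intended usage (a hedged sketch; the table and indexed 
expression are hypothetical, and it assumes the reloption added by the patch 
is passed through the usual WITH clause of CREATE INDEX):

    create table doc (id serial primary key, info jsonb);
    -- default (surjective = true): the indexed expression is re-evaluated on
    -- update to check whether its value has really changed
    create index doc_name_idx on doc ((info->>'name'));
    -- disable it and decide based only on the set of affected columns
    create index doc_isbn_idx on doc ((info->>'isbn')) with (surjective = false);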


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/access/common/reloptions.c b/src/backend/access/common/reloptions.c
index 6d1f22f..37fc407 100644
--- a/src/backend/access/common/reloptions.c
+++ b/src/backend/access/common/reloptions.c
@@ -130,6 +130,15 @@ static relopt_bool boolRelOpts[] =
 	},
 	{
 		{
+			"surjective",
+			"Reevaluate functional index expression on update to check if its values is changed",
+			RELOPT_KIND_INDEX,
+			AccessExclusiveLock
+		},
+		true
+	},
+	{
+		{
 			"security_barrier",
 			"View acts as a row security barrier",
 			RELOPT_KIND_VIEW,
@@ -1301,7 +1310,7 @@ fillRelOptions(void *rdopts, Size basesize,
 break;
 			}
 		}
-		if (validate && !found)
+		if (validate && !found && options[i].gen->kinds != RELOPT_KIND_INDEX)
 			elog(ERROR, "reloption \"%s\" not found in parse table",
  options[i].gen->name);
 	}
diff --git a/src/backend/access/heap/heapam.c b/src/backend/access/heap/heapam.c
index e890e08..3525e3c 100644
--- a/src/backend/access/heap/heapam.c
+++ b/src/backend/access/heap/heapam.c
@@ -56,6 +56,7 @@
 #include "access/xlogutils.h"
 #include "catalog/catalog.h"
 #include "catalog/namespace.h"
+#include "catalog/index.h"
 #include "miscadmin.h"
 #include "pgstat.h"
 #include "storage/bufmgr.h"
@@ -73,7 +74,9 @@
 #include "utils/snapmgr.h"
 #include "utils/syscache.h"
 #include "utils/tqual.h"
-
+#include "utils/memutils.h"
+#include "nodes/execnodes.h"
+#include "executor/executor.h"
 
 /* GUC variable */
 bool		synchronize_seqscans = true;
@@ -4199,6 +4202,7 @@ l2:
 
 	if (use_hot_update)
 	{
+		elog(DEBUG1, "Use hot update");
 		/* Mark the old tuple as HOT-updated */
 		HeapTupleSetHotUpdated(&oldtup);
 		/* And mark the new tuple as heap-only */
@@ -4436,6 +4440,73 @@ HeapDetermineModifiedColumns(Relation relation, Bitmapset *interesting_cols,
 attnum - FirstLowInvalidHeapAttributeNumber);
 	}
 
+	if (hot_result && relation->rd_surjective)
+	{
+		ListCell   *l;
+		List	   *indexoidlist = RelationGetIndexList(relation);
+		EState *estate = CreateExecutorState();
+		ExprContext*econtext = GetPerTupleExprContext(estate);
+		TupleTableSlot *slot = MakeSingleTupleTableSlot(RelationGetDescr(relation));
+		Datum	   	old_values[INDEX_MAX_KEYS];
+		bool		old_isnull[INDEX_MAX_KEYS];
+		Datum	   	new_values[INDEX_MAX_KEYS];
+		bool		new_isnull[INDEX_MAX_KEYS];
+
+		econtext->ecxt_scantuple = slot;
+
+		foreach(l, indexoidlist)
+		{
+			Oid		indexOid = lfirst_oid(l);
+			Relation	indexDesc = index_open(indexOid, AccessShareLock);
+			IndexInfo  *indexInfo = BuildIndexInfo(indexDesc);
+			int i;
+
+			if (indexInfo->ii_Expressions && indexInfo->ii_Surjective)
+			{
+ResetExprContext(econtext);
+ExecStoreTuple(oldtup, slot, InvalidBuffer, false);
+FormIndexDatum(indexInfo,
+			   slot,
+			   estate,
+			   old_values,
+			   old_isnull);
+
+ExecStoreTuple(newtup, slot, InvalidBuffer, false);
+FormIndexDatum(indexInfo,
+			   slot,
+			   esta

Re: [HACKERS] Cached plans and statement generalization

2017-05-25 Thread Konstantin Knizhnik

On 10.05.2017 19:11, Konstantin Knizhnik wrote:


Based on the Robert's feedback and Tom's proposal I have implemented 
two new versions of autoprepare patch.


First version is just refactoring of my original implementation: I 
have extracted common code into prepare_cached_plan and 
exec_prepared_plan
function to avoid code duplication. Also I rewrote assignment of 
values to parameters. Now types of parameters are inferred from types 
of literals, so there may be several
prepared plans which are different only by types of parameters. Due to 
the problem with type coercion for parameters, I have to catch errors 
in parse_analyze_varparams.




Attached please find a rebased version of the autoprepare patch based on 
Tom's proposal (perform analysis of the tree with constant literals and then 
replace them with parameters).

Also, I have submitted this patch to the autumn commitfest.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index 95c1d3e..0b0642b 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -3710,6 +3710,454 @@ raw_expression_tree_walker(Node *node,
 }
 
 /*
+ * raw_expression_tree_mutator --- transform raw parse tree.
+ *
+ * This function is implementing slightly different approach for tree update than expression_tree_mutator().
+ * Callback is given pointer to pointer to the current node and can update this field instead of returning reference to new node.
+ * It makes it possible to remember changes and easily revert them without extra traversal of the tree.
+ *
+ * This function do not need QTW_DONT_COPY_QUERY flag: it never implicitly copy tree nodes, doing in-place update.
+ *
+ * Like raw_expression_tree_walker, there is no special rule about query
+ * boundaries: we descend to everything that's possibly interesting.
+ *
+ * Currently, the node type coverage here extends only to DML statements
+ * (SELECT/INSERT/UPDATE/DELETE) and nodes that can appear in them, because
+ * this is used mainly during analysis of CTEs, and only DML statements can
+ * appear in CTEs. If some other node is visited, iteration is immediately stopped and true is returned.
+ */
+bool
+raw_expression_tree_mutator(Node *node,
+			bool (*mutator) (),
+			void *context)
+{
+	ListCell   *temp;
+
+	/*
+	 * The walker has already visited the current node, and so we need only
+	 * recurse into any sub-nodes it has.
+	 */
+	if (node == NULL)
+		return false;
+
+	/* Guard against stack overflow due to overly complex expressions */
+	check_stack_depth();
+
+	switch (nodeTag(node))
+	{
+		case T_SetToDefault:
+		case T_CurrentOfExpr:
+		case T_Integer:
+		case T_Float:
+		case T_String:
+		case T_BitString:
+		case T_Null:
+		case T_ParamRef:
+		case T_A_Const:
+		case T_A_Star:
+			/* primitive node types with no subnodes */
+			break;
+		case T_Alias:
+			/* we assume the colnames list isn't interesting */
+			break;
+		case T_RangeVar:
+			return mutator(&((RangeVar *) node)->alias, context);
+		case T_GroupingFunc:
+			return mutator(&((GroupingFunc *) node)->args, context);
+		case T_SubLink:
+			{
+SubLink	   *sublink = (SubLink *) node;
+
+if (mutator(&sublink->testexpr, context))
+	return true;
+/* we assume the operName is not interesting */
+if (mutator(&sublink->subselect, context))
+	return true;
+			}
+			break;
+		case T_CaseExpr:
+			{
+CaseExpr   *caseexpr = (CaseExpr *) node;
+
+if (mutator(&caseexpr->arg, context))
+	return true;
+/* we assume mutator doesn't care about CaseWhens, either */
+foreach(temp, caseexpr->args)
+{
+	CaseWhen   *when = (CaseWhen *) lfirst(temp);
+
+	Assert(IsA(when, CaseWhen));
+	if (mutator(&when->expr, context))
+		return true;
+	if (mutator(&when->result, context))
+		return true;
+}
+if (mutator(&caseexpr->defresult, context))
+	return true;
+			}
+			break;
+		case T_RowExpr:
+			/* Assume colnames isn't interesting */
+			return mutator(&((RowExpr *) node)->args, context);
+		case T_CoalesceExpr:
+			return mutator(&((CoalesceExpr *) node)->args, context);
+		case T_MinMaxExpr:
+			return mutator(&((MinMaxExpr *) node)->args, context);
+		case T_XmlExpr:
+			{
+XmlExpr	   *xexpr = (XmlExpr *) node;
+
+if (mutator(&xexpr->named_args, context))
+	return true;
+/* we assume mutator doesn't care about arg_names */
+if (mutator(&xexpr->args, context))
+	return true;
+			}
+			break;
+		case T_NullTest:
+			return mutator(&((NullTest *) node)->arg, context);
+		case T_BooleanTest:
+			return mutator(&((BooleanTest *) node)->arg, context);
+		case T_JoinExpr:
+			{
+JoinExpr   *join = (JoinExpr *) node;
+
+if (mutator(&join->larg, context))
+	return true;
+if (mutator(&join->rarg, context))
+	return true;
+if (mutator(&join->quals, context))
+	return true;
+		

Re: [HACKERS] Cached plans and statement generalization

2017-05-18 Thread Konstantin Knizhnik



On 15.05.2017 18:31, Robert Haas wrote:

On Wed, May 10, 2017 at 12:11 PM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

Robert, can you please explain why using TRY/CATCH is not safe here:

This is definitely not a safe way of using TRY/CATCH.

This has been discussed many, many times on this mailing list before,
and I don't really want to go over it again here.  We really need a
README or some documentation about this so that we don't have to keep
answering this same question again and again.

First of all, I want to note that the new version of my patch does not use 
PG_TRY/PG_CATCH.

But I still want to clarify for myself what is wrong with this construction.
I searched both the hackers mailing list archive and the web using Google 
but failed to find any references except about

sharing non-volatile variables between try and catch blocks.
Can you please point me to the thread where this problem was discussed 
or just explain in a few words the source of the problem?


From my own experience I found out that the PG_TRY/PG_CATCH mechanism does 
not provide proper cleanup (unlike C++ exceptions).
If there are opened relations, catalog cache entries, ... then throwing an 
error will not release them.
It causes no problems if the error is handled in PostgresMain, which 
aborts the current transaction and releases all resources in any case.
But if I want to ignore this error and continue query execution, then 
warnings about resource leaks can be reported.

Is this what you mean by the unsafety of the PG_TRY/PG_CATCH construction?

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Cached plans and statement generalization

2017-05-12 Thread Konstantin Knizhnik



On 12.05.2017 18:23, Bruce Momjian wrote:

On Fri, May 12, 2017 at 10:50:41AM +0300, Konstantin Knizhnik wrote:

Definitely changing session context (search_path, date/time format, ...) may
cause incorrect behavior of cached statements.

I wonder if we should clear the cache whenever any SET command is
issued.


Well, with the autoprepare cache reset on each SET of a variable, ALTER 
SYSTEM and any slow utility statement,
only one regression test fails. And only because of a different 
error message context:


*** /home/knizhnik/postgresql.master/src/test/regress/expected/date.out 
2017-04-11 18:07:56.497461208 +0300
--- /home/knizhnik/postgresql.master/src/test/regress/results/date.out 
2017-05-12 20:21:19.767566302 +0300

***
*** 1443,1452 
  --
  SELECT EXTRACT(MICROSEC  FROM DATE 'infinity'); -- ERROR: 
timestamp units "microsec" not recognized

  ERROR:  timestamp units "microsec" not recognized
- CONTEXT:  SQL function "date_part" statement 1
  SELECT EXTRACT(UNDEFINED FROM DATE 'infinity'); -- ERROR: 
timestamp units "undefined" not supported

  ERROR:  timestamp units "undefined" not supported
- CONTEXT:  SQL function "date_part" statement 1
  -- test constructors
  select make_date(2013, 7, 15);
   make_date
--- 1443,1450 

==




Actually you may get the same problem with explicitly prepared statements
(certainly, in the last case, you better understand what going on and it is
your choice whether to use or not to use prepared statement).

The fact of failure of 7 regression tests means that autoprepare can really
change behavior of existed program. This is why my suggestion is  to switch
off this feature by default.

I would like to see us target something that can be enabled by default.
Even if it only improves performance by 5%, it would be better overall
than a feature that improves performance by 90% but is only used by 1%
of our users.


I have to agree with you here.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Cached plans and statement generalization

2017-05-12 Thread Konstantin Knizhnik



On 12.05.2017 03:58, Bruce Momjian wrote:

On Thu, May 11, 2017 at 10:41:45PM +0300, Konstantin Knizhnik wrote:

This is why I have provided second implementation which replace
literals with parameters after raw parsing.  Certainly it is slower
than first approach. But still provide significant advantage in
performance: more than two times at pgbench.  Then I tried to run
regression tests and find several situations where type analysis is
not correctly performed in case of replacing literals with parameters.

So the issue is that per-command output from the parser, SelectStmt,
only has strings for identifers, e.g. table and column names, so you
can't be sure it is the same as the cached entry you matched.  I suppose
if you cleared the cache every time someone created an object or changed
search_path, it might work.

Definitely, changing the session context (search_path, date/time format, ...) 
may cause incorrect behavior of cached statements; a minimal illustration of 
the search_path hazard is sketched below.
Actually, you may get the same problem with explicitly prepared 
statements (certainly, in the latter case you better understand what is 
going on, and it is your choice whether or not to use a prepared 
statement).
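
Just to illustrate the search_path hazard (a hedged sketch; schema and table 
names are hypothetical):

    set search_path = s1;
    select * from t where id = 1;   -- generalized and cached, resolved to s1.t
    set search_path = s2;
    select * from t where id = 2;   -- without invalidation, the cached plan
                                    -- would still reference s1.t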


The fact that 7 regression tests fail means that autoprepare can 
really change the behavior of existing programs. This is why my suggestion 
is to switch this feature off by default.
But in 99.9% of real cases (my estimation, plucked out of thin air :) there 
will be no such problems with autoprepare. And it can significantly 
improve the performance of OLTP applications
which are not able to use prepared statements (because they work 
through pgbouncer, or for any other reason).


Can autoprepare slow down the system?
Yes, it can. It can happen if the application performs a large number of 
unique queries and the autoprepare cache size is not limited.
In this case a large (and infinitely growing) number of stored plans can 
consume a lot of memory and, what is even worse, slow down cache lookup.
This is why, by default, I limit the number of cached statements 
(the autoprepare_limit parameter) to 100.


I am almost sure that there will be some other issues with autoprepare 
which I have not encountered yet (because I mostly tested it on pgbench 
and the Postgres regression tests).
But I am also sure that the benefit of doubling system performance is good 
motivation to continue working in this direction.


My main concern is whether to continue to improve the current approach with 
a local (per-backend) cache of prepared statements,
or to create a shared cache (as in Oracle). It is much more difficult to 
implement a shared cache (the same problems with session context, different 
catalog snapshots, cache invalidation, ...),
but it also provides more opportunities for query optimization and 
tuning.






--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Cached plans and statement generalization

2017-05-11 Thread Konstantin Knizhnik

On 05/11/2017 10:52 PM, Andres Freund wrote:

On 2017-05-11 22:48:26 +0300, Konstantin Knizhnik wrote:

On 05/11/2017 09:31 PM, Tom Lane wrote:

Bruce Momjian <br...@momjian.us> writes:

Good point.  I think we need to do some measurements to see if the
parser-only stage is actually significant.  I have a hunch that
commercial databases have much heavier parsers than we do.

FWIW, gram.y does show up as significant in many of the profiles I take.
I speculate that this is not so much that it eats many CPU cycles, as that
the constant tables are so large as to incur lots of cache misses.  scan.l
is not quite as big a deal for some reason, even though it's also large.

regards, tom lane

Yes, my results shows that pg_parse_query adds not so much overhead:
206k TPS for my first variant with string literal substitution and modified 
query text used as hash key vs.
181k. TPS for version with patching raw parse tree constructed by 
pg_parse_query.

Those numbers and your statement seem to contradict each other?


Oops, my parse error :( I misread Tom's statement.
Actually, I also was afraid that the price of parsing was large enough, and 
this is why my first attempt was to avoid parsing.
But then I found out that most of the time is spent in analysis and planning 
(see the attached profile):

pg_parse_query: 4.23%
pg_analyze_and_rewrite: 8.45%
pg_plan_queries: 15.49%



- Andres



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: [HACKERS] Cached plans and statement generalization

2017-05-11 Thread Konstantin Knizhnik

On 05/11/2017 09:31 PM, Tom Lane wrote:

Bruce Momjian <br...@momjian.us> writes:

Good point.  I think we need to do some measurements to see if the
parser-only stage is actually significant.  I have a hunch that
commercial databases have much heavier parsers than we do.

FWIW, gram.y does show up as significant in many of the profiles I take.
I speculate that this is not so much that it eats many CPU cycles, as that
the constant tables are so large as to incur lots of cache misses.  scan.l
is not quite as big a deal for some reason, even though it's also large.

regards, tom lane

Yes, my results show that pg_parse_query does not add that much overhead:
206k TPS for my first variant with string literal substitution and the 
modified query text used as the hash key, vs.
181k TPS for the version that patches the raw parse tree constructed by 
pg_parse_query.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Cached plans and statement generalization

2017-05-11 Thread Konstantin Knizhnik

On 05/11/2017 06:12 PM, Bruce Momjian wrote:

On Wed, May 10, 2017 at 07:11:07PM +0300, Konstantin Knizhnik wrote:

I am going to continue work on this patch I will be glad to receive any
feedback and suggestions for its improvement.
In most cases, applications are not accessing Postgres directly, but using
some connection pooling layer and so them are not able to use prepared
statements.
But at simple OLTP Postgres spent more time on building query plan than on
execution itself. And it is possible to speedup Postgres about two times at
such workload!
Another alternative is true shared plan cache.  May be it is even more
perspective approach, but definitely much more invasive and harder to
implement.

Can we back up and get an overview of what you are doing and how you are
doing it?  Our TODO list suggests this order for successful patches:

Desirability -> Design -> Implement -> Test -> Review -> Commit

You kind of started at the Implementation/patch level, which makes it
hard to evaluate.

I think everyone agrees on the Desirability of the feature, but the
Design is the tricky part.  I think the design questions are:

*  What information is stored about cached plans?
*  How are the cached plans invalidated?
*  How is a query matched against a cached plan?

Looking at the options, ideally the plan would be cached at the same
query stage as the stage where the incoming query is checked against the
cache.  However, caching and checking at the same level offers no
benefit, so they are going to be different.  For example, caching a
parse tree at the time it is created, then checking at the same point if
the incoming query is the same doesn't help you because you already had
to create the parse tree get to that point.

A more concrete example is prepared statements.  They are stored at the
end of planning and matched in the parser.  However, you can easily do
that since the incoming query specifies the name of the prepared query,
so there is no trick to matching.

The desire is to cache as late as possible so you cache more work and
you have more detail about the referenced objects, which helps with
cache invalidation.  However, you also want to do cache matching as
early as possible to improve performance.

So, let's look at some options.  One interesting idea from Doug Doole
was to do it between the tokenizer and parser.  I think they are glued
together so you would need a way to run the tokenizer separately and
compare that to the tokens you stored for the cached plan.  The larger
issue is that prepared plans already are checked after parsing, and we
know they are a win, so matching any earlier than that just seems like
overkill and likely to lead to lots of problems.

So, you could do it after parsing but before parse-analysis, which is
kind of what prepared queries do.  One tricky problem is that we don't
bind the query string tokens to database objects until after parse
analysis.

Doing matching before parse-analysis is going to be tricky, which is why
there are so many comments about the approach.  Changing search_path can
certainly affect it, but creating objects in earlier-mentioned schemas
can also change how an object reference in a query is resolved.  Even
obscure things like the creation of a new operator that has higher
precedence in the query could change the plan, though am not sure if
our prepared query system even handles that properly.

Anyway, that is my feedback.  I would like to get an overview of what
you are trying to do and the costs/benefits of each option so we can
best guide you.


Sorry, for luck of overview.
I have started with small prototype just to investigate if such optimization 
makes sense or not.
When I get more than two time advantage in performance on standard pgbench, I 
come to conclusion that this
optimization can be really very useful and now try to find the best way of its 
implementation.

I have started with simplest approach when string literals are replaced with 
parameters. It is done before parsing.
And can be done very fast - just need to locate data in quotes.
But this approach is not safe and universal: you will not be able to speedup 
most of the existed queries without rewriting them.

This is why I provided a second implementation, which replaces literals with
parameters after raw parsing.
Certainly it is slower than the first approach, but it still provides a significant
performance advantage: more than two times at pgbench.
Then I tried to run the regression tests and found several situations where type
analysis is not performed correctly when literals are replaced with parameters.

So my third attempt is to replace constant nodes with parameters in the already
analyzed tree.

Now answering your questions:

*  What information is stored about cached plans?

The key used to locate a cached plan is the raw parse tree. The value is the saved CachedPlanSource.

*  How are the cached plans invalidated?

In the same way as plans for explicitly prepared statements.

Re: [HACKERS] Cached plans and statement generalization

2017-05-10 Thread Konstantin Knizhnik

On 02.05.2017 21:26, Robert Haas wrote:


I am sympathetic to the fact that this is a hard problem to solve.
I'm just telling you that the way you've got it is not acceptable and
nobody's going to commit it like that (or if they do, they will end up
having to revert it).  If you want to have a technical discussion
about what might be a way to change the patch to be more acceptable,
cool, but I don't want to get into a long debate about whether what
you have is acceptable or not; I've already said what I think about
that and I believe that opinion will be widely shared.  I am not
trying to beat you up here, just trying to be clear.




Based on Robert's feedback and Tom's proposal, I have implemented two
new versions of the autoprepare patch.


The first version is just a refactoring of my original implementation: I have
extracted the common code into prepare_cached_plan and exec_prepared_plan
functions to avoid code duplication. I also rewrote the assignment of values
to parameters. Types of parameters are now inferred from the types of the
literals, so there may be several prepared plans which differ only in the
types of their parameters. Because of a problem with type coercion for
parameters, I have to catch errors in parse_analyze_varparams.


Robert, can you please explain why using TRY/CATCH is not safe here:

This is definitely not a safe way of using TRY/CATCH.
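
(For context, the usual way to trap errors safely in backend C code - the pattern PL/pgSQL itself uses - is to wrap the risky call in an internal subtransaction so that the catch block can roll everything back cleanly. A rough sketch only, reusing the variable names from the quoted patch fragment and assuming they are declared in the surrounding code; this is not what the patch currently does:

MemoryContext oldcontext = CurrentMemoryContext;
ResourceOwner oldowner = CurrentResourceOwner;

BeginInternalSubTransaction(NULL);
MemoryContextSwitchTo(oldcontext);

PG_TRY();
{
    query = parse_analyze_varparams(parse_tree, query_string,
                                    &param_types, &num_params);

    /* success: release the inner subtransaction */
    ReleaseCurrentSubTransaction();
    MemoryContextSwitchTo(oldcontext);
    CurrentResourceOwner = oldowner;
}
PG_CATCH();
{
    /* discard the error state, then abort the inner subtransaction */
    MemoryContextSwitchTo(oldcontext);
    FlushErrorState();

    RollbackAndReleaseCurrentSubTransaction();
    MemoryContextSwitchTo(oldcontext);
    CurrentResourceOwner = oldowner;

    /* ... fall back to normal, non-autoprepared query processing ... */
}
PG_END_TRY();
)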


Second version is based on Tom's suggestion:

Personally I'd think about
replacing the entire literal-with-cast construct with a Param having
already-known type.
So here I first patch the raw parse tree, replacing A_Const nodes with ParamRef
nodes. This generalized tree is needed to perform the cache lookup.
Then I restore the original raw parse tree and call pg_analyze_and_rewrite.
Then I mutate the analyzed tree, replacing Const nodes with Param nodes.
At this point type coercion has already been performed, so I know the precise
types which should be used for the parameters.
It is a more sophisticated approach, and I can not extract the common code
into prepare_cached_plan, because preparing the plan is currently a mix of
steps done in exec_simple_query and exec_parse_message.

But there is no need to catch analyze errors.
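
A minimal sketch of this Const-to-Param mutation step over the analyzed tree, using the standard expression_tree_mutator; the function name and the param_types list are illustrative, not taken from the patch:

#include "postgres.h"
#include "nodes/nodeFuncs.h"

/* Replace every Const in an analyzed expression with a PARAM_EXTERN Param,
 * collecting the parameter type OIDs in *param_types along the way.  Since
 * the tree has already been through parse analysis, consttype/consttypmod
 * give the exact types the parameters must have. */
static Node *
const_to_param_mutator(Node *node, List **param_types)
{
	if (node == NULL)
		return NULL;

	if (IsA(node, Const))
	{
		Const	   *c = (Const *) node;
		Param	   *p = makeNode(Param);

		p->paramkind = PARAM_EXTERN;
		p->paramid = list_length(*param_types) + 1;
		p->paramtype = c->consttype;
		p->paramtypmod = c->consttypmod;
		p->paramcollid = c->constcollid;
		p->location = c->location;

		*param_types = lappend_oid(*param_types, c->consttype);
		return (Node *) p;
	}

	return expression_tree_mutator(node, const_to_param_mutator,
								   (void *) param_types);
}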

Finally, the performance of both approaches is the same: at pgbench it is
180k TPS on read-only queries, compared with 80k TPS for non-prepared queries.
In both cases 7 out of 177 regression tests do not pass (mostly because the
session environment is changed between subsequent query executions).


I am going to continue work on this patch and will be glad to receive any
feedback and suggestions for its improvement.
In most cases, applications are not accessing Postgres directly, but through
some connection pooling layer, and so they are not able to use prepared
statements.
But for simple OLTP workloads Postgres spends more time on building the query
plan than on the execution itself, and it is possible to speed up Postgres
about two times on such a workload!
Another alternative is a true shared plan cache. It may be an even more
promising approach, but it is definitely much more invasive and harder to
implement.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index 6e52eb7..f2eb0f5 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -3696,6 +3696,454 @@ raw_expression_tree_walker(Node *node,
 }
 
 /*
+ * raw_expression_tree_mutator --- transform a raw parse tree.
+ *
+ * This function implements a slightly different approach to tree updates than expression_tree_mutator():
+ * the callback is given a pointer to the pointer to the current node and can update that field instead of returning a reference to a new node.
+ * This makes it possible to remember changes and easily revert them without an extra traversal of the tree.
+ *
+ * This function does not need the QTW_DONT_COPY_QUERY flag: it never implicitly copies tree nodes; it updates them in place.
+ * 
+ * Like raw_expression_tree_walker, there is no special rule about query
+ * boundaries: we descend to everything that's possibly interesting.
+ *
+ * Currently, the node type coverage here extends only to DML statements
+ * (SELECT/INSERT/UPDATE/DELETE) and nodes that can appear in them, because
+ * this is used mainly during analysis of CTEs, and only DML statements can
+ * appear in CTEs. If some other node is visited, iteration is immediately stopped and true is returned.
+ */
+bool
+raw_expression_tree_mutator(Node *node,
+			bool (*mutator) (),
+			void *context)
+{
+	ListCell   *temp;
+
+	/*
+	 * The walker has already visited the current node, and so we need only
+	 * recurse into any sub-nodes it has.
+	 */
+	if (node == NULL)
+		return false;
+
+	/* Guard against stack overflow due to overly complex expressions */
+	check_stack_depth();
+
+	switch (nodeTag(node))
+	{
+		case T_SetToDefault:
+		case T_CurrentOfExpr:
+		case T_Integer:
+		case T_Float:
+		case T_String:
+		case T_BitString:
+		case T_Null:
+		case
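
To illustrate the pointer-to-pointer convention described in the header comment above, here is a hypothetical callback; the exact callback signature is hidden behind the old-style "bool (*mutator) ()" declaration, so the Node ** argument and all names below are assumptions rather than code from the patch. It replaces A_Const nodes with ParamRef nodes in place and records every change, so the original tree can be restored after the cache lookup:

#include "postgres.h"
#include "nodes/parsenodes.h"

typedef struct ConstPatch
{
	Node	  **ref;			/* location in the tree that was overwritten */
	Node	   *old_node;		/* the original A_Const */
} ConstPatch;

static bool
replace_consts_with_params(Node **ref, List **patches)
{
	Node	   *node = *ref;

	if (node == NULL)
		return false;

	if (IsA(node, A_Const))
	{
		ConstPatch *patch = (ConstPatch *) palloc(sizeof(ConstPatch));
		ParamRef   *param = makeNode(ParamRef);

		param->number = list_length(*patches) + 1;
		param->location = ((A_Const *) node)->location;

		patch->ref = ref;
		patch->old_node = node;
		*patches = lappend(*patches, patch);

		*ref = (Node *) param;	/* in-place update, no node copy */
		return false;
	}

	/* recurse into sub-nodes of everything else */
	return raw_expression_tree_mutator(node, replace_consts_with_params,
									   (void *) patches);
}

Reverting is then just a matter of walking the patches list and doing *patch->ref = patch->old_node, which is exactly the property the comment advertises.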

Re: [HACKERS] Why type coercion is not performed for parameters?

2017-05-05 Thread Konstantin Knizhnik



On 05.05.2017 13:29, Marko Tiikkaja wrote:


But you know that the type of the literal "10" is int. If you're 
throwing that information away, surely that's a bug in your code.



Yes, in case of an integer literal I can easily determine the parameter type.
But in case of a string literal I have to set the UNKNOWNOID type, otherwise a
lot of queries will not work.





.m


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



[HACKERS] Why type coercion is not performed for parameters?

2017-05-05 Thread Konstantin Knizhnik

Hi hackers,

If I evaluate the expression typename('literal'), then type coercion is
performed and the function is successfully resolved, e.g.


SELECT regnamespace('"pg_catalog"');

But if I want to prepare this query, I get the error:

postgres=#  prepare foo as SELECT regnamespace($1);
ERROR:  function regnamespace(unknown) does not exist
LINE 1: prepare foo as SELECT regnamespace($1);

Certainly, I can explicitly specify parameter type:

prepare foo (text) as SELECT regnamespace($1);

and it will work. But it is not always possible.

Actually, the coerce_type function can handle parameters normally,
but func_get_detail allows this coercion only for constants:


if (sourceType == UNKNOWNOID && IsA(arg1, Const))
{
/* always treat typename('literal') as coercion */
iscoercion = true;
}

If this condition is changed to:

if (sourceType == UNKNOWNOID && (IsA(arg1, Const) || 
IsA(arg1, Param)))


then the example above will normally work.

Why do I need it? I want to implement autoprepare.
My original intention was to let parse_analyze_varparams infer the types
of parameters from the context.
But that is not always possible, and it sometimes leads to different query
behavior.

For example if the query:

 select count(*) from test_range_gist where ir @> 10;

is replaced with

 select count(*) from test_range_gist where ir @> $1;

then the type of the parameter will be int4range rather than int, which
corresponds to a different operator.


This is why I now infer the parameter type from the literal value. But in this
case I get errors in parse_analyze_varparams, which is not able to resolve
some functions.
The fix in the func_get_detail function solves the problem and doesn't cause
any new issues: all regression tests pass.


So my question is whether it is possible to use the same rule for type
coercion of parameters as for constants?


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in prepared statement cache invalidation?

2017-05-02 Thread Konstantin Knizhnik

On 05/02/2017 09:30 PM, Robert Haas wrote:

I am not sure how critical this problem is. It definitely happens rarely,
but the lack of normal workarounds (restart the backend, recreate the function?)
seems disappointing.

The problem goes away if you reconnect.  The problematic cache is only
backend-lifetime.


Most clients are not connected to Postgres directly; they are using some kind
of connection pooling.
It means that backends are never restarted, and it would be necessary to restart
the whole service just because we do not have a dependency tracking mechanism
for PL code. Even invalidation of all functions in case of DDL seems to be a
more acceptable solution.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Bug in prepared statement cache invalidation?

2017-05-02 Thread Konstantin Knizhnik



On 01.05.2017 16:09, Robert Haas wrote:


This problem has been discussed before but nobody's done anything
about it.  The problem is a bit tricky because the core system doesn't
know anything about the function caches maintained by individual PLs.
I suppose ideally there'd be a way for a PL to say "if the definition
of X changes, please tell me to recompile function Y".  That probably
wouldn't be perfect because the PL might not be able to figure out
everything on which they actually depend; that might be
Turing-complete in some cases.  But even a partial solution would
probably be welcomed by users.



Thank you for the explanation.
Maybe I am missing something, but what is the problem with keeping
dependencies for PL functions?
As you wrote, a PL can inform the core that a function depends on some set of
relations/types/functions and so has to be recompiled if any of them are
changed.
It is not necessary to build the closure of the dependency graph - cascade
invalidation can be used instead.
So it is not clear to me where you see the source of complexity here and why
this task may be "Turing-complete in some cases".


The problem can arise with overloaded functions and PL languages without
static type checking.
In this case resolution has to be performed at runtime during function
evaluation. But there should be no such problem with PL/pgSQL.


But introducing such a dependency tracking mechanism for PLs will definitely
require a lot of changes in all PL implementations. It looks like no easy fix
is possible here.
I am not sure how critical this problem is. It definitely happens rarely, but
the lack of normal workarounds (restart the backend, recreate the function?)
seems disappointing.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cached plans and statement generalization

2017-05-02 Thread Konstantin Knizhnik



On 01.05.2017 18:52, Robert Haas wrote:
On Fri, Apr 28, 2017 at 6:01 AM, Konstantin Knizhnik 
<k.knizh...@postgrespro.ru <mailto:k.knizh...@postgrespro.ru>> wrote:



Any comments and suggestions for future improvement of this patch
are welcome.


+PG_TRY();
+{
+query = parse_analyze_varparams(parse_tree,
+query_string,
+&param_types,
+&num_params);
+}
+PG_CATCH();
+{
+/*
+ * In case of analyze errors revert back to original 
query processing
+ * and disable autoprepare for this query to avoid such 
problems in future.

+ */
+FlushErrorState();
+if (snapshot_set) {
+PopActiveSnapshot();
+}
+entry->disable_autoprepare = true;
+undo_query_plan_changes(parse_tree, const_param_list);
+MemoryContextSwitchTo(old_context);
+return false;
+}
+PG_END_TRY();

This is definitely not a safe way of using TRY/CATCH.

+
+/* Convert literal value to parameter value */
+switch (const_param->literal->val.type)
+{
+  /*
+   * Convert from integer literal
+   */
+  case T_Integer:
+switch (ptype) {
+  case INT8OID:
+params->params[paramno].value = 
Int64GetDatum((int64)const_param->literal->val.val.ival);

+break;
+  case INT4OID:
+params->params[paramno].value = 
Int32GetDatum((int32)const_param->literal->val.val.ival);

+break;
+  case INT2OID:
+if (const_param->literal->val.val.ival < SHRT_MIN
+|| const_param->literal->val.val.ival > SHRT_MAX)
+{
+ereport(ERROR,
+ (errcode(ERRCODE_NUMERIC_VALUE_OUT_OF_RANGE),
+ errmsg("smallint out of range")));
+}
+params->params[paramno].value = 
Int16GetDatum((int16)const_param->literal->val.val.ival);

+break;
+  case FLOAT4OID:
+params->params[paramno].value = 
Float4GetDatum((float)const_param->literal->val.val.ival);

+break;
+  case FLOAT8OID:
+params->params[paramno].value = 
Float8GetDatum((double)const_param->literal->val.val.ival);

+break;
+  case INT4RANGEOID:
+sprintf(buf, "[%ld,%ld]", 
const_param->literal->val.val.ival, const_param->literal->val.val.ival);

+getTypeInputInfo(ptype, &typinput, &typioparam);
+params->params[paramno].value = 
OidInputFunctionCall(typinput, buf, typioparam, -1);

+break;
+  default:
+ pg_lltoa(const_param->literal->val.val.ival, buf);
+getTypeInputInfo(ptype, &typinput, &typioparam);
+params->params[paramno].value = 
OidInputFunctionCall(typinput, buf, typioparam, -1);

+}
+break;
+  case T_Null:
+params->params[paramno].isnull = true;
+break;
+  default:
+/*
+ * Convert from string literal
+ */
+getTypeInputInfo(ptype, &typinput, &typioparam);
+params->params[paramno].value = 
OidInputFunctionCall(typinput, const_param->literal->val.val.str, 
typioparam, -1);

+}

I don't see something with a bunch of hard-coded rules for particular 
type OIDs having any chance of being acceptable.




Well, what I need is to convert a literal value represented in a Value
struct to a parameter datum value.

The Value struct contains a union with an integer literal and text.
So this piece of code just provides efficient handling of the most common
cases (integer parameters) and uses the type's input function in other cases.



This patch seems to duplicate a large amount of existing code.  That 
would be a good thing to avoid.


Yes, I have to copy a lot of code from the exec_parse_message,
exec_bind_message and exec_execute_message functions.
Copying code is definitely a bad flaw. It would be much better and easier just
to call the three original functions instead of gathering their code into the
new function.

But I failed to do it because
1. Autoprepare should be integrated into exec_simple_query. Before executing
the query in the normal way, I need to perform a cache lookup for a previously
prepared plan for this generalized query. And generalization of the query
requires building the query tree (query parsing). In other words, parsing
should be done before I can call exec_parse_message.
2. exec_b

Re: [HACKERS] Cached plans and statement generalization

2017-04-28 Thread Konstantin Knizhnik



On 26.04.2017 13:46, Pavel Stehule wrote:


I attach a new patch which allows limiting the number of
autoprepared statements (the autoprepare_limit GUC variable).
Also I did more measurements, now with several concurrent
connections and read-only statements.
Results of pgbench with 10 connections, scale 10 and read-only
statements are below:

Protocol             TPS
extended             87k
prepared             209k
simple+autoprepare   206k


As you can see, autoprepare provides more than a 2 times speed
improvement.

Also I tried to measure the overhead of parsing (to be able to
substitute all literals, not only string literals).
I just added an extra call of pg_parse_query. Speed is reduced to 181k.
So the overhead is noticeable, but it still makes such an optimization useful.
This is why I want to ask the question: is it better to implement the
slower but safer and more universal solution?


An unsafe solution does not make any sense, and it is dangerous (80% of
database users do not have the necessary knowledge). If somebody needs the
maximum possible performance, then he can use explicit prepared statements.




I attached a new patch to this mail. I have completely reimplemented my
original approach and now use a parse tree transformation.

New pgbench (-S -c 10) results are the following:

Protocol             TPS
extended             87k
prepared             209k
simple+autoprepare   185k


So there is some slowdown compared with my original implementation and with
explicitly prepared statements, but it still provides a more than two times
speed-up compared with unprepared queries. And it doesn't require changing
existing applications.
As most real production applications work with the DBMS through some
connection pool (pgbouncer, ...), I think that such an optimization will be
useful.
Isn't it interesting if we can increase system throughput almost two times by
just setting one parameter in the configuration file?


I also tried to enable autoprepare by default and run the regression tests.
7 tests do not pass, for the following reasons:
1. Slightly different error reporting (for example, the error location is not
always specified identically).
2. Differences in query behavior caused by changed local settings (Andres
gives an example with search_path, and the date test fails because of a
changed datestyle).
3. Problems with indirect dependencies (when a table is altered, only cached
plans directly depending on this relation are invalidated, but not plans with
indirect dependencies).

4. Domain checks are not performed for null values.

I do not think that these issues can cause problems for real applications.

Also it is possible to limit the number of autoprepared statements using the
autoprepare_limit parameter, to avoid possible backend memory overflow in case
of a large number of unique queries sent by an application. An LRU discipline
is used to drop the least recently used plans.
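
A rough sketch of how such an LRU discipline can be wired up with the backend's dlist machinery, assuming the plan_cache_entry struct (with its dlist_node lru field) and the autoprepare_limit GUC from the attached patch; the helper names and the global list are illustrative:

#include "postgres.h"
#include "lib/ilist.h"
#include "utils/hsearch.h"
#include "utils/plancache.h"

extern int	autoprepare_limit;			/* GUC introduced by the patch */

/* Most recently used entries live at the head; the tail is the victim. */
static dlist_head plan_cache_lru = DLIST_STATIC_INIT(plan_cache_lru);
static int	plan_cache_used = 0;

/* Move an entry to the head of the LRU list on every cache hit. */
static void
plan_cache_touch(plan_cache_entry *entry)
{
	dlist_delete(&entry->lru);
	dlist_push_head(&plan_cache_lru, &entry->lru);
}

/* Drop the least recently used entry once autoprepare_limit is reached. */
static void
plan_cache_evict_if_needed(HTAB *plan_cache)
{
	if (autoprepare_limit > 0 && plan_cache_used >= autoprepare_limit)
	{
		plan_cache_entry *victim =
			dlist_container(plan_cache_entry, lru,
							dlist_tail_node(&plan_cache_lru));

		dlist_delete(&victim->lru);
		DropCachedPlan(victim->plan);	/* frees the CachedPlanSource */
		hash_search(plan_cache, victim, HASH_REMOVE, NULL);
		plan_cache_used--;
	}
}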


Any comments and suggestions for future improvement of this patch are 
welcome.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/nodes/nodeFuncs.c b/src/backend/nodes/nodeFuncs.c
index cd39167..4fbc8b7 100644
--- a/src/backend/nodes/nodeFuncs.c
+++ b/src/backend/nodes/nodeFuncs.c
@@ -3610,6 +3610,454 @@ raw_expression_tree_walker(Node *node,
 }
 
 /*
+ * raw_expression_tree_mutator --- transform a raw parse tree.
+ *
+ * This function implements a slightly different approach to tree updates than expression_tree_mutator():
+ * the callback is given a pointer to the pointer to the current node and can update that field instead of returning a reference to a new node.
+ * This makes it possible to remember changes and easily revert them without an extra traversal of the tree.
+ *
+ * This function does not need the QTW_DONT_COPY_QUERY flag: it never implicitly copies tree nodes; it updates them in place.
+ * 
+ * Like raw_expression_tree_walker, there is no special rule about query
+ * boundaries: we descend to everything that's possibly interesting.
+ *
+ * Currently, the node type coverage here extends only to DML statements
+ * (SELECT/INSERT/UPDATE/DELETE) and nodes that can appear in them, because
+ * this is used mainly during analysis of CTEs, and only DML statements can
+ * appear in CTEs. If some other node is visited, iteration is immediately stopped and true is returned.
+ */
+bool
+raw_expression_tree_mutator(Node *node,
+			bool (*mutator) (),
+			void *context)
+{
+	ListCell   *temp;
+
+	/*
+	 * The walker has already visited the current node, and so we need only
+	 * recurse into any sub-nodes it has.
+	 */
+	if (node == NULL)
+		return false;
+
+	/* Guard against stack overflow due to overly complex expressions */
+	check_stack_depth();
+
+	switch (nodeTag(node))
+	{
+		case T_SetToDefault:
+		case T_CurrentOfExpr:
+		case T_Integer:
+		case T_Float:
+		case T_String:
+		case T_BitString:
+		case T_Null:
+		case T_ParamRef:
+		case T_A_Const:
+		case

[HACKERS] Bug in prepared statement cache invalidation?

2017-04-28 Thread Konstantin Knizhnik

Hi hackers,

I found out that Postgres now correctly invalidates prepared plans which
directly depend on an altered relation, but doesn't invalidate plans having
transitive (indirect) dependencies.

Is it a bug or a feature?

postgres=# create table foo(x integer);
CREATE TABLE
postgres=# select * from foo;
 x
---
(0 rows)

postgres=# create function returnqueryf()returns setof foo  as $$ begin 
return query select * from foo; end; $$ language plpgsql;

CREATE FUNCTION
postgres=# select * from returnqueryf();
 x
---
(0 rows)

postgres=# create function returnqueryff()returns setof foo  as $$ begin 
return query select * from returnqueryf(); end; $$ language plpgsql;

CREATE FUNCTION
postgres=# select * from returnqueryff();
 x
---
(0 rows)

postgres=# alter table foo add column y integer;
ALTER TABLE
postgres=# select * from foo;
 x | y
---+---
(0 rows)

postgres=# select * from returnqueryf();
 x | y
---+---
(0 rows)

postgres=# select * from returnqueryff();
ERROR:  structure of query does not match function result type
DETAIL:  Number of returned columns (1) does not match expected column 
count (2).

CONTEXT:  PL/pgSQL function returnqueryff() line 1 at RETURN QUERY
p

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cached plans and statement generalization

2017-04-26 Thread Konstantin Knizhnik

On 04/26/2017 08:08 PM, Doug Doole wrote:


A naive option would be to invalidate anything that depends on table or 
view *.FOOBAR. You could probably make it a bit smarter by also requiring that 
schema A appear in the path.


This has been rumbling around in my head. I wonder if you could solve this 
problem by registering dependencies on objects which don't yet exist. Consider:

CREATE TABLE C.T1(...);
CREATE TABLE C.T2(...);
SET search_path='A,B,C,D';
SELECT * FROM C.T1, T2;

For T1, you'd register a hard dependency on C.T1 and no virtual dependencies 
since the table is explicitly qualified.

For T2, you'd register a hard dependency on C.T2 since that is the table that was selected for the query. You'd also register virtual dependencies on A.T2 and B.T2 since if either of those tables (or views) are created you need to recompile the 
statement. (Note that no virtual dependency is created on D.T2() since that table would never be selected by the compiler.)


The catch is that virtual dependencies would have to be recorded and searched 
as strings, not OIDs since the objects don't exist. Virtual dependencies only 
have to be checked during CREATE processing though, so that might not be too 
bad.

But this is getting off topic - I just wanted to capture the idea while it was 
rumbling around.


I think it will be enough to handle modification of the search path and
invalidate the prepared statement cache in this case.
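
A minimal sketch of that idea, reusing the existing search-path helpers from catalog/namespace.h; the struct and field names here are illustrative, loosely modeled on the plan_cache_entry from the attached patch:

#include "postgres.h"
#include "catalog/namespace.h"
#include "utils/memutils.h"

/* Hypothetical cache entry extension: remember the search path that was
 * active when the plan was cached. */
typedef struct cached_entry_sketch
{
	const char		   *query;
	OverrideSearchPath *search_path;	/* captured at caching time */
} cached_entry_sketch;

/* Capture the active search path when an entry is created. */
static void
remember_search_path(cached_entry_sketch *entry)
{
	entry->search_path = GetOverrideSearchPath(CacheMemoryContext);
}

/* Refuse to reuse an entry whose search path no longer matches the
 * current one; the caller can then drop or rebuild the cached plan. */
static bool
search_path_still_valid(cached_entry_sketch *entry)
{
	return OverrideSearchPathMatchesCurrent(entry->search_path);
}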

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Cached plans and statement generalization

2017-04-26 Thread Konstantin Knizhnik



On 26.04.2017 10:49, Konstantin Knizhnik wrote:



On 26.04.2017 04:00, Tsunakawa, Takayuki wrote: Are you considering
some upper limit on the number of prepared statements?
In this case we need some kind of LRU for maintaining the cache of
autoprepared statements.
I think that it is a good idea to have such a limited cache - it can
avoid the memory overflow problem.

I will try to implement it.


I attach a new patch which allows limiting the number of autoprepared
statements (the autoprepare_limit GUC variable).
Also I did more measurements, now with several concurrent connections
and read-only statements.
Results of pgbench with 10 connections, scale 10 and read-only
statements are below:


Protocol             TPS
extended             87k
prepared             209k
simple+autoprepare   206k


As you can see, autoprepare provides more than a 2 times speed improvement.

Also I tried to measure the overhead of parsing (to be able to substitute
all literals, not only string literals).

I just added an extra call of pg_parse_query. Speed is reduced to 181k.
So the overhead is noticeable, but it still makes such an optimization useful.
This is why I want to ask the question: is it better to implement the slower
but safer and more universal solution?


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index f6be98b..0c9abfc 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -188,6 +188,7 @@ static bool IsTransactionStmtList(List *parseTrees);
 static void drop_unnamed_stmt(void);
 static void SigHupHandler(SIGNAL_ARGS);
 static void log_disconnections(int code, Datum arg);
+static bool exec_cached_query(const char *query_string);
 
 
 /* 
@@ -916,6 +917,14 @@ exec_simple_query(const char *query_string)
 	drop_unnamed_stmt();
 
 	/*
+	 * Try to find cached plan
+	 */
+	if (autoprepare_threshold != 0 && exec_cached_query(query_string))
+	{
+		return;
+	}
+
+	/*
 	 * Switch to appropriate context for constructing parsetrees.
 	 */
 	oldcontext = MemoryContextSwitchTo(MessageContext);
@@ -4500,3 +4509,606 @@ log_disconnections(int code, Datum arg)
 	port->user_name, port->database_name, port->remote_host,
   port->remote_port[0] ? " port=" : "", port->remote_port)));
 }
+
+typedef struct { 
+	char const*   query;
+	dlist_node    lru;
+	int64 exec_count;
+	CachedPlanSource* plan;	
+	int   n_params;
+	int16 format;
+	bool  disable_autoprepare;
+} plan_cache_entry;
+
+/*
+ * Replace string literals with parameters. We do not consider integer or real literals, to avoid problems with
+ * negative numbers, user defined operators, ... For example, it is not easy to distinguish the cases (-1), (1-1), (1-1)-1
+ */
+static void generalize_statement(const char *query_string, char** gen_query, char** query_params, int* n_params)
+{
+	size_t query_len = strlen(query_string);
+	char const* src = query_string;
+	char* dst;
+	char* params;
+	unsigned char ch;
+
+	*n_params = 0;
+
+	*gen_query = (char*)palloc(query_len*2); /* assume that we have less than 1000 parameters, the worst case is replacing '' with $999 */
+	*query_params = (char*)palloc(query_len + 1);
+	dst = *gen_query;
+	params = *query_params;
+
+	while ((ch = *src++) != '\0') { 
+		if (isspace(ch)) { 
+			/* Replace sequence of whitespaces with just one space */
+			while (*src && isspace(*(unsigned char*)src)) { 
+src += 1;
+			}
+			*dst++ = ' ';
+		} else if (ch == '\'') { 
+			while (true) { 
+ch = *src++;
+if (ch == '\'') { 
+	if (*src != '\'') { 
+		break;
+	} else {
+		/* escaped quote */
+		*params++ = '\'';
+		src += 1;
+	}
+} else { 
+	*params++ = ch;
+}
+			}
+			*params++ = '\0';
+			dst += sprintf(dst, "$%d", ++*n_params);
+		} else { 
+			*dst++ = ch;
+		}
+	}			
+	Assert(dst <= *gen_query + query_len*2);
+	Assert(params <= *query_params + query_len);
+	*dst = '\0';
+}
+
+static uint32 plan_cache_hash_fn(const void *key, Size keysize)
+{
+	return string_hash(((plan_cache_entry*)key)->query, 0);
+}
+
+static int plan_cache_match_fn(const void *key1, const void *key2, Size keysize)
+{
+	return strcmp(((plan_cache_entry*)key1)->query, ((plan_cache_entry*)key2)->query);
+}
+
+static void* plan_cache_keycopy_fn(void *dest, const void *src, Size keysize)
+{ 
+	((plan_cache_entry*)dest)->query = pstrdup(((plan_cache_entry*)src)->query);
+return dest;
+}
+
+#define PLAN_CACHE_SIZE 113
+
+size_t nPlanCacheHits;
+size_t nPlanCacheMisses;
+
+/*
+ * Try to generalize query, find cached plan for it and execute
+ */
+static bool exec_cached_query(const char *query_string)
+{
+	CommandDest   dest = whereToSendOutput;
+	DestReceiver *receiver;
+	char *gen_query;
+	char 

Re: [HACKERS] Cached plans and statement generalization

2017-04-26 Thread Konstantin Knizhnik



On 26.04.2017 04:00, Tsunakawa, Takayuki wrote:

From: pgsql-hackers-ow...@postgresql.org

[mailto:pgsql-hackers-ow...@postgresql.org] On Behalf Of Konstantin
Knizhnik
Well, first of all I want to share results I already get: pgbench with default
parameters,  scale 10 and one connection:

So autoprepare is as efficient as explicit prepare and can increase
performance almost two times.

This sounds great.

BTW, when are the autoprepared statements destroyed?
Right now they are destroyed only upon receiving an invalidation
message (when the catalog is changed).
Prepared statements are local to a backend and are located in the backend's
memory.
It is unlikely that there will be so many different queries that they
cause memory overflow.

But in theory such a situation is certainly possible.



  Are you considering some upper limit on the number of prepared statements?
In this case we need some kind of LRU for maintaining the cache of
autoprepared statements.
I think that it is a good idea to have such a limited cache - it can avoid
the memory overflow problem.

I will try to implement it.



Regards
Takayuki Tsunakawa




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cached plans and statement generalization

2017-04-26 Thread Konstantin Knizhnik



On 26.04.2017 01:34, Andres Freund wrote:

Hi,

(FWIW, on this list we don't do top-quotes)

On 2017-04-25 22:21:22 +, Doug Doole wrote:

Plan invalidation was no different than for any SQL statement. DB2 keeps a
list of the objects the statement depends on. If any of the objects changes
in an incompatible way the plan is invalidated and kicked out of the cache.

I suspect what is more interesting is plan lookup. DB2 has something called
the "compilation environment". This is a collection of everything that
impacts how a statement is compiled (SQL path, optimization level, etc.).
Plan lookup is done using both the statement text and the compilation
environment. So, for example, if my path is DOUG, MYTEAM, SYSIBM and your
path is ANDRES, MYTEAM, SYSIBM we will have different compilation
environments. If we both issue "SELECT * FROM T" we'll end up with
different cache entries even if T in both of our statements resolves to
MYTEAM.T. If I execute "SELECT * FROM T", change my SQL path and then
execute "SELECT * FROM T" again, I have a new compilation environment so
the second invocation of the statement will create a new entry in the
cache. The first entry is not kicked out - it will still be there for
re-use if I change my SQL path back to my original value (modulo LRU for
cache memory management of course).

It's not always that simple, at least in postgres, unless you disregard
search_path.  Consider e.g. cases like

CREATE SCHEMA a;
CREATE SCHEMA b;
CREATE TABLE a.foobar(somecol int);
SET search_path = 'b,a';
SELECT * FROM foobar;
CREATE TABLE b.foobar(anothercol int);
SELECT * FROM foobar; -- may not be cached plan from before!

it sounds - my memory of DB2 is very faint, and I never used it much -
like similar issues could arise in DB2 too?


There is the same problem with explicitly prepared statements, isn't there?
Certainly, when using prepared statements it is the responsibility of the
programmer to avoid such collisions.

In case of autoprepare it is hidden from the programmer.
But there is a GUC variable controlling the autoprepare feature, and by
default it is switched off.
So if a programmer or DBA enables it, then they should take into account the
effects of such a decision.


By the way, isn't it a bug in PostgreSQL that altering the search path does
not invalidate cached plans?
As I already mentioned, the same problem can be reproduced with 
explicitly prepared statements.






Greetings,

Andres Freund


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cached plans and statement generalization

2017-04-26 Thread Konstantin Knizhnik



On 26.04.2017 00:47, Andres Freund wrote:

On 2017-04-25 21:11:08 +, Doug Doole wrote:

When I did this in DB2, I didn't use the parser - it was too expensive. I
just tokenized the statement and used some simple rules to bypass the
invalid cases. For example, if I saw the tokens "ORDER" and "BY" then I'd
disallow replacement until I hit the end of the current
subquery or statement.

How did you manage plan invalidation and such?


The same mechanism as for prepared statements:
cached plans are linked into the list by the SaveCachedPlan function and are
invalidated by PlanCacheRelCallback.





- Andres




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cached plans and statement generalization

2017-04-25 Thread Konstantin Knizhnik

On 04/25/2017 11:40 PM, Serge Rielau wrote:



On Apr 25, 2017, at 1:37 PM, Konstantin Knizhnik <k.knizh...@postgrespro.ru 
<mailto:k.knizh...@postgrespro.ru>> wrote:



SELECT ‘hello’::CHAR(10) || ‘World’, 5 + 6;

You can substitute ‘hello’, ‘World’, 5, and 6. But not 10.


I am substituting only string literals. So the query above will be transformed 
to

SELECT $1::CHAR(10) || $2, 5 + 6;

What's wrong with it?


Oh, well that leaves a lot of opportunities on the table, doesn’t it?


Well, actually my primary intention was not to make badly designed programs
(not using prepared statements) work faster.
I wanted to address cases where it is not possible to use prepared statements.
If we want to substitute as many literals as possible with parameters, then
parsing and deparsing the tree seems to be the only reasonable approach.
I will also try to implement it, just to estimate the parsing overhead.





Cheers
Serge




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Cached plans and statement generalization

2017-04-25 Thread Konstantin Knizhnik

On 04/25/2017 08:09 PM, Serge Rielau wrote:


On Tue, Apr 25, 2017 at 9:45 AM, Konstantin Knizhnik 
<k.knizh...@postgrespro.ru> wrote:

On 25.04.2017 19:12, Serge Rielau wrote:


On Apr 25, 2017, at 8:11 AM, Konstantin Knizhnik <k.knizh...@postgrespro.ru 
<mailto:k.knizh...@postgrespro.ru>> wrote:
Another problem is caused by using integer literals in context where 
parameters can not be used, for example "order by 1”.

You will also need to deal with modifiers in types such as VARCHAR(10). 
Not sure if there are specific functions which can only deal with literals (?) 
as well.

Sorry, I do not completely understand how the presence of type modifiers can
affect string literals used in a query.
Can you provide me with an example?

SELECT ‘hello’::CHAR(10) || ‘World’, 5 + 6;

You can substitute ‘hello’, ‘World’, 5, and 6. But not 10.


I am substituting only string literals. So the query above will be transformed 
to

SELECT $1::CHAR(10) || $2, 5 + 6;

What's wrong with it?



Also some OLAP syntax like “rows preceding”

It pretty much boils down to whether you can do some shallow parsing rather 
than expending the effort to build the parse tree.

Cheers
Serge



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Cached plans and statement generalization

2017-04-25 Thread Konstantin Knizhnik

On 04/25/2017 07:54 PM, David Fetter wrote:

On Tue, Apr 25, 2017 at 06:11:09PM +0300, Konstantin Knizhnik wrote:

On 24.04.2017 21:43, Andres Freund wrote:

Hi,

On 2017-04-24 11:46:02 +0300, Konstantin Knizhnik wrote:

So what I am thinking now is implicit query caching. If the same query with
different literal values is repeated many times, then we can try to
generalize this query and replace it with prepared query with
parameters.

That's not actually all that easy:
- You pretty much do parse analysis to be able to do an accurate match.
How much overhead is parse analysis vs. planning in your cases?
- The invalidation infrastructure for this, if not tied to to fully
parse-analyzed statements, is going to be hell.
- Migrating to parameters can actually cause significant slowdowns, not
nice if that happens implicitly.

Well, first of all I want to share results I already get: pgbench with
default parameters,  scale 10 and one connection:

protocol               TPS
simple                 3492
extended               2927
prepared               6865
simple + autoprepare   6844

If this is string mashing on the unparsed query, as it appears to be,
it's going to be a perennial source of security issues.


Sorry, maybe I missed something, but I can not understand how security can be violated by extracting string literals from a query. I am just copying bytes from one buffer to another. I do not try to interpret these parameters in any way. What I am doing is
very similar to standard prepared statements.

And moreover, the query is parsed! Only a query which was already parsed and executed
(but with different values of parameters) can be autoprepared.




Best,
David.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Cached plans and statement generalization

2017-04-25 Thread Konstantin Knizhnik



On 25.04.2017 19:12, Serge Rielau wrote:


On Apr 25, 2017, at 8:11 AM, Konstantin Knizhnik 
<k.knizh...@postgrespro.ru <mailto:k.knizh...@postgrespro.ru>> wrote:
Another problem is caused by using integer literals in context where 
parameters can not be used, for example "order by 1”.
You will also need to deal with modifiers in types such as 
VARCHAR(10). Not sure if there are specific functions which can only 
deal with literals (?) as well.


Sorry, I do not completely understand how the presence of type modifiers can
affect string literals used in a query.

Can you provide me with an example?



Doug Doole did this work in DB2 LUW and he may be able to point to 
more places to watch out for semantically.


Generally, in my experience, this feature is very valuable when 
dealing with (poorly designed) web apps that just glue together strings.


I do not think that this optimization will be useful only for poorly
designed applications.
I already pointed out two use cases where prepared statements can not be
used:

1. pgbouncer without session-level pooling.
2. partitioning

Protecting it under a GUC would allow to only do the work if it’s 
deemed likely to help.
Another rule I find useful is to abort any efforts to substitute 
literals if any bind variable is found in the query.
That can be used as a cue that the author of the SQL left the 
remaining literals in on purpose.


A follow up feature would be to formalize different flavors of peeking.
I.e. can you produce a generic plan, but still recruit the initial set 
of bind values/substituted literals to do costing?

Here the situation is the same as for explicitly prepared statements, isn't it?
Sometimes it is preferable to use a specialized plan rather than a generic
plan.

I am not sure if Postgres is currently able to do that.




Cheers
Serge Rielau
Salesforce.com <http://salesforce.com>

PS: FWIW, I like this feature.


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Cached plans and statement generalization

2017-04-25 Thread Konstantin Knizhnik

On 24.04.2017 21:43, Andres Freund wrote:

Hi,

On 2017-04-24 11:46:02 +0300, Konstantin Knizhnik wrote:

So what I am thinking now is implicit query caching. If the same query with
different literal values is repeated many times, then we can try to
generalize this query and replace it with prepared query with
parameters.

That's not actually all that easy:
- You pretty much do parse analysis to be able to do an accurate match.
   How much overhead is parse analysis vs. planning in your cases?
- The invalidation infrastructure for this, if not tied to to fully
   parse-analyzed statements, is going to be hell.
- Migrating to parameters can actually cause significant slowdowns, not
   nice if that happens implicitly.


Well, first of all I want to share the results I have already gotten: pgbench
with default parameters, scale 10 and one connection:


protocol               TPS
simple                 3492
extended               2927
prepared               6865
simple + autoprepare   6844


So autoprepare is as efficient as explicit prepare and can increase 
performance almost two times.


My current implementation replaces only string literals in the query with
parameters, i.e. select * from T where x='123'; -> select * from T where x=$1;
It greatly simplifies matching of parameters - it is just necessary to
locate the '\'' character and then correctly handle pairs of quotes.

Handling of integer and real literals is a really challenging task.
One source of problems is negation: it is not so easy to correctly
understand whether a minus should be treated as part of the literal or as an
operator:

(-1), (1-1), (1-1)-1
Another problem is caused by the use of integer literals in contexts where
parameters can not be used, for example "order by 1".


Fully correct substitution can be done by first parsing the query, then
transforming the parse tree, replacing literal nodes with parameter nodes,
and finally deparsing the tree into a generalized query. postgres_fdw
already contains such deparse code. It could be moved to the Postgres core
and reused for autoprepare (and maybe elsewhere).

But in this case the overhead will be much higher.
I still think that query parsing time is significantly smaller than the time
needed for building and optimizing the query execution plan.

But it should be measured, if the community is interested in such an approach.

There is an obvious question: how did I manage to get these pgbench results if
currently only substitution of string literals is supported and the queries
constructed by pgbench don't contain string literals? I just made a small
patch in pgbench's replaceVariable method, wrapping the value's representation
in quotes. It has almost no impact on performance (3482 TPS vs. 3492 TPS),

but it allows autoprepare to deal with pgbench queries.

I attached my patch to this mail. It is just the first version of the patch
(based on the REL9_6_STABLE branch), intended to illustrate the proposed approach.
I will be glad to receive any comments, and if such an optimization is
considered useful, I will continue work on this patch.


--

Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

diff --git a/src/backend/tcop/postgres.c b/src/backend/tcop/postgres.c
index f6be98b..6291d66 100644
--- a/src/backend/tcop/postgres.c
+++ b/src/backend/tcop/postgres.c
@@ -96,8 +96,6 @@ int			max_stack_depth = 100;
 /* wait N seconds to allow attach from a debugger */
 int			PostAuthDelay = 0;
 
-
-
 /* 
  *		private variables
  * 
@@ -188,6 +186,7 @@ static bool IsTransactionStmtList(List *parseTrees);
 static void drop_unnamed_stmt(void);
 static void SigHupHandler(SIGNAL_ARGS);
 static void log_disconnections(int code, Datum arg);
+static bool exec_cached_query(const char *query_string);
 
 
 /* 
@@ -916,6 +915,14 @@ exec_simple_query(const char *query_string)
 	drop_unnamed_stmt();
 
 	/*
+	 * Try to find cached plan
+	 */
+	if (autoprepare_threshold != 0 && exec_cached_query(query_string))
+	{
+		return;
+	}
+
+	/*
 	 * Switch to appropriate context for constructing parsetrees.
 	 */
 	oldcontext = MemoryContextSwitchTo(MessageContext);
@@ -4500,3 +4509,566 @@ log_disconnections(int code, Datum arg)
 	port->user_name, port->database_name, port->remote_host,
   port->remote_port[0] ? " port=" : "", port->remote_port)));
 }
+
+
+
+typedef struct { 
+	char const* query;
+	int64 exec_count;
+	CachedPlanSource* plan;	
+	int n_params;
+	int16 format;
+} plan_cache_entry;
+
+
+/*
+ * Replace string literals with parameters. We do not consider integer or real literals, to avoid problems with
+ * negative numbers, user defined operators, ... For example, it is not easy to distinguish the cases (-1), (1-1), (1-1)-1
+ */
+static void generalize_statement(const char *query_string, char** gen_query, char** query_params, int* n_params)
+{
+	size_t query_len = strlen(query_string);
+	cha

Re: [HACKERS] Cached plans and statement generalization

2017-04-24 Thread Konstantin Knizhnik



On 24.04.2017 13:24, Alexander Korotkov wrote:

Hi, Konstantin!

On Mon, Apr 24, 2017 at 11:46 AM, Konstantin Knizhnik 
<k.knizh...@postgrespro.ru <mailto:k.knizh...@postgrespro.ru>> wrote:


There were a lot of discussions about query plan caching in
hackers mailing list, but I failed to find some clear answer for
my question and the current consensus on this question in Postgres
community. As far as I understand current state is the following:
1. We have per-connection prepared statements.
2. Queries executed inside plpgsql code are implicitly prepared.

It is not always possible or convenient to use prepared statements.
For example, if pgbouncer is used to perform connection pooling.
Another use case (which is actually the problem I am trying to
solve now) is partitioning.
Efficient execution of query to partitioned table requires
hardcoded value for partitioning key.
Only in this case optimizer will be able to construct efficient
query plan which access only affected tables (partitions).

My small benchmark for distributed partitioned table based on
pg_pathman + postgres_fdw shows 3 times degrade of performance in
case of using prepared statements.
But without prepared statements substantial amount of time is
spent in query compilation and planning. I was be able to speed up
benchmark more than two time by
sending prepared queries directly to the remote nodes.


I don't think it's correct to ask PostgreSQL hackers about a problem
which arises with pg_pathman, since pg_pathman is an extension
supported by Postgres Pro.
Since we have declarative partitioning committed to 10, I think that the
community should address this issue in the context of declarative
partitioning.
However, it's unlikely we can spot this issue with declarative
partitioning because it still uses the very inefficient constraint
exclusion mechanism. Thus, the issues you are writing about would become
visible with declarative partitioning only when constraint exclusion
is replaced with something more efficient.


Long story short, could you reproduce this issue without pg_pathman?



Sorry, I mentioned pg_pathman just as an example.
The same problems occur with partitioning based on the standard Postgres
inheritance mechanism (when I manually create derived tables and specify
constraints for them).
I have not yet tested the declarative partitioning committed to 10, but I
expect that it will also suffer from this problem (because it is based on
inheritance).
But as I wrote, I think that the problem with plan caching is wider and is
not limited just to partitioning.





--
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com 
<http://www.postgrespro.com/>

The Russian Postgres Company




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



[HACKERS] Cached plans and statement generalization

2017-04-24 Thread Konstantin Knizhnik

Hi hackers,

There have been a lot of discussions about query plan caching on the hackers
mailing list, but I failed to find a clear answer to my question or the
current consensus on this question in the Postgres community. As far as I
understand, the current state is the following:

1. We have per-connection prepared statements.
2. Queries executed inside plpgsql code are implicitly prepared.

It is not always possible or convenient to use prepared statements,
for example, if pgbouncer is used to perform connection pooling.
Another use case (which is actually the problem I am trying to solve
now) is partitioning.
Efficient execution of a query against a partitioned table requires a
hardcoded value for the partitioning key.
Only in this case will the optimizer be able to construct an efficient query
plan which accesses only the affected tables (partitions).


My small benchmark for a distributed partitioned table based on pg_pathman
+ postgres_fdw shows a 3-times degradation of performance when using
prepared statements.
But without prepared statements a substantial amount of time is spent on
query compilation and planning. I was able to speed up the benchmark more
than two times by

sending prepared queries directly to the remote nodes.

So what I am thinking about now is implicit query caching. If the same query
with different literal values is repeated many times, then we can try to
generalize this query and replace it with a prepared query with
parameters. I am not considering a shared query cache now: it seems to be
much harder to implement. But local caching of generalized queries seems to
be not so difficult to implement, and it requires not so many changes in the
Postgres code. And it can be useful not only for sharding, but for many
other cases where prepared statements can not be used.


I wonder if such an option was already considered and, if it was rejected for
some reason, can you point me at those reasons?


--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] FDW and parallel execution

2017-04-11 Thread Konstantin Knizhnik



On 04.04.2017 13:29, Kyotaro HORIGUCHI wrote:

Hi,

At Sun, 02 Apr 2017 16:30:24 +0300, Konstantin Knizhnik <k.knizh...@postgrespro.ru> 
wrote in <58e0fcf0.2070...@postgrespro.ru>

Hi hackers, and personally Robert (you are the best expert in both
areas).
I want to ask one more question concerning parallel execution and FDW.
Below are two plans for the same query (TPC-H Q5): one for normal
tables, another for FDW to vertical representation of the same data.
FDW supports analyze function and is expected to produce the similar
statistic as for original tables.



The plans look very similar, but first one is parallel and second -
not.
My FDW provides implementation for IsForeignScanParallelSafe which
returns true.
I wonder what can prevent optimizer from using parallel plan in this
case?

Parallel execution requires partial paths. It's the work for
GetForeignPaths of your FDW.
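
(For illustration, a rough sketch of what adding such a partial path in GetForeignPaths might look like; the costs, the worker count and all names here are placeholders rather than code from any real FDW:

#include "postgres.h"
#include "foreign/fdwapi.h"
#include "optimizer/pathnode.h"

static void
mySketchGetForeignPaths(PlannerInfo *root, RelOptInfo *baserel, Oid foreigntableid)
{
	Cost		startup_cost = 10;	/* placeholder costs */
	Cost		total_cost = startup_cost + baserel->rows;
	ForeignPath *path;

	/* the ordinary, non-partial path is added with add_path() as usual */
	add_path(baserel, (Path *)
			 create_foreignscan_path(root, baserel, NULL, baserel->rows,
									 startup_cost, total_cost,
									 NIL, NULL, NULL, NIL));

	/* a partial path is what allows the planner to put a Gather on top */
	if (baserel->consider_parallel)
	{
		path = create_foreignscan_path(root, baserel, NULL, baserel->rows,
									   startup_cost, total_cost,
									   NIL, NULL, NULL, NIL);
		path->path.parallel_aware = true;	/* will use the DSM callbacks */
		path->path.parallel_safe = true;
		path->path.parallel_workers = 2;	/* illustrative */
		add_partial_path(baserel, (Path *) path);
	}
}
)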


Thank you very much for the explanation.
But unfortunately I still do not completely understand what kind of
queries allow parallel execution with an FDW.


Section "FDW Routines for Parallel Execution" of FDW specification says:
A ForeignScan node can, optionally, support parallel execution. A 
parallel ForeignScan will be executed in multiple processes and should 
return each row only once across all cooperating processes. To do 
this, processes can coordinate through fixed size chunks of dynamic 
shared memory. This shared memory is not guaranteed to be mapped at 
the same address in every process, so pointers may not be used. The 
following callbacks are all optional in general, but required if 
parallel execution is to be supported.


I provided IsForeignScanParallelSafe, EstimateDSMForeignScan,
InitializeDSMForeignScan and InitializeWorkerForeignScan in my FDW.

IsForeignScanParallelSafe returns true.
Also, in the GetForeignPaths function I created a path with
baserel->consider_parallel == true.

Is that enough, or should I do something else?

But unfortunately I failed to find any query - sequential scan, grand
aggregation, aggregation with group by, joins... - for which a parallel
execution plan is used with this FDW.
Also, there are no examples of using these functions in the Postgres
distribution, and I failed to find any such examples on the Internet.

Can somebody please clarify the situation with parallel execution and FDW,
and maybe point at some examples?

Thanks in advance.
Thank in advance.

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



[HACKERS] FDW and parallel execution

2017-04-02 Thread Konstantin Knizhnik
t=20.05..40.79 rows=15 width=36)
 Hash Cond: 
(nation.n_regionkey = region.r_regionkey)
 ->  Seq Scan on nation 
 (cost=0.00..17.70 rows=770 width=40)
 -> Hash  
(cost=20.00..20.00 rows=4 width=4)
->  Seq Scan on region  (cost=0.00..20.00 rows=4 width=4)
Filter: ((r_name)::text = 'ASIA'::text)
   ->  Hash  (cost=294718.76..294718.76 rows=2284376 
width=8)
 ->  Foreign Scan on orders_fdw 
(cost=0.00..294718.76 rows=2284376 width=8)
 ->  Hash  (cost=32605.64..32605.64 rows=1500032 width=8)
   ->  Foreign Scan on customer_fdw 
(cost=0.00..32605.64 rows=1500032 width=8)

The plans look very similar, but the first one is parallel and the second is not.
My FDW provides an implementation of IsForeignScanParallelSafe which returns true.
I wonder what can prevent the optimizer from using a parallel plan in this case?

Thank in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] Parallel query execution with SPI

2017-03-31 Thread Konstantin Knizhnik



On 31.03.2017 13:48, Robert Haas wrote:

On Fri, Mar 31, 2017 at 3:33 AM, Konstantin Knizhnik
<k.knizh...@postgrespro.ru> wrote:

It is possible to execute query concurrently using SPI?
If so, how it can be enforced?
I tried to open cursor with CURSOR_OPT_PARALLEL_OK flag but it doesn't help:
query is executed by single backend while the same query been launched at
top level uses parallel plan:

 fsstate->portal = SPI_cursor_open_with_args(NULL, fsstate->query,
fsstate->numParams, argtypes, values, nulls, true, CURSOR_OPT_PARALLEL_OK);
 ...
 SPI_cursor_fetch(fsstate->portal, true, 1);

Parallel execution isn't possible if you are using a cursor-type
interface, because a parallel query can't be suspended and resumed
like a non-parallel query.  If you use a function that executes the
query to completion in one go, like SPI_execute_plan, then it's cool.
See also commit 61c2e1a95f94bb904953a6281ce17a18ac38ee6d.


Thank you very much for the explanation.
When using SPI_execute the query is really executed in parallel.
But it means that when I am executing some query using SPI, I need to
somehow predict the number of returned tuples.
If it is not too many, then it is better to use SPI_execute to allow
parallel execution of the query.
But if it is large enough, then SPI_execute without a limit can cause
memory overflow.
Certainly, I can specify some reasonable limit and, if it is reached, then
use a cursor instead.

But that is neither convenient nor efficient.

I wonder if somebody can suggest a better solution?
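
(For reference, a minimal sketch of the two alternatives discussed here, assuming an SPI connection is already established; result processing is elided, and only the run-to-completion variant is eligible for a parallel plan:

#include "postgres.h"
#include "executor/spi.h"

/* 1. Run the query to completion: a parallel plan may be used, but the
 *    whole result set is materialized in SPI_tuptable. */
static void
run_to_completion(const char *query)
{
	if (SPI_execute(query, true /* read_only */, 0 /* no row limit */) < 0)
		elog(ERROR, "SPI_execute failed");
	/* ... walk SPI_tuptable->vals[0 .. SPI_processed - 1] ... */
}

/* 2. Fetch through a cursor in batches: memory stays bounded, but, as
 *    explained above, a suspendable cursor cannot run a parallel plan. */
static void
fetch_in_batches(const char *query)
{
	Portal		portal = SPI_cursor_open_with_args(NULL, query,
												   0, NULL, NULL, NULL,
												   true /* read_only */, 0);

	for (;;)
	{
		SPI_cursor_fetch(portal, true, 1000);
		if (SPI_processed == 0)
			break;
		/* ... walk SPI_tuptable ... */
	}
	SPI_cursor_close(portal);
}
)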

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Parallel query execution with SPI

2017-03-31 Thread Konstantin Knizhnik

Hi hackers,

Is it possible to execute a query in parallel using SPI?
If so, how can it be enforced?
I tried to open a cursor with the CURSOR_OPT_PARALLEL_OK flag but it doesn't
help: the query is executed by a single backend, while the same query
launched at top level uses a parallel plan:


fsstate->portal = SPI_cursor_open_with_args(NULL, fsstate->query, 
fsstate->numParams, argtypes, values, nulls, true, CURSOR_OPT_PARALLEL_OK);

...
SPI_cursor_fetch(fsstate->portal, true, 1);

Thanks in advance,

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] WIP: Faster Expression Processing v4

2017-03-13 Thread Konstantin Knizhnik



On 13.03.2017 11:03, Andres Freund wrote:

Hi,

On 2017-03-12 05:40:51 +0100, Tomas Vondra wrote:

I wanted to do a bit of testing and benchmarking on this, but 0004 seems to
be a bit broken.

Well, "broken" in the sense that it's already outdated, because other
stuff that got merged.



The patch does not apply anymore - there are some conflicts
in execQual.c, but I think I fixed those. But then I ran into a bunch of
compile-time errors, because some of the executor nodes still reference bits
that were moved elsewhere.

Updated patch attached.  Note that this patch has two changes I've not
yet evaluated performance-wise.



I got the following results on my system (Intel(R) Core(TM) i7-4770 
CPU @ 3.40GHz, 16GB RAM) for TPC-H Q1/Q6 at scale 10, with shared_buffers=8GB 
and pg_prewarm run on the lineitem table projection:



              Q1        Q6
Master        7503 ms   1171 ms
Your patch    6420 ms   1034 ms
VOPS           396 ms    249 ms
VOPS + patch   367 ms    233 ms



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



[HACKERS] Two questions about Postgres parser

2017-02-27 Thread Konstantin Knizhnik

Hi hackers,

While working on a vectorized extension for Postgres (VOPS), I ran into two 
things in the Postgres parser which break my expectations and force me to 
abandon my original implementation plan. I wonder if it is really 
intentional and correct that:


1. A moving-aggregate implementation must return the same result type as the 
plain implementation. Yes, in most cases it is hard to find arguments why they 
should return different types. But that is not true for vectorized 
operations...


2. Implicit user-defined type casts are not applied for the COALESCE operator:

create type complex as (x float8, y float8);
create function float2complex(x float8) returns complex as $$ 
declare c complex; begin c.x := x; c.y := 0; return c; end; $$ language 
plpgsql strict immutable;
create cast (float8 as complex) with function float2complex(float8) 
as implicit;

create table foo(c complex);
select coalesce(c, 0.0) from foo;
ERROR:  COALESCE types complex and numeric cannot be matched
LINE 1: select coalesce(c, 0.0) from foo;

select coalesce(c, 0.0::float8) from foo;
ERROR:  COALESCE types complex and double precision cannot be matched
LINE 1: select coalesce(c, 0.0::float8) from foo;

select coalesce(c, 0.0::float8::complex) from foo;
 coalesce
--
(0 rows)

--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] VOPS: vectorized executor for Postgres: how to speedup OLAP queries more than 10 times without changing anything in Postgres executor

2017-02-16 Thread Konstantin Knizhnik
More progress on the vectorized Postgres extension (VOPS): it is no longer 
required to use special functions in queries.
You can use vector operators in queries written in standard SQL and still get 
a ten-fold improvement on some queries.

The VOPS extension now uses the post-parse-analyze hook to transform queries.
I am really impressed by the flexibility and extensibility of the Postgres 
type system: user-defined types do most of the work.
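For reference, installing such a hook looks roughly like this - a minimal sketch in C, not the actual VOPS code; vops_post_parse_analyze() and its body are placeholders, and the hook signature is the one used by the PostgreSQL versions discussed here (later releases added an extra parameter):

#include "postgres.h"
#include "fmgr.h"
#include "parser/analyze.h"

PG_MODULE_MAGIC;

static post_parse_analyze_hook_type prev_post_parse_analyze_hook = NULL;

static void
vops_post_parse_analyze(ParseState *pstate, Query *query)
{
    /* chain to any previously installed hook */
    if (prev_post_parse_analyze_hook)
        prev_post_parse_analyze_hook(pstate, query);

    /* here the parsed query tree can be rewritten to reference the tiled
     * projection and the vector operators (details omitted) */
}

void
_PG_init(void)
{
    prev_post_parse_analyze_hook = post_parse_analyze_hook;
    post_parse_analyze_hook = vops_post_parse_analyze;
}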


It is still the responsibility of the programmer or database administrator to 
create proper projections
of the original table. These projections need to use tile types for some 
attributes (vops_float4, ...).
Then you can query this table using standard SQL, and the query will be 
executed using vector operations!


Examples of such TPC-H queries:

Q1:
select
l_returnflag,
l_linestatus,
sum(l_quantity) as sum_qty,
sum(l_extendedprice) as sum_base_price,
sum(l_extendedprice*(1-l_discount)) as sum_disc_price,
sum(l_extendedprice*(1-l_discount)*(1+l_tax)) as sum_charge,
avg(l_quantity) as avg_qty,
avg(l_extendedprice) as avg_price,
avg(l_discount) as avg_disc,
count(*) as count_order
from
vops_lineitem_projection
where
l_shipdate <= '1998-12-01'::date
group by
l_returnflag,
l_linestatus
order by
l_returnflag,
l_linestatus;



Q6:
select
sum(l_extendedprice*l_discount) as revenue
from
lineitem_projection
where
l_shipdate between '1996-01-01'::date and '1997-01-01'::date
and l_discount between 0.08 and 0.1
and l_quantity < 24;



On 13.02.2017 17:12, Konstantin Knizhnik wrote:

Hello hackers,

There were many discussions concerning possible ways of speeding up 
Postgres. Different approaches were suggested:


- JIT (now we have three different prototype implementations based on 
LLVM)

- Chunked (vectorized) executor
- Replacing pull with push
- Columnar store (cstore_fdw, IMCS)
- Optimizing and improving current executor (reducing tuple deform 
overhead, function call overhead,...)


Obviously the best result can be achieved by combining all of these 
approaches. But actually they are more or less interchangeable: 
vectorized execution does not eliminate interpretation overhead, but 
the overhead is divided by the vector size and becomes less critical.


I decided to write a small prototype to estimate the possible speed 
improvement of a vectorized executor. I created special types 
representing a "tile" and implemented standard SQL operators for them. So 
neither the Postgres planner, nor the Postgres executor, nor the Postgres 
heap manager is changed. But I was able to reach more than 10 times 
speed improvement on the TPC-H Q1/Q6 queries!


Please find more information here: 
https://cdn.rawgit.com/postgrespro/vops/ddcbfbe6/vops.html
The sources of the project can be found here: 
https://github.com/postgrespro/vops.git




--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company





Re: [HACKERS] Sum aggregate calculation for single precsion real

2017-02-14 Thread Konstantin Knizhnik



On 14.02.2017 16:59, Jim Nasby wrote:

On 2/13/17 10:45 AM, Konstantin Knizhnik wrote:

That is not true - please notice the query execution times of these two queries:


I bet you'd get even less difference if you simply cast to float8 
instead of adding 0.0. Same result, no floating point addition.



The expectation for SUM(float4) is that you want speed and are
prepared to cope with the consequences.  It's easy enough to cast your
input to float8 if you want a wider accumulator, or to numeric if
you'd like more stable (not necessarily more accurate :-() results.
I do not think it's the database's job to make those choices for you.


From my point of view this is a strange and wrong expectation.
I am choosing the "float4" type for a column just because it is enough to
represent the range of data I have and I need to minimize the record size.


In other words, you've decided to trade accuracy for performance...


I cannot agree with that...
1. If I choose the float4 type to store a bid price (which usually has 5-6 
significant digits), I do not lose precision and accuracy does not suffer.
Accuracy matters when I am calculating the sum of prices. But the 
assumption that the accuracy of the sum calculation should depend on the type 
of the summed field
is not obvious. Maybe it is more or less clear to C programmers, but 
not to SQL users.
In every database I have tested, SUM of single precision floats is 
calculated at least in double precision (or using the numeric type).


2. There is no huge gap in performance between accumulating in float4 
and float8 - certainly nothing like "orders of magnitude":

postgres=# select sum(l_quantity) from lineitem_projection;
 sum
-
 1.07374e+09
(1 row)

Time: 4659.509 ms (00:04.660)

postgres=# select sum(l_quantity::float8) from lineitem_projection;
sum

 1529738036
(1 row)

Time: 5465.320 ms (00:05.465)


So I do not think there is actually a compromise here between 
performance and accuracy.
But the current implementation leads to many confusions and 
contradictions with user expectations:


1. sum(l_quantity) and sum(l_quantity::float8) are 
completely different (by a factor of almost 1.5!!! - we lose half a billion dollars:) - see the small illustration below.
2. avg(l_quantity)*count(l_quantity) is not equal to sum(l_quantity), but 
when casting to float8 the results match.
3. The sum of aggregates per group is not equal to the total sum (once again, no 
problem for the float8 type).
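To illustrate point 1, here is a tiny standalone C program (not Postgres code; the constant 25 stands in for a typical l_quantity value) showing why a float4 accumulator saturates while a float8 accumulator does not:

#include <stdio.h>

int main(void)
{
    float  sum4 = 0.0f;   /* what accumulating in float4 effectively does */
    double sum8 = 0.0;    /* what a float8 transition state would do */
    long   i;

    for (i = 0; i < 60000000L; i++)   /* ~60M rows, as in TPC-H scale 10 lineitem */
    {
        float q = 25.0f;
        sum4 += q;        /* stalls once the increment falls below half an ULP of the sum */
        sum8 += q;
    }

    printf("float4 accumulator: %g\n", sum4);    /* about 5.4e+08 (2^29) */
    printf("float8 accumulator: %.0f\n", sum8);  /* exactly 1500000000 */
    return 0;
}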



But when I am calculating a sum, I expect to receive a more or less precise
result. Certainly I realize that even in case of using double it is


... but now you want to trade performance for accuracy? Why would you 
expect the database to magically come to that conclusion?


See above - there is no trade-off here. Please notice that the current Postgres 
implementation of the AVG aggregate calculates both the sum and the sum of 
squares, even though the latter is not needed for AVG.

The comment in the code says:

 * It might seem attractive to optimize this by having multiple accumulator
 * functions that only calculate the sums actually needed.  But on most
 * modern machines, a couple of extra floating-point multiplies will be
 * insignificant compared to the other per-tuple overhead, so I've chosen
 * to minimize code space instead.

And it is true!
In addition to the results above, here is the timing of the AVG 
calculation:


postgres=# select avg(l_quantity) from lineitem_projection;
   avg
--
 25.5015621964919
(1 row)

postgres=# select avg(l_quantity::float8) from lineitem_projection;
   avg
--
 25.5015621964919
(1 row)

Please notice that avg for float4 is calculated using float4_accum, which uses 
a float8 accumulator and also calculates sumX2!



Time: 6103.807 ms (00:06.104)



So I do not see reasonable arguments here for using float4pl for 
sum(float4)!

And I do not know any other database which has such strange behavior.
I know that "be like others", or especially "be like Oracle", is never a good 
argument in the Postgres community, but doing something differently (and 
IMHO wrongly) without any significant reason seems very strange.



--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


