Re: [Proposal] Add accumulated statistics for wait event

2020-01-15 Thread Imai Yoshikazu
On 2020/01/13 4:11, Pavel Stehule wrote:
> The following review has been posted through the commitfest application:
> make installcheck-world:  tested, passed
> Implements feature:   tested, passed
> Spec compliant:   not tested
> Documentation:tested, passed
> 
> I like this patch, because I used similar functionality some years ago very 
> successfully. The implementation is almost simple, and the result should be 
> valid by used method.

Thanks for your review!

> The potential problem is performance impact. Very early test show impact cca 
> 3% worst case, but I'll try to repeat these tests.

Yes, performance impact is the main concern. I want to know how it 
affects performance in various test cases and on various environments.

> There are some ending whitespaces and useless tabs.
> 
> The new status of this patch is: Waiting on Author
I attach v4 patches, which remove the trailing whitespace at the ends of 
lines and the useless tabs.

--
Yoshikazu Imai
From b009b1f8f6be47ae61b5e4538e2730d721ee60db Mon Sep 17 00:00:00 2001
From: "imai.yoshikazu" 
Date: Wed, 15 Jan 2020 09:13:19 +
Subject: [PATCH v4 1/2] Add pg_stat_waitaccum view.

pg_stat_waitaccum shows the count and duration of each wait event.
Each backend/background process counts wait events and measures their
time in every pgstat_report_wait_start and pgstat_report_wait_end,
stores that information in local variables, and sends it to the
statistics collector, from which we can retrieve it.

To reduce overhead, I implemented a static hash instead of a dynamic
hash. I also implemented track_wait_timing, which determines whether
wait event durations are collected or not.

On Windows, this feature might not work correctly, because it
currently initializes local variables in pg_stat_init, which is not
passed to forked processes on Windows.
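
To illustrate the accumulation described above, the bookkeeping looks
roughly like this (a simplified sketch using the instr_time macros, not
the patch's exact code; pgstat_get_wa_entry is a hypothetical lookup
helper for the static hash):

    /* at pgstat_report_wait_start(): remember when the wait began */
    INSTR_TIME_SET_CURRENT(waitStart);

    /* at pgstat_report_wait_end(): bump the counters for this event */
    PgStat_WaitAccumEntry *entry = pgstat_get_wa_entry(wa_hash,
                                                       wait_event_info);

    entry->calls++;
    if (pgstat_track_wait_timing)
    {
        instr_time  waitEnd;

        INSTR_TIME_SET_CURRENT(waitEnd);
        /* entry->times += (waitEnd - waitStart) */
        INSTR_TIME_ACCUM_DIFF(entry->times, waitEnd, waitStart);
    }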
---
 src/backend/catalog/system_views.sql  |   8 +
 src/backend/postmaster/pgstat.c   | 344 ++
 src/backend/storage/lmgr/lwlock.c |  19 ++
 src/backend/utils/adt/pgstatfuncs.c   |  80 ++
 src/backend/utils/misc/guc.c  |   9 +
 src/backend/utils/misc/postgresql.conf.sample |   1 +
 src/include/catalog/pg_proc.dat   |   9 +
 src/include/pgstat.h  | 123 -
 src/include/storage/lwlock.h  |   1 +
 src/include/storage/proc.h|   1 +
 src/test/regress/expected/rules.out   |   5 +
 11 files changed, 598 insertions(+), 2 deletions(-)

diff --git a/src/backend/catalog/system_views.sql b/src/backend/catalog/system_views.sql
index 773edf8..80f2caa 100644
--- a/src/backend/catalog/system_views.sql
+++ b/src/backend/catalog/system_views.sql
@@ -957,6 +957,14 @@ CREATE VIEW pg_stat_bgwriter AS
 pg_stat_get_buf_alloc() AS buffers_alloc,
 pg_stat_get_bgwriter_stat_reset_time() AS stats_reset;
 
+CREATE VIEW pg_stat_waitaccum AS
+SELECT
+   S.wait_event_type AS wait_event_type,
+   S.wait_event AS wait_event,
+   S.calls AS calls,
+   S.times AS times
+   FROM pg_stat_get_waitaccum(NULL) AS S;
+
 CREATE VIEW pg_stat_progress_vacuum AS
 SELECT
 S.pid AS pid, S.datid AS datid, D.datname AS datname,
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 51c486b..08e10ad 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -123,6 +123,7 @@
  */
 bool   pgstat_track_activities = false;
 bool   pgstat_track_counts = false;
+bool   pgstat_track_wait_timing = false;
 intpgstat_track_functions = TRACK_FUNC_OFF;
 intpgstat_track_activity_query_size = 1024;
 
@@ -153,6 +154,10 @@ static time_t last_pgstat_start_time;
 
 static bool pgStatRunningInCollector = false;
 
+WAHash *wa_hash;
+
+instr_time waitStart;
+
 /*
  * Structures in which backends store per-table info that's waiting to be
  * sent to the collector.
@@ -255,6 +260,7 @@ static int  localNumBackends = 0;
  */
 static PgStat_ArchiverStats archiverStats;
 static PgStat_GlobalStats globalStats;
+static PgStat_WaitAccumStats waitAccumStats;
 
 /*
  * List of OIDs of databases we need to write out.  If an entry is InvalidOid,
@@ -280,6 +286,8 @@ static pid_t pgstat_forkexec(void);
 #endif
 
 NON_EXEC_STATIC void PgstatCollectorMain(int argc, char *argv[]) pg_attribute_noreturn();
+static void pgstat_init_waitaccum_hash(WAHash **hash);
+static PgStat_WaitAccumEntry *pgstat_add_wa_entry(WAHash *hash, uint32 key);
 static void pgstat_beshutdown_hook(int code, Datum arg);
 
 static PgStat_StatDBEntry *pgstat_get_db_entry(Oid databaseid, bool create);
@@ -287,8 +295,11 @@ static PgStat_StatTabEntry *pgstat_get_tab_entry(PgStat_StatDBEntry *dbentry,
 												 Oid tableoid, bool create);
 static void pgstat_write_s

Re: Planning counters in pg_stat_statements (using pgss_store)

2019-09-08 Thread Imai Yoshikazu
Hello

On 2019/09/06 23:19, Tomas Vondra wrote:
> On Wed, Sep 04, 2019 at 07:19:47PM +0300, Sergei Kornilov wrote:
>>
>> ...
>>
>> Results:
>>
>>     test |   mode   | average_tps | degradation_perc
>> --+--+-+--
>> head_no_pgss | extended |   13816 |    1.000
>> patch_not_loaded | extended |   13755 |    0.996
>> head_track_none  | extended |   13607 |    0.985
>> patch_track_none | extended |   13560 |    0.981
>> head_track_top   | extended |   13277 |    0.961
>> patch_track_top  | extended |   13189 |    0.955
>> patch_track_planning | extended |   12983 |    0.940
>> head_no_pgss | prepared |   29101 |    1.000
>> head_track_none  | prepared |   28510 |    0.980
>> patch_track_none | prepared |   28481 |    0.979
>> patch_not_loaded | prepared |   28382 |    0.975
>> patch_track_planning | prepared |   28046 |    0.964
>> head_track_top   | prepared |   28035 |    0.963
>> patch_track_top  | prepared |   27973 |    0.961
>> head_no_pgss | simple   |   16733 |    1.000
>> patch_not_loaded | simple   |   16552 |    0.989
>> head_track_none  | simple   |   16452 |    0.983
>> patch_track_none | simple   |   16365 |    0.978
>> head_track_top   | simple   |   15867 |    0.948
>> patch_track_top  | simple   |   15820 |    0.945
>> patch_track_planning | simple   |   15739 |    0.941
>>
>> So I found slight slowdown with track_planning = off compared to HEAD. 
>> Possibly just at the level of measurement error. I think this is ok.
>> track_planning = on also has no dramatic impact. In my opinion 
>> proposed design with pgss_store call is acceptable.
>>
> 
> FWIW I've done some benchmarking on this too, with a single pgbench client
> running select-only test on a tiny database, in different modes (simple,
> extended, prepared). I've done that on two systems with different CPUs
> (spreadsheet with results attached).

Referring to Sergei's results, a user who currently uses pgss with 
execution-time tracking and turns on the new feature will see a 0-2.2% 
performance regression, as shown below.

 >> head_track_top   | extended |   13277 |0.961
 >> patch_track_planning | extended |   12983 |0.940
 >> patch_track_planning | prepared |   28046 |0.964
 >> head_track_top   | prepared |   28035 |0.963
 >> head_track_top   | simple   |   15867 |0.948
 >> patch_track_planning | simple   |   15739 |0.941

If a user does not turn on track_planning, they will see a 0.2-0.6% 
performance regression, as shown below.

 >> head_track_top   | extended |   13277 |0.961
 >> patch_track_top  | extended |   13189 |0.955
 >> head_track_top   | prepared |   28035 |0.963
 >> patch_track_top  | prepared |   27973 |0.961
 >> head_track_top   | simple   |   15867 |0.948
 >> patch_track_top  | simple   |   15820 |0.945

> 
> I don't see any performance regression - there are some small variations
> in both directions (say, ~1%) but that's well within the noise. So I think
> the patch is fine in this regard.

+1


I also looked over the code and have one comment.

[0002 patch]
In pgss_planner_hook:

+   /* calc differences of buffer counters. */
+   bufusage = compute_buffer_counters(bufusage_start, pgBufferUsage);
+
+   /*
+    * we only store planning duration, query text has been initialized
+    * during previous pgss_post_parse_analyze as it not available inside
+    * pgss_planner_hook.
+    */
+   pgss_store(query_text,

Do we need to calculate bufusage here?
We only store the planning duration in the following pgss_store call.

--
Yoshikazu Imai


RE: seems like a bug in pgbench -R

2019-07-24 Thread Imai, Yoshikazu
On Wed, July 24, 2019 at 7:02 PM, Fabien COELHO wrote:
> > but I have one question. Is it better adding any check like if(maxsock
> > != -1) before the select?
> >
> > else/* no explicit delay, select without timeout */
> > {
> >nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL); }
> 
> I think that it is not necessary because this case cannot happen: If some
> clients are still running (remains > 0), they are either sleeping, in
> which case there would be a timeout, or they are waiting for something
> from the server, otherwise the script could be advanced further so there
> would be something else to do for the thread.

Ah, I understand.

> We could check this by adding "Assert(maxsock != -1);" before this select,
> but I would not do that for a released version.

Yeah, I also imagined we could use an Assert, but right, it's a released 
version. I got it. Thanks for pointing that out.

So I'll mark this ready for committer.

--
Yoshikazu Imai





RE: seems like a bug in pgbench -R

2019-07-24 Thread Imai, Yoshikazu
Hi Fabien,

On Fri, Mar 15, 2019 at 4:17 PM, Fabien COELHO wrote:
> >> echo 'select 1' > select.sql
> >> 
> >> while /bin/true; do
> >>  pgbench -n -f select.sql -R 1000 -j 8 -c 8 -T 1 > /dev/null 2>&1;
> >>  date;
> >> done;
> >
> > Indeed. I'll look at it over the weekend.
> >
> >> So I guess this is a bug in 12788ae49e1933f463bc59a6efe46c4a01701b76, or
> >> one of the other commits touching this part of the code.
> 
> I could not reproduce this issue on head, but I confirm on 11.2.

I could reproduce the hang on 11.4.

On Sat, Mar 16, 2019 at 10:14 AM, Fabien COELHO wrote:
> Attached is a fix to apply on pg11.

I confirmed the hang doesn't happen after applying your patch.

It passes make check-world.

This change doesn't seem to affect performance, so I didn't run any
performance tests.

> + /* under throttling we may have finished the last client above 
> */
> + if (remains == 0)
> + break;

If there are only CSTATE_WAIT_RESULT, CSTATE_SLEEP, or CSTATE_THROTTLE clients,
a thread needs to wait for results or sleep. In that logic, there is a case
where a thread tries to wait for results although no client is actually waiting
for results, and this causes the issue. It happens when there are only
CSTATE_THROTTLE clients and the pgbench timeout occurs: those clients are
finished, and "remains" becomes 0.

I confirmed the code above prevents such a case.


I almost think this is ready for committer, but I have one question.

Would it be better to add a check like if (maxsock != -1) before the select?


else    /* no explicit delay, select without timeout */
{
nsocks = select(maxsock + 1, &input_mask, NULL, NULL, NULL);
}
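
To illustrate why that could hang, here is a minimal standalone program
(my own toy example, not pgbench code): if no client added a socket,
maxsock stays -1, so select() is called with nfds == 0, an empty fd
set, and no timeout, and it never returns.

    #include <sys/select.h>

    int
    main(void)
    {
        fd_set      input_mask;
        int         maxsock = -1;   /* no client is waiting on a socket */

        FD_ZERO(&input_mask);

        /* nfds == 0, empty set, timeout == NULL: blocks forever */
        return select(maxsock + 1, &input_mask, NULL, NULL, NULL);
    }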

--
Yoshikazu Imai




RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-04 Thread Imai, Yoshikazu
On Fri, Apr 5, 2019 at 0:05 AM, Tsunakawa, Takayuki wrote:
> From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> > I can't detect any performance improvement with the patch applied to
> > current master, using the test case from Yoshikazu Imai (2019-03-19).
> 
> That's strange...  Peter, Imai-san, can you compare your test procedures?

Just to make sure, I've described my test procedure in detail.

I installed and set up HEAD and the patched build as follows.

[HEAD(413ccaa)]
(git pull)
./configure --prefix=/usr/local/pgsql-dev --enable-depend
make clean
make

make install

su postgres
export PATH=/usr/local/pgsql-dev/bin:$PATH
initdb -D /var/lib/pgsql/data-dev
vi /var/lib/pgsql/data-dev/postgresql.conf

port = 44201
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096

pg_ctl -D /var/lib/pgsql/data-dev start



[HEAD(413ccaa) + patch]
(git pull)
patch -u -p1 < 0002.patch
./configure --prefix=/usr/local/pgsql-locallock --enable-depend
make clean
make

make install

su postgres
export PATH=/usr/local/pgsql-locallock/bin:$PATH
initdb -D /var/lib/pgsql/data-locallock
vi /var/lib/pgsql/data-locallock/postgresql.conf

port = 44301
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096

pg_ctl -D /var/lib/pgsql/data-locallock start


And I tested as follows.

(creating partitioned table for port 44201)
(creating partitioned table for port 44301)
(creating select4096.sql)
for i in `seq 1 5`; do
  pgbench -n -f select4096.sql -T 60 -M prepared -p 44201 | grep including;
  pgbench -n -f select4096.sql -T 60 -M prepared -p 44301 | grep including;
done
tps = 8146.039546 (including connections establishing)
tps = 9021.340872 (including connections establishing)
tps = 8011.186017 (including connections establishing)
tps = 8926.191054 (including connections establishing)
tps = 8006.769690 (including connections establishing)
tps = 9028.716806 (including connections establishing)
tps = 8057.709961 (including connections establishing)
tps = 9017.713714 (including connections establishing)
tps = 7956.332863 (including connections establishing)
tps = 9126.650533 (including connections establishing)


Thanks
--
Yoshikazu Imai


RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-04-04 Thread Imai, Yoshikazu
On Fri, Apr 5, 2019 at 1:31 AM, Amit Langote wrote:
> On 2019/04/05 5:42, Peter Eisentraut wrote:
> > On 2019-04-04 06:58, Amit Langote wrote:
> >> Also, since the "speed up partition planning" patch went in
> >> (428b260f8), it might be possible to see the performance boost even
> >> with the partitioning example you cited upthread.
> >
> > I can't detect any performance improvement with the patch applied to
> > current master, using the test case from Yoshikazu Imai (2019-03-19).
> 
> I was able to detect it as follows.
> 
> * partitioned table setup:
> 
> $ cat ht.sql
> drop table ht cascade;
> create table ht (a int primary key, b int, c int) partition by hash (a);
> select 'create table ht' || x::text || ' partition of ht for values with
> (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,
> 8191) x;
> \gexec
> 
> * pgbench script:
> 
> $ cat select.sql
> \set param random(1, 8192)
> select * from ht where a = :param
> 
> * pgbench (5 minute run with -M prepared)
> 
> pgbench -n -M prepared -T 300 -f select.sql
> 
> * tps:
> 
> plan_cache_mode = auto
> 
>HEAD: 1915 tps
> Patched: 2394 tps
> 
> plan_cache_mode = custom (non-problematic: generic plan is never created)
> 
>HEAD: 2402 tps
> Patched: 2393 tps

Amit-san, thanks for testing this.

I also re-ran my tests (3/19) with HEAD (413ccaa) and HEAD (413ccaa) + patch, 
and I can still detect the performance difference with plan_cache_mode = auto.

Thanks
--
Yoshikazu Imai 



Re: speeding up planning with partitions

2019-03-30 Thread Imai Yoshikazu
On 2019/03/31 1:06, Amit Langote wrote:
 > On Sun, Mar 31, 2019 at 12:11 AM Tom Lane  wrote:
 >> Amit Langote  writes:
 >>> I think the performance results did prove that degradation due to
 >>> those loops over part_rels becomes significant for very large
 >>> partition counts.  Is there a better solution than the bitmapset that
 >>> you have in mind?
 >>
 >> Hm, I didn't see much degradation in what you posted in
 >> <5c83dbca-12b5-1acf-0e85-58299e464...@lab.ntt.co.jp>.
 >
 > Sorry that I didn't mention the link to begin with, but I meant to
 > point to numbers that I reported on Monday this week.
 >
 > 
https://www.postgresql.org/message-id/19f54c17-1619-b228-10e5-ca343be6a4e8%40lab.ntt.co.jp
 >
 > You were complaining of the bitmapset being useless overhead for small
 > partition counts, but the numbers I get tend to suggest that any
 > degradation in performance is within noise range, whereas the
 > performance benefit from having them looks pretty significant for very
 > large partition counts.
 >
 >> I am curious as to why there seems to be more degradation
 >> for hash cases, as per Yoshikazu-san's results in
 >> <0F97FA9ABBDBE54F91744A9B37151A512BAC60@g01jpexmbkw24>,
 >> but whatever's accounting for the difference probably
 >> is not that.
 >
 > I suspected it may have been the lack of bitmapsets, but maybe only
 > Imai-san could've confirmed that by applying the live_parts patch too.

Yeah, I forgot to apply the live_parts patch. I did the same test again that 
I did for hash before.
(BTW, thanks for committing the speedup patches!)

[HEAD(428b260)]
nparts  TPS
======  =====
2:      13134 (13240, 13290, 13071, 13172, 12896)
1024:   12627 (12489, 12635, 12716, 12732, 12562)
8192:   10289 (10216, 10265, 10171, 10278, 10514)

[HEAD(428b260) + live_parts.diff]
nparts  TPS
======  =====
2:      13277 (13112, 13290, 13241, 13360, 13382)
1024:   12821 (12930, 12849, 12909, 12700, 12716)
8192:   11102 (11134, 11158, 11112, 10997, 11109)


The performance degradations are as below.


My test results from above (with live_parts, HEAD(428b260) + 
live_parts.diff)
nparts   live_parts   HEAD
======   ==========   =====
2:       13277        13134
1024:    12821        12627
8192:    11102        10289

11102/13277 = 83.6 %


Amit-san's test results (with live_parts)
 > nparts   v38    HEAD
 > ======   ====   ====
 > 2        2971   2969
 > 8        2980   1949
 > 32       2955   733
 > 128      2946   145
 > 512      2924   11
 > 1024     2986   3
 > 4096     2702   0
 > 8192     2531   OOM

2531/2971 = 85.2 %


My test results I posted before (without live_parts)
 > nparts   v38    HEAD
 > ======   =====  =====
 > 0:       10538  10487
 > 2:       6942   7028
 > 4:       7043   5645
 > 8:       6981   3954
 > 16:      6932   2440
 > 32:      6897   1243
 > 64:      6897   309
 > 128:     6753   120
 > 256:     6727   46
 > 512:     6708   12
 > 1024:    6063   3
 > 2048:    5894   1
 > 4096:    5374   OOM
 > 8192:    4572   OOM

4572/6942 = 65.9 %


Certainly, using the bitmapset improves performance when scanning one 
partition (or a few partitions) out of many.
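
For reference, the two loop shapes being compared look roughly like this
(an illustrative sketch, not the attached diff itself):

    /* before: visit every slot of part_rels, skipping pruned children */
    for (i = 0; i < rel->nparts; i++)
    {
        RelOptInfo *part_rel = rel->part_rels[i];

        if (part_rel == NULL)
            continue;
        /* ... per-partition work ... */
    }

    /* after: visit only partitions recorded in the live_parts bitmapset */
    i = -1;
    while ((i = bms_next_member(rel->live_parts, i)) >= 0)
    {
        RelOptInfo *part_rel = rel->part_rels[i];

        /* ... per-partition work ... */
    }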


Thanks
--
Imai Yoshikazu
diff --git a/src/backend/optimizer/path/joinrels.c b/src/backend/optimizer/path/joinrels.c
index 34cc7da..e847655 100644
--- a/src/backend/optimizer/path/joinrels.c
+++ b/src/backend/optimizer/path/joinrels.c
@@ -1516,6 +1516,9 @@ try_partitionwise_join(PlannerInfo *root, RelOptInfo *rel1, RelOptInfo *rel2,
 		populate_joinrel_with_paths(root, child_rel1, child_rel2,
 									child_joinrel, child_sjinfo,
 									child_restrictlist);
+		if(!IS_DUMMY_REL(child_joinrel))
+			joinrel->live_parts = bms_add_member(joinrel->live_parts,
+												 cnt_parts);
 	}
 }
 
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index f08f1cd..9ddf42a 100644
--- a/src/backend/optimizer/plan/planner.c
+++ b/src/backend/optimizer/plan/planner.c
@@ -7107,7 +7107,9 @@ apply_scanjoin_target_to_paths(PlannerInfo *root,
int partition_idx;
 
/* Adjust each partition. */
-	for (partition_idx = 0; partition_idx < rel->nparts; partition_idx++)
+   partition_idx = -1;
+   while ((partition_idx = bms_next_member(rel->live_parts,
+   

RE: speeding up planning with partitions

2019-03-29 Thread Imai, Yoshikazu
On Fri, Mar 29, 2019 at 3:45 PM, Amit Langote wrote:
> Thanks a lot for hacking on the patch.  I'm really happy with the direction
> you took for inheritance_planner, as it allows UPDATE/DELETE to use
> partition pruning.

I was astonished by Tom's awesome work and am really thankful to him.

> Certainly.  Note that previously we'd always scan *all* hash partitions
> for UPDATE and DELETE queries, because constraint exclusion can't exclude
> hash partitions due to the shape of their partition constraint.
> 
> I ran my usual benchmark with up to 8192 partitions.
> 
> N: 2..8192
> 
> create table rt (a int, b int, c int) partition by range (a); select 'create
> table rt' || x::text || ' partition of rt for values from (' || (x)::text
> || ') to (' || (x+1)::text || ');' from generate_series(1,
> N) x;
> \gexec
> 
> update.sql:
> 
> \set param random(1, N)
> update rt set a = 0 where a = :param;
> 
> pgbench -n -T 120 -f select.sql
> 
> nparts   v38    HEAD
> ======   ====   ====
> 2        2971   2969
> 8        2980   1949
> 32       2955   733
> 128      2946   145
> 512      2924   11
> 1024     2986   3
> 4096     2702   0
> 8192     2531   OOM
> 
> Obviously, you'll get similar numbers with hash or list partitioning.

I also ran the test for hash partitioning just to make sure.


N: 2..8192

create table ht (a int, b int, c int) partition by hash (a);
select 'create table ht' || x::text ||
' partition of ht for values with (MODULUS N, REMAINDER ' || (x)::text || ');'
from generate_series(0, N-1) x;
\gexec

update.sql:

\set param random(1, N * 100)
update ht set b = b + 1 where a = :param;

pgbench -n -T 60 -f update.sql


[updating one partition]
nparts   v38    HEAD
======   =====  =====
0:       10538  10487
2:       6942   7028
4:       7043   5645
8:       6981   3954
16:      6932   2440
32:      6897   1243
64:      6897   309
128:     6753   120
256:     6727   46
512:     6708   12
1024:    6063   3
2048:    5894   1
4096:    5374   OOM
8192:    4572   OOM


The performance for hash is also improved, though the rate at which performance 
drops with many partitions seems higher than for range partitioning.

Thanks
--
Imai Yoshikazu





RE: speeding up planning with partitions

2019-03-26 Thread Imai, Yoshikazu
Amit-san,

On Tue, Mar 26, 2019 at 7:17 AM, Amit Langote wrote:
> Rebased patches attached.

I could only do the code review of v36-0001.

On Mon, Mar 25, 2019 at 11:35 AM, Amit Langote wrote:
> On 2019/03/23 6:07, Tom Lane wrote:
> > Amit Langote  writes:
> >> [ v34 patch set ]
> > 
> > I had a bit of a look through this.  I went ahead and pushed 0008 and
> > 0009, as they seem straightforward and independent of the rest.  (BTW,
> > 0009 makes 0003's dubious optimization in set_relation_partition_info
> > quite unnecessary.)  As for the rest:
> > 
> > 0001: OK in principle, but I wonder why you implemented it by adding
> > another recursive scan of the jointree rather than just iterating
> > over the baserel array, as in make_one_rel() for instance.  Seems
> > like this way is more work, and more code that we'll have to touch
> > if we ever change the jointree representation.
> 
> I've changed this to work by iterating over baserel array.  I was mostly
> worried about looping over potentially many elements of baserel array that
> we simply end up skipping, but considering the other patch that delays
> adding inheritance children, we don't need to worry about that.
>
> > I also feel like you used a dartboard while deciding where to insert the
> > call in query_planner(); dropping it into the middle of a sequence of
> > equivalence-class-related operations seems quite random.  Maybe we could
> > delay that step all the way to just before make_one_rel, since the other
> > stuff in between seems to only care about baserels?  For example,
> > it'd certainly be better if we could run remove_useless_joins before
> > otherrel expansion, so that we needn't do otherrel expansion at all
> > on a table that gets thrown away as being a useless join.
> 
> create_lateral_join_info() expects otherrels to be present to propagate
> the information it creates.
> 
> I have moved add_other_rels_to_query() followed by
> create_lateral_join_info() to just before make_one_rel().

I checked that the 0001 patch correctly implements what was discussed above.

The only problem I found is a little typo:
s/propgated/propagated/

I'm really sorry I'm short of time to review more for now...

--
Yoshikazu Imai


RE: speeding up planning with partitions

2019-03-24 Thread Imai, Yoshikazu
On Fri, Mar 22, 2019 at 9:07 PM, Tom Lane wrote:
> BTW, it strikes me that we could take advantage of the fact that baserels
> must all appear before otherrels in the rel array, by having loops over
> that array stop early if they're only interested in baserels.  We could
> either save the index of the last baserel, or just extend the loop logic
> to fall out upon hitting an otherrel.
> Seems like this'd save some cycles when there are lots of partitions,
> although perhaps loops like that would be fast enough to not matter.

Actually, this speeds up planning time a little when scanning a lot of otherrels, 
e.g. selecting from thousands of partitions. I tested as below.

* rt with 8192 partitions
* execute "select * from rt;" for 60 seconds.

[results]
HEAD: 4.19 TPS (4.06, 4.34, 4.17)
(v34 patches) + (looping over only baserels): 4.26 TPS (4.31, 4.28, 4.19)


Attached is the diff I used for this test.
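
The idea being tested is roughly the following (my sketch of the loop
shape, not the attached diff itself): since baserels all precede
otherrels in simple_rel_array, a loop that only cares about baserels
can fall out at the first otherrel.

    for (rti = 1; rti < root->simple_rel_array_size; rti++)
    {
        RelOptInfo *rel = root->simple_rel_array[rti];

        if (rel == NULL)
            continue;       /* empty slot, e.g. for a non-base RTE */
        if (rel->reloptkind == RELOPT_OTHER_MEMBER_REL)
            break;          /* no baserels appear past this point */
        /* ... baserel-only work ... */
    }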

--
Yoshikazu Imai



looping-over-last-baserel-idx.diff
Description: looping-over-last-baserel-idx.diff


Re: speeding up planning with partitions

2019-03-21 Thread Imai Yoshikazu
Hi Jesper,

On 2019/03/20 23:25, Jesper Pedersen wrote:
 > Hi,
 >
 > On 3/19/19 11:15 PM, Imai, Yoshikazu wrote:
 >> Here the details.
 >>
 >> [creating partitioned tables (with 1024 partitions)]
 >> drop table if exists rt;
 >> create table rt (a int, b int, c int) partition by range (a);
 >> \o /dev/null
 >> select 'create table rt' || x::text || ' partition of rt for values
 >> from (' ||
 >>   (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1,
 >> 1024) x;
 >> \gexec
 >> \o
 >>
 >> [select1024.sql]
 >> \set a random (1, 1024)
 >> select * from rt where a = :a;
 >>
 >> [pgbench]
 >> pgbench -n -f select1024.sql -T 60
 >>
 >>
 >
 > My tests - using hash partitions - shows that the extra time is spent in
 > make_partition_pruneinfo() for the relid_subplan_map variable.
 >
 >64 partitions: make_partition_pruneinfo() 2.52%
 > 8192 partitions: make_partition_pruneinfo() 5.43%
 >
 > TPS goes down ~8% between the two. The test setup is similar to the 
above.
 >
 > Given that Tom is planning to change the List implementation [1] in 13 I
 > think using the palloc0 structure is ok for 12 before trying other
 > implementation options.
 >
 > perf sent off-list.
 >
 > [1] 
https://www.postgresql.org/message-id/24783.1551568303%40sss.pgh.pa.us
 >
 > Best regards,
 >   Jesper
 >
 >

Thanks for testing.
Yeah, after looking at the code, I think the bottleneck lurks in another 
place that this patch doesn't change. I also think the patch is OK as 
it is for 12, and this problem can be fixed in 13.

--
Yoshikazu Imai


RE: speeding up planning with partitions

2019-03-20 Thread Imai, Yoshikazu
Amit-san,

On Wed, Mar 20, 2019 at 9:07 AM, Amit Langote wrote:
> On 2019/03/20 17:36, Imai, Yoshikazu wrote:
> > On Wed, Mar 20, 2019 at 8:21 AM, Amit Langote wrote:
> >> On 2019/03/20 12:15, Imai, Yoshikazu wrote:
> >>> [select1024.sql]
> >>> \set a random (1, 1024)
> >>> select * from rt where a = :a;
> >>>
> >>> [pgbench]
> >>> pgbench -n -f select1024.sql -T 60
> >>
> >> Thank you.
> >>
> >> Could you please try with running pgbench for a bit longer than 60
> seconds?
> >
> > I run pgbench for 180 seconds but there are still difference.
> 
> Thank you very much.
> 
> > 1024: 7,004 TPS
> > 8192: 5,859 TPS
> >
> >
> > I also tested for another number of partitions by running pgbench for
> 60 seconds.
> >
> > num of part   TPS
> > -----------   -----
> > 128           7,579
> > 256           7,528
> > 512           7,512
> > 1024          7,257 (7274, 7246, 7252)
> > 2048          6,718 (6627, 6780, 6747)
> > 4096          6,472 (6434, 6565, 6416) (quoted from above (3)'s results)
> > 8192          6,008 (6018, 5999, 6007)
> >
> >
> > I checked whether there are the process which go through the number
> of partitions, but I couldn't find. I'm really wondering why this
> degradation happens.
> 
> Indeed, it's quite puzzling why.  Will look into this.

I don't know whether it is useful, but I noticed that the usage of 
get_tabstat_entry increases when many partitions are scanned.

--
Yoshikazu Imai 



RE: speeding up planning with partitions

2019-03-20 Thread Imai, Yoshikazu
Amit-san,

On Wed, Mar 20, 2019 at 8:21 AM, Amit Langote wrote:
> On 2019/03/20 12:15, Imai, Yoshikazu wrote:
> > Here the details.
> >
> > [creating partitioned tables (with 1024 partitions)] drop table if
> > exists rt; create table rt (a int, b int, c int) partition by range
> > (a); \o /dev/null select 'create table rt' || x::text || ' partition
> > of rt for values from (' ||  (x)::text || ') to (' || (x+1)::text ||
> > ');' from generate_series(1, 1024) x; \gexec \o
> >
> > [select1024.sql]
> > \set a random (1, 1024)
> > select * from rt where a = :a;
> >
> > [pgbench]
> > pgbench -n -f select1024.sql -T 60
> 
> Thank you.
> 
> Could you please try with running pgbench for a bit longer than 60 seconds?

I ran pgbench for 180 seconds but there is still a difference.

1024: 7,004 TPS
8192: 5,859 TPS


I also tested other numbers of partitions by running pgbench for 60 
seconds.

num of part   TPS
-----------   -----
128           7,579
256           7,528
512           7,512
1024          7,257 (7274, 7246, 7252)
2048          6,718 (6627, 6780, 6747)
4096          6,472 (6434, 6565, 6416) (quoted from above (3)'s results)
8192          6,008 (6018, 5999, 6007)


I checked whether there is any processing that loops over the number of 
partitions, but I couldn't find any. I'm really wondering why this degradation 
happens.

Yoshikazu Imai 



RE: speeding up planning with partitions

2019-03-19 Thread Imai, Yoshikazu
Amit-san,

On Wed, Mar 20, 2019 at 3:01 PM, Amit Langote wrote:
> > On Wed, Mar 20, 2019 at 2:34 AM, Amit Langote wrote:
> >> On 2019/03/20 11:21, Imai, Yoshikazu wrote:
> >>> (4)
> >>> We expect the performance does not depend on the number of
> >>> partitions
> >> after applying all patches, if possible.
> >>>
> >>> num of part   TPS
> >>> -----------   -----
> >>> 1024          7,257 (7274, 7246, 7252)
> >>> 2048          6,718 (6627, 6780, 6747)
> >>> 4096          6,472 (6434, 6565, 6416) (quoted from above (3)'s results)
> >>> 8192          6,008 (6018, 5999, 6007)
> >>>
> >>> It seems the performance still depend on the number of partitions.
> >>> At
> >> the moment, I don't have any idea what cause this problem but can we
> >> improve this more?
> >>
> >> I've noticed [1] this kind of degradation when the server is built
> >> with Asserts enabled.  Did you?
> >> ...
> >> [1]
> >>
> https://www.postgresql.org/message-id/a49372b6-c044-4ac8-84ea-90ad18
> >> b1770d%40lab.ntt.co.jp
> >
> > No. I did test again from configuring without --enable-cassert but
> problem I mentioned still happens.
> 
> Hmm, OK.  Can you describe your test setup with more details?

Here are the details.

[creating partitioned tables (with 1024 partitions)]
drop table if exists rt;
create table rt (a int, b int, c int) partition by range (a);
\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' ||
 (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 1024) x;
\gexec
\o

[select1024.sql]
\set a random (1, 1024)
select * from rt where a = :a;

[pgbench]
pgbench -n -f select1024.sql -T 60


What I noticed so far is that it might also depend on the query. I created a 
table with 8192 partitions and ran select statements like "select * from rt 
where a = :a" with :a ranging from 1 to 1024, and the same with :a ranging from 
1 to 8192, and the results of those were different.

I'll send the perf results off-list.

--
Yoshikazu Imai 



RE: speeding up planning with partitions

2019-03-19 Thread Imai, Yoshikazu
Amit-san,

On Wed, Mar 20, 2019 at 2:34 AM, Amit Langote wrote:
> On 2019/03/20 11:21, Imai, Yoshikazu wrote:
> > (4)
> > We expect the performance does not depend on the number of partitions
> after applying all patches, if possible.
> >
> > num of part   TPS
> > -----------   -----
> > 1024          7,257 (7274, 7246, 7252)
> > 2048          6,718 (6627, 6780, 6747)
> > 4096          6,472 (6434, 6565, 6416) (quoted from above (3)'s results)
> > 8192          6,008 (6018, 5999, 6007)
> >
> > It seems the performance still depend on the number of partitions. At
> the moment, I don't have any idea what cause this problem but can we improve
> this more?
> 
> I've noticed [1] this kind of degradation when the server is built with
> Asserts enabled.  Did you? 
> ...
> [1]
> https://www.postgresql.org/message-id/a49372b6-c044-4ac8-84ea-90ad18
> b1770d%40lab.ntt.co.jp

No. I tested again, configuring without --enable-cassert, but the problem I 
mentioned still happens.

--
Yoshikazu Imai 



RE: speeding up planning with partitions

2019-03-19 Thread Imai, Yoshikazu
Amit-san,

On Wed, Mar 20, 2019 at 0:42 AM, Amit Langote wrote:
> On 2019/03/19 20:13, Imai, Yoshikazu wrote:
> > Thanks for new patches.
> > I looked over them and there are little comments. 
> >
> > ...
> >
> > I have no more comments about codes other than above :)
> 
> I have fixed all.  Attached updated patches.

Thanks for the quick modification.

I did the performance tests again; I ran the four tests below.
Judging from the results of the last test, there might be room to improve 
performance further.

(1)
v33-0003 slows down queries when there are inherited tables in UPDATE/DELETE's 
FROM clause.
v33-0004 fixes this problem.

* rt with 32 partitions.
* jrt with 32 partitions.
* update rt set c = jrt.c from jrt where rt.b = jrt.b;

patch    TPS
------   -----
master   66.8 (67.2, 66.8, 66.4)
0003     47.5 (47.2, 47.6, 47.7)
0004     68.8 (69.2, 68.9, 68.4)

It seems to fix the performance problem correctly.

(2)
v33-0005 speeds up UPDATE/DELETE by removing useless copies of child target RTEs 
in adjust_appendrel_attrs().

* rt with 32 partitions.
* update rt set b = b + 1;

patch    TPS
------   -----
master   986 (999, 984, 974)
0005     1,589 (1608, 1577, 1582)

It seems to speed up performance as expected.

(3)
v33-0006, 0007, and 0008 speed up the case where few partitions are scanned 
among many.

* rt with 4096 partitions, partitioned by column 'a'.
* select * from rt where rt.a = :a (:a ranges from 1 to 4096)

patch    TPS
------   -----
master   22.4 (22.4, 22.5, 22.2)
0005     20.9 (20.9, 21.2, 20.6)
0006     2,951 (2958, 2958, 2937)
0007     3,141 (3146, 3141, 3136)
0008     6,472 (6434, 6565, 6416)

Certainly, the patches seem to speed up this case.

(4)
We expect that performance does not depend on the number of partitions after 
applying all patches, if possible.

num of part   TPS
-----------   -----
1024          7,257 (7274, 7246, 7252)
2048          6,718 (6627, 6780, 6747)
4096          6,472 (6434, 6565, 6416) (quoted from above (3)'s results)
8192          6,008 (6018, 5999, 6007)

It seems performance still depends on the number of partitions. At the 
moment, I have no idea what causes this problem, but can we improve this 
further?


--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-03-19 Thread Imai, Yoshikazu
Amit-san,

On Tue, Mar 19, 2019 at 6:53 AM, Amit Langote wrote:
> On 2019/03/15 9:33, Imai, Yoshikazu wrote:
> > On Thu, Mar 14, 2019 at 9:04 AM, Amit Langote wrote:
> >>> * In inheritance_planner(), we do ChangeVarNodes() only for
> >>> orig_rtable,
> >> so the codes concatenating lists of append_rel_list may be able to
> be
> >> moved before doing ChangeVarNodes() and then the codes concatenating
> >> lists of rowmarks, rtable and append_rel_list can be in one block (or
> >> one bunch).
> >>
> >> Yeah, perhaps.  I'll check.
> 
> I'm inclined to add source_appinfos to subroot->append_rel_list after
> finishing the ChangeVarNodes(subroot->append_rel_list) step, because if
> there are many entries in source_appinfos that would unnecessarily make
> ChangeVarNodes take longer.

OK, thanks.

> I've attached updated patches.  In the new version, I've moved some code
> from 0004 to 0005 patch, so as to avoid mixing unrelated modifications
> in one patch.  Especially, orig_rtable now only appears after applying
> 0005.
> 
> I appreciate your continued interest in these patches.

Thanks for the new patches.
I looked over them and have a few small comments.

[0002]
* s/regresion/regression/g
(in commit message.)

[0003]
* I thought "inh flag is it" is "inh flag is set" ...?

+* For RTE_RELATION rangetable entries whose inh flag is it, adjust the


* The comments below become correct after applying 0004.

+* the query's target relation and no other.  Especially, children of any
+* source relations are added when the loop below calls grouping_planner
+* on the *1st* child target relation.

[0004]
* s/contain contain/contain/

+* will contain contain references to the subquery RTEs that we've


* s/find them children/find their children/

+* AppendRelInfos needed to find them children, so the next

[0006]
* s/recursivly/recursively/
(in commit message)


I have no more comments about codes other than above :)

--
Yoshikazu Imai


RE: Speed up transaction completion faster after many relations are accessed in a transaction

2019-03-19 Thread Imai, Yoshikazu
Hi Tsunakawa-san, Peter

On Tue, Mar 19, 2019 at 7:53 AM, Tsunakawa, Takayuki wrote:
> From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com]
> > You posted a link to some performance numbers, but I didn't see the
> > test setup explained there.  I'd like to get some more information on
> > this impact of this.  Is there an effect with 100 tables, or do you
> need 10?
> 
> Imai-san, can you tell us the test setup?

Maybe I used this test setup [1].

I tested again with those settings for prepared transactions.
I used Tsunakawa-san's patch for locallock [2] (which couldn't be applied to 
current master, so I fixed it) and Amit's v32 patch for speeding up the planner [3].

[settings]
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096

[partitioning table definitions(with 4096 partitions)]
create table rt (a int, b int, c int) partition by range (a);

\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' ||
 (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 4096) x;
\gexec
\o

[select4096.sql]
\set a random(1, 4096)
select a from rt where a = :a;

[pgbench(with 4096 partitions)]
pgbench -n -f select4096.sql -T 60 -M prepared

[results]
         master   locallock   v32     v32+locallock
         ------   ---------   -----   -------------
auto     21.9     22.9        6,834   7,355
custom   19.7     20.0        7,415   7,252


[1] 
https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A51256276%40g01jpexmbkw24
[2] 
https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1FBDFA00%40G01JPEXMBYT05
[3] 
https://www.postgresql.org/message-id/9feacaf6-ddb3-96dd-5b98-df5e927b1439%40lab.ntt.co.jp

--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-03-18 Thread Imai, Yoshikazu
Amit-san,

On Mon, Mar 18, 2019 at 9:56 AM, Amit Langote wrote:
> On 2019/03/15 14:40, Imai, Yoshikazu wrote:
> > Amit-san,
> >
> > I have another little comments about v31-patches.
> >
> > * We don't need is_first_child in inheritance_planner().
> >
> > On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote:
> >> On 2019/03/08 16:16, Imai, Yoshikazu wrote:
> >>> I attached the diff of modification for v26-0003 patch which also
> >> contains some refactoring.
> >>> Please see if it is ok.
> >>
> >> I like the is_first_child variable which somewhat improves
> >> readability, so updated the patch to use it.
> >
> > I noticed that detecting first child query in inheritance_planner()
> is already done by "final_rtable == NIL"
> >
> > /*
> >  * If this is the first non-excluded child, its post-planning
> rtable
> >  * becomes the initial contents of final_rtable; otherwise,
> append
> >  * just its modified subquery RTEs to final_rtable.
> >  */
> > if (final_rtable == NIL)
> >
> > So I think we can use that instead of using is_first_child.
> 
> That's a good suggestion.  One problem I see with that is that
> final_rtable is set only once we've found the first *non-dummy* child.

Ah... I overlooked that.

> So, if all the children except the last one are dummy, we'd end up never
> reusing the source child objects.  Now, if final_rtable was set for the
> first child irrespective of whether it turned out to be dummy or not,
> which it seems OK to do, then we can implement your suggestion.

I thought you meant we could remove the code below, or move it to after where 
final_rtable is set.

/*
 * If this child rel was excluded by constraint exclusion, exclude it
 * from the result plan.
 */
if (IS_DUMMY_REL(sub_final_rel))
continue;

It seems logically OK, but I also thought there are cases where useless 
processing happens.

If we execute the update statement below, final_rtable would be unnecessarily 
expanded by adding placeholders.

* partitioned table rt with 1024 partitions.
* partitioned table pt with 2 partitions.
* update rt set c = ss.c from ( select b from pt union all select b + 3 from 
pt) ss where a = 1024 and rt.b = ss.b;


After all, I think it might be better to use is_first_child, because the meanings 
of is_first_child and final_rtable are different:
is_first_child == true means we are currently processing the first child 
query, while final_rtable == NIL means we haven't yet found the first 
non-excluded child query.
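
To illustrate the difference, the control flow is roughly (a simplified
sketch, not the actual inheritance_planner() source):

    foreach(lc, child_queries)
    {
        /* ... plan this child into subroot / sub_final_rel ... */

        /* dummy children are skipped before final_rtable is set ... */
        if (IS_DUMMY_REL(sub_final_rel))
            continue;

        if (final_rtable == NIL)
        {
            /* ... so this fires on the first NON-DUMMY child, whereas
             * is_first_child is true only for the very first child */
            final_rtable = subroot->parse->rtable;
        }
        else
        {
            /* append just the modified subquery RTEs */
        }
    }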

--
Yoshikazu Imai


RE: proposal: pg_restore --convert-to-text

2019-03-17 Thread Imai, Yoshikazu
On Fri, Mar 15, 2019 at 6:20 AM, Imai, Yoshikazu wrote:
> Upon committing this, we have to care this patch break backwards
> compatibility, but I haven't seen any complaints so far. If there are
> no objections, I will set this patch to ready for committer.

Jose had set this to ready for committer. Thanks.

--
Yoshikazu Imai



RE: proposal: pg_restore --convert-to-text

2019-03-17 Thread Imai, Yoshikazu
On Sat, Mar 16, 2019 at 10:55 PM, Euler Taveira wrote:
> > Is there no need to rewrite the Description in the Doc to state we should
> specify either -d or -f option?
> > (and also it might be better to write if -l option is given, neither
> > -d nor -f option isn't necessarily needed.)
> >
> I don't think so. The description is already there (see "pg_restore can
> operate in two modes..."). I left -l as is which means that 'pg_restore
> -l foo.dump' dumps to standard output and 'pg_restore -f - -l foo.dump'
> has the same behavior).

Ah, I understand it.

> > I think the former one looks like pretty, but which one is preffered?
> >
> I don't have a style preference but decided to change to your suggestion.
> New version attached.

I checked it. It may be a trivial matter, so thanks for taking it into consideration.

--
Yoshikazu Imai


RE: proposal: pg_restore --convert-to-text

2019-03-14 Thread Imai, Yoshikazu
Hi Jose,

Sorry for my late reply.

On Wed, Mar 6, 2019 at 10:58 AM, José Arthur Benetasso Villanova wrote:
> On Thu, 28 Feb 2019, Imai, Yoshikazu wrote:
> 
> > Is there no need to rewrite the Description in the Doc to state we should
> specify either -d or -f option?
> > (and also it might be better to write if -l option is given, neither
> > -d nor -f option isn't necessarily needed.)
> 
> Since the default part of text was removed, looks ok to me.

Ah, yeah, after looking again, I also think it's ok.

> > I also have the simple question in the code.
> >
> > I thought the below if-else condition
> >
> > +   if (filename && strcmp(filename, "-") == 0)
> > +   fn = fileno(stdout);
> > +   else if (filename)
> > fn = -1;
> >else if (AH->FH)
> >
> > can also be written by the form below.
> >
> >if (filename)
> >{
> >if(strcmp(filename, "-") == 0)
> >fn = fileno(stdout);
> >else
> >fn = -1;
> >}
> >else if (AH->FH)
> >
> > I think the former one looks like pretty, but which one is preffered?
> 
> Aside the above question, I tested the code against a up-to-date
> repository. It compiled, worked as expected and passed all tests.

It can still be applied to HEAD according to cfbot.


Upon committing this, we have to take care that this patch breaks backwards 
compatibility, but I haven't seen any complaints so far. If there are no 
objections, I will set this patch to ready for committer.

--
Yoshikazu Imai


RE: speeding up planning with partitions

2019-03-14 Thread Imai, Yoshikazu
Amit-san,

I have a few more little comments about the v31 patches.

* We don't need is_first_child in inheritance_planner().

On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote:
> On 2019/03/08 16:16, Imai, Yoshikazu wrote:
> > I attached the diff of modification for v26-0003 patch which also
> contains some refactoring.
> > Please see if it is ok.
> 
> I like the is_first_child variable which somewhat improves readability,
> so updated the patch to use it.

I noticed that detecting the first child query in inheritance_planner() is 
already done by checking "final_rtable == NIL":

/*
 * If this is the first non-excluded child, its post-planning rtable
 * becomes the initial contents of final_rtable; otherwise, append
 * just its modified subquery RTEs to final_rtable.
 */
if (final_rtable == NIL)

So I think we can use that instead of using is_first_child.

--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-03-14 Thread Imai, Yoshikazu
Hi, David

On Thu, Mar 14, 2019 at 9:04 AM, David Rowley wrote:
> On Thu, 14 Mar 2019 at 21:35, Imai, Yoshikazu
>  wrote:
> > 0007:
> > * This changes some processes using "for loop" to using
> "while(bms_next_member())" which speeds up processing when we scan few
> partitions in one statement, but when we scan a lot of partitions in one
> statement, its performance will likely degraded. I measured the
> performance of both cases.
> > I executed select statement to the table which has 4096 partitions.
> >
> > [scanning 1 partition]
> > Without 0007 : 3,450 TPS
> > With 0007: 3,723 TPS
> >
> > [scanning 4096 partitions]
> > Without 0007 : 10.8 TPS
> > With 0007: 10.5 TPS
> >
> > In the above result, performance degrades 3% in case of scanning 4096
> partitions compared before and after applying 0007 patch. I think when
> scanning a lot of tables, executor time would be also longer, so the
> increasement of planner time would be relatively smaller than it. So we
> might not have to care this performance degradation.
> 
> I think it's better to focus on the fewer partitions case due to the fact
> that execution initialisation time and actual execution are likely to
> take much longer when more partitions are scanned.  I did some work on
> run-time pruning to tune it for this case.  Tom did make a similar argument
> in [1] and I explained my reasoning in [2].

Thanks for quoting these threads.
Actually, I recalled this argument, so I tested this just to make sure. 

> bms_next_member has gotten a good performance boost since then and the
> cases are not exactly the same since the old version the loop in run-time
> pruning checked bms_is_member(), but the fact is, we did end up tuning
> for the few partitions case in the end.

Wow, I didn't know that.

> However, it would be good to see the performance results for
> plan+execution time of say a table with 4k parts looking up a single
> indexed value.  You could have two columns, one that's the partition key
> which allows the pruning to take place, and one that's not and results
> in scanning all partitions. I'll be surprised if you even notice the
> difference between with and without 0007 with the latter case.

So I tested to check the performance of plan+execution time.

[set up table and indexes]
create table rt (a int, b int) partition by range (a);
\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' ||
 (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 4096) x;
\gexec
\o

create index b_idx on rt (b);

insert into rt select a, b from generate_series(1, 4096) a, generate_series(1, 
1000) b;

[select_indexed_values.sql]
\set b random(1, 1000)
select count(*) from rt where b = :b;

[pgbench]
pgbench -n -f select_indexed_values.sql -T 60 postgres

[results]
Without 0007: 3.18 TPS (3.25, 3.13, 3.15)
With 0007:    3.21 TPS (3.21, 3.23, 3.18)

From the results, we don't see any performance degradation in this case. 
Actually, performance increased by 1% after applying 0007, but that would be 
just measurement error.
So, generally, we can assume the performance difference between bms_next_member 
and the for loop is absorbed by other processing (execution initialization and 
actual execution) when scanning many partitions.

--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-03-14 Thread Imai, Yoshikazu
Amit-san,

On Thu, Mar 14, 2019 at 9:04 AM, Amit Langote wrote:
> > 0002:
> > * I don't really know that delaying adding resjunk output columns to
> the plan doesn't affect any process in the planner. From the comments
> of PlanRowMark, those columns are used in only the executor so I think
> delaying adding junk Vars wouldn't be harm, is that right?
> 
> I think so.  The only visible change in behavior is that "rowmark" junk
> columns are now placed later in the final plan's targetlist.

ok, thanks.

> > 0003:
> > * Is there need to do CreatePartitionDirectory() if rte->inh becomes
> false?
> >
> > +   else if (rte->rtekind == RTE_RELATION && rte->inh)
> > +   {
> > +   rte->inh = has_subclass(rte->relid);
> > +
> > +   /*
> > +* While at it, initialize the PartitionDesc
> infrastructure for
> > +* this query, if not done yet.
> > +*/
> > +   if (root->glob->partition_directory == NULL)
> > +   root->glob->partition_directory =
> > +
>   CreatePartitionDirectory(CurrentMemoryContext);
> > +   }
> 
> We need to have create "partition directory" before we access a
> partitioned table's PartitionDesc for planning.  These days, we rely
> solely on PartitionDesc to determine a partitioned table's children.  So,
> we need to see the PartitionDesc before we can tell there are zero children
> and set inh to false.  In other words, we need the "partition directory"
> to have been set up in advance.

Ah, I see.

> > 0004:
> > * In addition to commit message, this patch also changes the method
> of adjusting AppendRelInfos which reference to the subquery RTEs, and
> new one seems logically correct.
> 
> Do you mean I should mention it in the patch's commit message?

Actually, I first thought it would be better to mention it in the patch's commit 
message, but that's not what I meant here. I only wanted to state that I confirmed 
the new method of adjusting AppendRelInfos seems logically correct :)

> > * In inheritance_planner(), we do ChangeVarNodes() only for orig_rtable,
> so the codes concatenating lists of append_rel_list may be able to be
> moved before doing ChangeVarNodes() and then the codes concatenating
> lists of rowmarks, rtable and append_rel_list can be in one block (or
> one bunch).
> 
> Yeah, perhaps.  I'll check.
> 
> On 2019/03/14 17:35, Imai, Yoshikazu wrote:> Amit-san,
> > I have done code review of v31 patches from 0004 to 0008.
> >
> > 0004:
> > * s/childern/children
> 
> Will fix.
> 
> > 0005:
> > * This seems reasonable for not using a lot of memory in specific
> > case, although it needs special looking of planner experts.
> 
> Certainly.

Thanks for these.

> > 0006:
> > * The codes initializing/setting RelOptInfo's part_rels looks like a
> > bit complicated, but I didn't come up with any good design so far.
> 
> I guess you mean in add_appendrel_other_rels().  The other option is not
> have that code there and expand partitioned tables freshly for every
> target child, which we got rid of in patch 0004.  But we don't want to
> do that.

Yeah, that's right.

> > 0007:
> > * This changes some processes using "for loop" to using
> > "while(bms_next_member())" which speeds up processing when we scan few
> > partitions in one statement, but when we scan a lot of partitions in
> > one statement, its performance will likely degraded. I measured the
> > performance of both cases.
> > I executed select statement to the table which has 4096 partitions.
> >
> > [scanning 1 partition]
> > Without 0007 : 3,450 TPS
> > With 0007: 3,723 TPS
> >
> > [scanning 4096 partitions]
> > Without 0007 : 10.8 TPS
> > With 0007: 10.5 TPS
> >
> > In the above result, performance degrades 3% in case of scanning 4096
> > partitions compared before and after applying 0007 patch. I think when
> > scanning a lot of tables, executor time would be also longer, so the
> > increasement of planner time would be relatively smaller than it. So
> > we might not have to care this performance degradation.
> 
> Well, live_parts bitmapset is optimized for queries scanning only few
> partitions.  It's true that it's slightly more expensive than a simple
> for loop over part_rels, but not too much as you're also saying.

Thanks for the comments.

--
Yoshikazu Imai


RE: speeding up planning with partitions

2019-03-14 Thread Imai, Yoshikazu
Amit-san,

I have done a code review of the v31 patches from 0004 to 0008.

0004:
* s/childern/children

0005:
* This seems reasonable for avoiding heavy memory use in a specific case, 
although it needs a careful look from planner experts.

0006:
* The code initializing/setting RelOptInfo's part_rels looks a bit 
complicated, but I haven't come up with any better design so far.

0007:
* This changes some code from a plain "for" loop to 
"while (bms_next_member())", which speeds up processing when we scan few 
partitions in one statement; but when we scan a lot of partitions in one 
statement, performance will likely degrade. I measured the performance of 
both cases.
I executed a select statement against a table with 4096 partitions.

[scanning 1 partition]
Without 0007: 3,450 TPS
With 0007:    3,723 TPS

[scanning 4096 partitions]
Without 0007: 10.8 TPS
With 0007:    10.5 TPS

In the above results, performance degrades 3% in the case of scanning 4096 
partitions, comparing before and after the 0007 patch. I think that when 
scanning a lot of tables, executor time is also longer, so the increase in 
planner time is relatively small compared to it. So we might not have to worry 
about this performance degradation.

0008:
This seems ok.


--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-03-13 Thread Imai, Yoshikazu
Amit-san,

On Tue, Mar 12, 2019 at 2:34 PM, Amit Langote wrote:
> Thanks for the heads up.
> 
> On Tue, Mar 12, 2019 at 10:23 PM Jesper Pedersen
>  wrote:
> > After applying 0004 check-world fails with the attached. CFBot agrees
> [1].
> 
> Fixed.  I had forgotten to re-run postgres_fdw tests after making a change
> just before submitting.

I have done a code review of the v31 patches from 0001 to 0004.
Below I describe what I confirmed, and my thoughts.

0001: This seems ok.

0002:
* I'm not really sure that delaying the addition of resjunk output columns to 
the plan doesn't affect any process in the planner. From the comments of 
PlanRowMark, those columns are used only in the executor, so I think delaying 
the addition of junk Vars wouldn't do harm; is that right?

0003:
* Is there need to do CreatePartitionDirectory() if rte->inh becomes false?

+   else if (rte->rtekind == RTE_RELATION && rte->inh)
+   {
+   rte->inh = has_subclass(rte->relid);
+
+   /*
+* While at it, initialize the PartitionDesc 
infrastructure for
+* this query, if not done yet.
+*/
+   if (root->glob->partition_directory == NULL)
+   root->glob->partition_directory =
+   
CreatePartitionDirectory(CurrentMemoryContext);
+   }

0004:
* In addition to what the commit message says, this patch also changes the 
method of adjusting AppendRelInfos that reference the subquery RTEs, and the 
new one seems logically correct.

* In inheritance_planner(), we do ChangeVarNodes() only for orig_rtable, so the 
code concatenating the lists of append_rel_list could perhaps be moved to before 
ChangeVarNodes(); then the code concatenating the lists of rowmarks, rtable, 
and append_rel_list could be in one block (one bunch).

--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-03-10 Thread Imai, Yoshikazu
Amit-san,

On Fri, Mar 8, 2019 at 9:18 AM, Amit Langote wrote:
> On 2019/03/08 16:16, Imai, Yoshikazu wrote:
> > So I modified the code and did test to confirm memory increasement don't
> happen. The test and results are below.
> >
> > [test]
> > * Create partitioned table with 1536 partitions.
> > * Execute "update rt set a = random();"
> >
> > [results]
> > A backend uses below amount of memory in update transaction:
> >
> > HEAD: 807MB
> > With v26-0001, 0002: 790MB
> > With v26-0001, 0002, 0003: 860MB
> > With v26-0003 modified: 790MB
> 
> Can you measure with v28, or better attached v29 (about which more below)?
> 
> > I attached the diff of modification for v26-0003 patch which also
> contains some refactoring.
> > Please see if it is ok.
> 
> I like the is_first_child variable which somewhat improves readability,
> so updated the patch to use it.
> 
> Maybe you know that range_table_mutator() spends quite a long time if
> there are many target children, but I realized there's no need for
> range_table_mutator() to copy/mutate child target RTEs.  First, there's
> nothing to translate in their case.  Second, copying them is not necessary
> too, because they're not modified across different planning cycles.  If
> we pass to adjust_appendrel_attrs only the RTEs in the original range
> table (that is, before any child target RTEs were added), then
> range_table_mutator() has to do significantly less work and allocates
> lot less memory than before.  I've broken this change into its own patch;
> see patch 0004.

Cool!
I tested with the v29 patches and confirmed they save a lot of memory:

HEAD: 807MB
With v29-0001, 0002, 0003, 0004: 167MB

Planning time in this case is probably also improved, but I didn't check it.


> but I realized there's no need for
> range_table_mutator() to copy/mutate child target RTEs.  First, there's
> nothing to translate in their case.  Second, copying them is not necessary
> too, because they're not modified across different planning cycles.

Yeah, although I couldn't check the code in detail, from the below comments in 
inheritance_planner(), ISTM we need copies of the subquery RTEs but not copies 
of the other RTEs in each planning cycle.

/*   
 * We generate a modified instance of the original Query for each target
 * relation, plan that, and put all the plans into a list that will be
 * controlled by a single ModifyTable node.  All the instances share the
 * same rangetable, but each instance must have its own set of subquery
 * RTEs within the finished rangetable because (1) they are likely to get
 * scribbled on during planning, and (2) it's not inconceivable that
 * subqueries could get planned differently in different cases.  We need
 * not create duplicate copies of other RTE kinds, in particular not the
 * target relations, because they don't have either of those issues.  Not
 * having to duplicate the target relations is important because doing so
 * (1) would result in a rangetable of length O(N^2) for N targets, with
 * at least O(N^3) work expended here; and (2) would greatly complicate
 * management of the rowMarks list.


Thanks 
--
Yoshikazu Imai


RE: speeding up planning with partitions

2019-03-07 Thread Imai, Yoshikazu
Amit-san,

On Wed, Mar 6, 2019 at 5:38 AM, Amit Langote wrote:
...
> > I didn't investigate that problem, but there is another memory
> > increase
> issue, which is because of 0003 patch I think. I'll try to solve the latter
> issue.
> 
> Interested in details as it seems to be a separate problem.

I solved this problem.

I think we don't need to do list_copy in the below code.

+   /*
+* No need to copy of the RTEs themselves, but do copy the List
+* structure.
+*/
+   subroot->parse->rtable = list_copy(rtable_with_target);

Because subroot->parse->rtable will be copied again by:

subroot->parse = (Query *)
adjust_appendrel_attrs(parent_root,
-  (Node *) parent_parse,
+  (Node *) subroot->parse,
   1, &appinfo);

So I modified the code and did a test to confirm the memory increase doesn't 
happen. The test and results are below.

[test]
* Create partitioned table with 1536 partitions.
* Execute "update rt set a = random();"

[results]
A backend uses below amount of memory in update transaction:

HEAD: 807MB
With v26-0001, 0002: 790MB
With v26-0001, 0002, 0003: 860MB
With v26-0003 modified: 790MB


I attached the diff of my modification to the v26-0003 patch, which also contains 
some refactoring.
Please see if it is OK.
(Sorry that it is a modification to the v26 patch even though the latest ones are 
v28.)

--
Yoshikazu Imai 



v26-0003-solving-memory-increasement-problem.diff
Description: v26-0003-solving-memory-increasement-problem.diff


RE: speeding up planning with partitions

2019-03-06 Thread Imai, Yoshikazu
Amit-san,

On Wed, Mar 6, 2019 at 5:38 AM, Amit Langote wrote:
> On 2019/03/06 11:09, Imai, Yoshikazu wrote:
> > [0004 or 0005]
> > There are redundant process in add_appendrel_other_rels (or
> expand_xxx_rtentry()?).
> > I modified add_appendrel_other_rels like below and it actually worked.
> >
> >
> > add_appendrel_other_rels(PlannerInfo *root, RangeTblEntry *rte, Index
> > rti) {
> > ListCell   *l;
> > RelOptInfo *rel;
> >
> > /*
> >  * Add inheritance children to the query if not already done. For
> child
> >  * tables that are themselves partitioned, their children will be
> added
> >  * recursively.
> >  */
> > if (rte->rtekind == RTE_RELATION
> && !root->contains_inherit_children)
> > {
> > expand_inherited_rtentry(root, rte, rti);
> > return;
> > }
> >
> > rel = find_base_rel(root, rti);
> >
> > foreach(l, root->append_rel_list)
> > {
> > AppendRelInfo  *appinfo = lfirst(l);
> > Index   childRTindex = appinfo->child_relid;
> > RangeTblEntry  *childrte;
> > RelOptInfo *childrel;
> >
> > if (appinfo->parent_relid != rti)
> > continue;
> >
> > Assert(childRTindex < root->simple_rel_array_size);
> > childrte = root->simple_rte_array[childRTindex];
> > Assert(childrte != NULL);
> > build_simple_rel(root, childRTindex, rel);
> >
> > /* Child may itself be an inherited relation. */
> > if (childrte->inh)
> > add_appendrel_other_rels(root, childrte, childRTindex);
> > }
> > }
> 
> If you don't intialize part_rels here, then it will break any code in
> the planner that expects a partitioned rel with non-zero number of
> partitions to have part_rels set to non-NULL.  At the moment, that
> includes the code that implements partitioning-specific optimizations
> such partitionwise aggregate and join, run-time pruning etc.  No existing
> regression tests cover the cases where source inherited rels
> participate in those optimizations, so you didn't find anything that
> broke.  For an example, consider the following update query:
> 
> update p set a = p1.a + 1 from p p1 where p1.a = (select 1);
> 
> Planner will create a run-time pruning aware Append node for p (aliased
> p1), for which it will need to use the part_rels array.  Note that p in
> this case is a source relation which the above code initializes.
> 
> Maybe we should add such regression tests.

Ah, now I understand that the code at the bottom of expand_inherited_rtentry() 
initializes part_rels of the child queries after the first child target, and 
those part_rels are used in the partitioning-specific optimizations. Thanks for 
the explanation.

--
Yoshikazu Imai


RE: speeding up planning with partitions

2019-03-05 Thread Imai, Yoshikazu
On Wed, Mar 6, 2019 at 2:10 AM, Imai, Yoshikazu wrote:
> > and Imai-san's review.  I haven't been able to pin down the bug (or
> > whatever) that makes throughput go down as the partition count
> > increases, when tested with a --enable-cassert build.
> 
> I didn't investigate that problem, but there is another memory increase
> issue, which is because of 0003 patch I think. I'll try to solve the latter
> issue.

Ah, I noticed Amit-san identified the cause of the problem with the 
--enable-cassert build just now.

--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-03-05 Thread Imai, Yoshikazu
Amit-san,

On Tue, Mar 5, 2019 at 10:24 AM, Amit Langote wrote:
> On 2019/03/05 9:50, Amit Langote wrote:
> > I'll post the updated patches after diagnosing what I'm suspecting a
> > memory over-allocation bug in one of the patches.  If you configure
> > build with --enable-cassert, you'll see that throughput decreases as
> > number of partitions run into many thousands, but it doesn't when
> > asserts are turned off.
> 
> Attached an updated version.  This incorporates fixes for both Jesper's
> and Imai-san's review.

Thanks for updating patches!
Here are my code review comments for the previous v26 patches.

[0002]
In expand_inherited_rtentry():

expand_inherited_rtentry()
{
...
+   RelOptInfo *rel = NULL;

can be declared later:

if (oldrelation->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
...
else
{   
List   *appinfos = NIL;
RangeTblEntry *childrte;
Index   childRTindex;
+RelOptInfo *rel = NULL;


[0003]
In inheritance_planner:

+   rtable_with_target = list_copy(root->parse->rtable);

can be:

+   rtable_with_target = list_copy(parse->rtable);

[0004 or 0005]
There is a redundant process in add_appendrel_other_rels (or 
expand_xxx_rtentry()?).
I modified add_appendrel_other_rels like below and it actually worked.


add_appendrel_other_rels(PlannerInfo *root, RangeTblEntry *rte, Index rti) 
{
ListCell   *l;
RelOptInfo *rel;

/*   
 * Add inheritance children to the query if not already done. For child
 * tables that are themselves partitioned, their children will be added
 * recursively.
 */
if (rte->rtekind == RTE_RELATION && !root->contains_inherit_children)
{
expand_inherited_rtentry(root, rte, rti);
return;
}

rel = find_base_rel(root, rti);

foreach(l, root->append_rel_list)
{
AppendRelInfo  *appinfo = lfirst(l);
Index   childRTindex = appinfo->child_relid;
RangeTblEntry  *childrte;
RelOptInfo *childrel;

if (appinfo->parent_relid != rti) 
continue;

Assert(childRTindex < root->simple_rel_array_size);
childrte = root->simple_rte_array[childRTindex];
Assert(childrte != NULL);
build_simple_rel(root, childRTindex, rel);

/* Child may itself be an inherited relation. */
if (childrte->inh)
add_appendrel_other_rels(root, childrte, childRTindex);
}
}

> and Imai-san's review.  I haven't been able to pin down the bug (or
> whatever) that makes throughput go down as the partition count increases,
> when tested with a --enable-cassert build.

I didn't investigate that problem, but there is another memory increase issue, 
which I think is caused by the 0003 patch. I'll try to solve the latter issue.

--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-03-04 Thread Imai, Yoshikazu
Amit-san,

On Tue, Mar 5, 2019 at 0:51 AM, Amit Langote wrote:
> Hi Jesper,
> 
> Thanks for the review.  I've made all of the changes per your comments
> in my local repository.  I'll post the updated patches after diagnosing
> what I'm suspecting a memory over-allocation bug in one of the patches.
> If you configure build with --enable-cassert, you'll see that throughput
> decreases as number of partitions run into many thousands, but it doesn't
> when asserts are turned off.
> 
> On 2019/03/05 1:20, Jesper Pedersen wrote:
> > I'll run some tests using a hash partitioned setup.
> 
> Thanks.

I've also done code review of 0001 and 0002 patch so far.

[0001]
1. Do we need to open/close a relation in add_appendrel_other_rels()? 

+   if (rel->part_scheme)
{
-   ListCell   *l;
...
-   Assert(cnt_parts == nparts);
+   rel->part_rels = (RelOptInfo **)
+   palloc0(sizeof(RelOptInfo *) * rel->nparts);
+   relation = table_open(rte->relid, NoLock);
}

+   if (relation)
+   table_close(relation, NoLock);


2. We can refine the usage of Assert in add_appendrel_other_rels().

+   if (rel->part_scheme)
{
...
+   rel->part_rels = (RelOptInfo **)
+   palloc0(sizeof(RelOptInfo *) * rel->nparts);
+   relation = table_open(rte->relid, NoLock);
}
  ...
+   i  = 0;
+   foreach(l, root->append_rel_list)
+   {
...
+   if (rel->part_scheme != NULL)
+   {
+   Assert(rel->nparts > 0 && i < rel->nparts);
+   rel->part_rels[i] = childrel;
+   }
+
+   i++;

as below;

+   if (rel->part_scheme)
{
...
Assert(rel->nparts > 0);
+   rel->part_rels = (RelOptInfo **)
+   palloc0(sizeof(RelOptInfo *) * rel->nparts);
+   relation = table_open(rte->relid, NoLock);
}
  ...
+   i  = 0;
+   foreach(l, root->append_rel_list)
+   {
...
+   if (rel->part_scheme != NULL)
+   {
+   Assert(i < rel->nparts);
+   rel->part_rels[i] = childrel;
+   }
+
+   i++;


[0002]
3. If we used a variable like is_source_inh_expansion, the code would be easier 
to read. (I might be mistaken in understanding that root->simple_rel_array_size > 0 
means source inheritance expansion, though.)

In expand_inherited_rtentry() and expand_partitioned_rtentry():

+* Expand planner arrays for adding the child relations.  Can't do
+* this if we're not being called from query_planner.
+*/
+   if (root->simple_rel_array_size > 0)
+   {
+   /* Needed when creating child RelOptInfos. */
+   rel = find_base_rel(root, rti);
+   expand_planner_arrays(root, list_length(inhOIDs));
+   }

+   /* Create the otherrel RelOptInfo too. */
+   if (rel)
+   (void) build_simple_rel(root, childRTindex, rel);

would be:

+* Expand planner arrays for adding the child relations.  Can't do
+* this if we're not being called from query_planner.
+*/
+   if (is_source_inh_expansion)
+   {
+   /* Needed when creating child RelOptInfos. */
+   rel = find_base_rel(root, rti);
+   expand_planner_arrays(root, list_length(inhOIDs));
+   }

+   /* Create the otherrel RelOptInfo too. */
+   if (is_source_inh_expansion)
+   (void) build_simple_rel(root, childRTindex, rel);


4. I didn't look very carefully, but should the introduction of 
contains_inherit_children be in the 0003 patch...?


I'll continue with the code review of the rest of the patches.


--
Yoshikazu Imai 



RE: speeding up planning with partitions

2019-03-04 Thread Imai, Yoshikazu
Amit-san,

On Fri, Mar 1, 2019 at 1:02 PM, Amit Langote wrote:
> Thanks for testing and sorry it took me a while to reply.

Thanks for working on this late at night. I know you have a lot of things to 
do.

> On 2019/02/25 15:24, Imai, Yoshikazu wrote:
> > [update_pt_with_joining_another_pt.sql]
> > update rt set c = jrt.c + 100 from jrt where rt.b = jrt.b;
> >
> > [pgbench]
> > pgbench -n -f update_pt_with_joining_another_pt_for_ptkey.sql -T 60
> > postgres
> >
> > [results]
> > (part_num_rt, part_num_jrt)  master  patched(0001)
> > ---  --  -
> >   (3, 1024)8.06   5.89
> >   (3, 2048)1.52   0.87
> >   (6, 1024)4.11   1.77
> >
> >
> >
> > With HEAD, we add target inheritance and source inheritance to
> parse->rtable in inheritance_planner and copy and adjust it for child
> tables at beginning of each planning of child tables.
> >
> > With the 0001 patch, we add target inheritance to parse->rtable in
> inheritance_planner and add source inheritance to parse->rtable in
> make_one_rel(under grouping_planner()) during each planning of child
> tables.
> > Adding source inheritance to parse->rtable may be the same process
> between each planning of child tables and it might be useless. OTOH, from
> the POV of making inheritance-expansion-at-bottom better, expanding
> source inheritance in make_one_rel seems correct design to me.
> >
> > How should we do that...?
> 
> To solve this problem, I ended up teaching inheritance_planner to reuse
> the objects for *source* inheritance child relations (RTEs,
> AppendRelInfos, and PlanRowMarks) created during the planning of the 1st
> child query for the planning of subsequent child queries.  Set of source
> child relations don't change between different planning runs, so it's
> okay to do so.  Of course, I had to make sure that query_planner (which
> is not in the charge of adding source inheritance child objects) can notice
> that.

I did the above test again with the v25 patch and confirmed the problem is solved.

[results]
(part_num_rt, part_num_jrt)  master  patched(0001)
---------------------------  ------  -------------
(3, 1024)                      6.11           6.82
(3, 2048)                      1.05           1.48
(6, 1024)                      3.05           3.45

Sorry that I haven't checked the code related to this problem yet, but I'll 
check it later.


> Please find attached updated patches.  Will update source code comments,
> commit message and perform other fine-tuning over the weekend if possible.

I've glanced through the code and have some minor comments about the patch.

* You changed the usage/arguments of get_relation_info, but why did you do that? 
I reverted that code to the original and checked that it passes make check, so 
ISTM there is no logical problem with leaving it unchanged. Or, if you do change 
it, how about also changing the usage/arguments of get_relation_info_hook in the 
same way?

-get_relation_info(PlannerInfo *root, Oid relationObjectId, bool inhparent,
- RelOptInfo *rel)
+get_relation_info(PlannerInfo *root, RangeTblEntry *rte, RelOptInfo *rel)
 {
+   boolinhparent = rte->inh;
-   relation = table_open(relationObjectId, NoLock);
+   relation = heap_open(rte->relid, NoLock);
  ...
if (get_relation_info_hook)
-   (*get_relation_info_hook) (root, relationObjectId, inhparent, 
rel);
+   (*get_relation_info_hook) (root, rte->relid, rte->inh, rel);


@@ -217,15 +272,13 @@ build_simple_rel(PlannerInfo *root, int relid, RelOptInfo 
*parent)
/* Check type of rtable entry */
switch (rte->rtekind)
{
case RTE_RELATION:
/* Table --- retrieve statistics from the system 
catalogs */
-   get_relation_info(root, rte->relid, rte->inh, rel);
+   get_relation_info(root, rte, rel);


* You moved the code that initializes an append rel's partitioned_child_rels 
from set_append_rel_size() to build_simple_rel(), but is that better? I also 
checked that the original code passes make check, as above.

@@ -954,32 +948,6 @@ set_append_rel_size(PlannerInfo *root, RelOptInfo *rel,
Assert(IS_SIMPLE_REL(rel));
 
/*
-* Initialize partitioned_child_rels to contain this RT index.
-*
-* Note that during the set_append_rel_pathlist() phase, we will bubble up
-* the indexes of partitioned relations that appear down in the tree, so
-* that when we've created Paths for all the children, the root
-* partitioned table

RE: Problem with default partition pruning

2019-02-28 Thread Imai, Yoshikazu
Hosoya-san

On Wed, Feb 27, 2019 at 6:51 AM, Yuzuko Hosoya wrote:
> > From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> > Sent: Wednesday, February 27, 2019 11:22 AM
> >
> > Hosoya-san,
> >
> > On 2019/02/22 17:14, Yuzuko Hosoya wrote:
> > > Hi,
> > >
> > > I found the bug of default partition pruning when executing a range
> query.
> > >
> > > -
> > > postgres=# create table test1(id int, val text) partition by range
> > > (id); postgres=# create table test1_1 partition of test1 for values
> > > from (0) to (100); postgres=# create table test1_2 partition of
> > > test1 for values from (150) to (200); postgres=# create table
> > > test1_def partition of test1 default;
> > >
> > > postgres=# explain select * from test1 where id > 0 and id < 30;
> > >QUERY PLAN
> > > 
> > >  Append  (cost=0.00..11.83 rows=59 width=11)
> > >->  Seq Scan on test1_1  (cost=0.00..5.00 rows=58 width=11)
> > >  Filter: ((id > 0) AND (id < 30))
> > >->  Seq Scan on test1_def  (cost=0.00..6.53 rows=1 width=12)
> > >  Filter: ((id > 0) AND (id < 30))
> > > (5 rows)
> > >
> > > There is no need to scan the default partition, but it's scanned.
> > > -
> > >
> > > In the current implement, whether the default partition is scanned
> > > or not is determined according to each condition of given WHERE
> > > clause at get_matching_range_bounds().  In this example,
> > > scan_default is set true according to id > 0 because id >= 200
> > > matches the default partition.  Similarly, according to id < 30,
> scan_default is set true.
> > > Then, these results are combined according to AND/OR at
> perform_pruning_combine_step().
> > > In this case, final result's scan_default is set true.
> > >
> > > The modifications I made are as follows:
> > > - get_matching_range_bounds() determines only offsets of range bounds
> > >   according to each condition
> > > - These results are combined at perform_pruning_combine_step()
> > > - Whether the default partition is scanned or not is determined at
> > >   get_matching_partitions()
> > >
> > > Attached the patch.  Any feedback is greatly appreciated.
> >
> > Thank you for reporting.  Can you please add this to March CF in Bugs
> > category so as not to lose
> > track
> > of this?
> >
> > I will try to send review comments soon.
> >
> Thank you for your reply.  I added this to March CF.

I tested with a simple use case and confirmed it works correctly, as below.

In case using between clause:
postgres=# create table test1(id int, val text) partition by range (id); 
postgres=# create table test1_1 partition of test1 for values from (0) to 
(100); 
postgres=# create table test1_2 partition of test1 for values from (150) to 
(200);
postgres=# create table test1_def partition of test1 default; 

[HEAD]
postgres=# explain analyze select * from test1 where id between 0 and 50;
QUERY PLAN  
   
---
 Append  (cost=0.00..58.16 rows=12 width=36) (actual time=0.008..0.008 rows=0 
loops=1)
   ->  Seq Scan on test1_1  (cost=0.00..29.05 rows=6 width=36) (actual 
time=0.005..0.005 rows=0 loops=1)
 Filter: ((id >= 0) AND (id <= 50))
   ->  Seq Scan on test1_def  (cost=0.00..29.05 rows=6 width=36) (actual 
time=0.002..0.002 rows=0 loops=1)
 Filter: ((id >= 0) AND (id <= 50))


[patched]
postgres=# explain analyze select * from test1 where id between 0 and 50;
   QUERY PLAN   
 
-
 Append  (cost=0.00..29.08 rows=6 width=36) (actual time=0.006..0.006 rows=0 
loops=1)
   ->  Seq Scan on test1_1  (cost=0.00..29.05 rows=6 width=36) (actual 
time=0.004..0.005 rows=0 loops=1)
 Filter: ((id >= 0) AND (id <= 50))



I considered another use case. If the default partition contains rows with 
id = 300 and we then add another partition with constraints like id >= 300 and 
id < 400, I thought we wouldn't scan those rows anymore. But I noticed we simply 
can't add such a partition.

postgres=# insert into test1 values (300);
INSERT 0 1
postgres=# create table test1_3 partition of test1 for values from (300) to 
(400); 
ERROR:  updated partition constraint for default partition "test1_def" would be 
violated by some row
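
(For the record, an untested sketch of a workaround for this situation: detach 
the default partition first, move the offending rows into the new partition, and 
then reattach the default.)

begin;
alter table test1 detach partition test1_def;
create table test1_3 partition of test1 for values from (300) to (400);
with moved as (delete from test1_def where id >= 300 and id < 400 returning *)
insert into test1_3 select * from moved;
alter table test1 attach partition test1_def default;
commit;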


So I haven't come up with bad cases so far :)

--
Yoshikazu Imai 





RE: proposal: pg_restore --convert-to-text

2019-02-27 Thread Imai, Yoshikazu
Hi,

On Tue, Feb 19, 2019 at 8:20 PM, Euler Taveira wrote:
> Em seg, 18 de fev de 2019 às 19:21, Tom Lane  escreveu:
> >
> > Euler Taveira  writes:
> > > Since no one has stepped up, I took a stab at it. It will prohibit
> > > standard output unless '-f -' be specified. -l option also has the
> > > same restriction.
> >
> > Hm, don't really see the need to break -l usage here.
> >
> After thinking about it, revert it.
> 
> > Pls add to next CF, if you didn't already.
> >
> Done.

I took a look at the patch.

Is there no need to rewrite the Description in the docs to state that we should 
specify either the -d or the -f option?
(It might also be better to note that if the -l option is given, neither -d nor 
-f is necessarily needed.)


I also have a simple question about the code.

I thought the below if-else condition

+   if (filename && strcmp(filename, "-") == 0)
+   fn = fileno(stdout);
+   else if (filename)
fn = -1;
else if (AH->FH)

can also be written by the form below.

if (filename)
{
if(strcmp(filename, "-") == 0)
fn = fileno(stdout);
else
fn = -1;
}
else if (AH->FH)

I think the former looks prettier, but which one is preferred?

--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-02-24 Thread Imai, Yoshikazu
Hi Amit-san.

On Fri, Feb 22, 2019 at 5:55 PM, Amit Langote wrote:
> 
> Please find attached updated patches.  I've made a few updates in last
> couple of hours such as improving comments, fixing a few thinkos in
> inheritance_planner changes, etc.

Thanks for the patch. 

While doing code review of v24-0001, I found a performance degradation case.

[creating tables]
drop table rt;
create table rt (a int, b int, c int) partition by range (a);
\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' ||
 (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 3) x;
\gexec
\o

drop table if exists jrt;
create table jrt (a int, b int, c int) partition by range (a);
\o /dev/null
select 'create table jrt' || x::text || ' partition of jrt for values from (' ||
 (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 1024) x;
\gexec
\o

[update_pt_with_joining_another_pt.sql]
update rt set c = jrt.c + 100 from jrt where rt.b = jrt.b;

[pgbench]
pgbench -n -f update_pt_with_joining_another_pt_for_ptkey.sql -T 60 postgres

[results]
(part_num_rt, part_num_jrt)  master  patched(0001)
---------------------------  ------  -------------
(3, 1024)                      8.06           5.89
(3, 2048)                      1.52           0.87
(6, 1024)                      4.11           1.77



With HEAD, we add the target inheritance and source inheritance to parse->rtable 
in inheritance_planner, and copy and adjust them for the child tables at the 
beginning of each planning of a child table.

With the 0001 patch, we add target inheritance to parse->rtable in 
inheritance_planner and add source inheritance to parse->rtable in 
make_one_rel(under grouping_planner()) during each planning of child tables.
Adding source inheritance to parse->rtable may be the same process in each 
planning of a child table, so repeating it might be useless. OTOH, from the POV 
of making inheritance-expansion-at-bottom better, expanding source inheritance 
in make_one_rel seems like the correct design to me.

How should we do that...?

--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-02-18 Thread Imai, Yoshikazu
On Mon, Feb 18, 2019 at 5:28 PM, Tom Lane wrote:
> Frankly, that code is just horrid.  Having a function with side effects
> in an if-test is questionable at the best of times, and having it be the
> second of three conditions (which the third condition silently depends
> on) is unreadable and unmaintainable.

When I reviewed this, I thought there were no problems in the code, but I 
googled what Tom pointed out [1], read it, and was ashamed of my ignorance.

> I think the existing code here is considerably cleaner than what this
> patch proposes.
> 
> I suppose you are doing this because you intend to jam some additional
> cleanup code into the successfully-pruned-it code path, but if said code
> is really too bulky to have multiple copies of, couldn't you put it into
> a subroutine?

ISTM the 0004 patch eventually removes this code from multiple places 
(set_append_rel_size and set_inherited_target_rel_sizes), so perhaps we'd better 
not struggle over it here?

[1] 
https://www.teamten.com/lawrence/programming/keep-if-clauses-side-effect-free.html

--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-02-17 Thread Imai, Yoshikazu
Amit-san

Sorry for my late reply. I had other work to do.

On Fri, Feb 8, 2019 at 9:13 AM, Amit Langote wrote:
> On 2019/02/08 13:44, Imai, Yoshikazu wrote:
> > 3.
> > 0001: line 1919-1920
> >
> > -   case CONSTRAINT_EXCLUSION_ON:
> > -   break;  /* always try
> to exclude */
> >
> > CONSTRAINT_EXCLUSION_ON is no longer used, so should we remove it also
> from guc parameters?
> 
> Well, we haven't removed the "on" setting itself.

Ah, I understand.

> Okay, I've broken down those changes into separate patches, so that
> cleanup hunks are not fixed with other complex changes.
> 
> 0001 is now a patch to remove duplicate code from set_append_rel_size.
> It combines multiple blocks that have the same body doing
> set_dummy_rel_pathlist().
> 
> 0002 is the "overhaul inherited update/delete planning"
> 
> 0003 is a cleanup patch that gets rid of some code that is rendered useless
> due to 0002 (partitioned tables no longer use constraint exclusion)

Thanks for doing these.

> I think 0001 can be committed on its own.

+1.

In commit message:

s/contradictory quals found/contradictory quals are found/
s/child excluded/child is excluded/

I think the rest of 0001 is OK.

> 0002+0003 can be committed
> together.
>
> 0004-0006 are the patches that were previously 0002-0004.

I will do code review of v22 patches again and send notes as soon as possible.

Yoshikazu Imai



RE: speeding up planning with partitions

2019-02-08 Thread Imai, Yoshikazu
On Fri, Feb 8, 2019 at 1:34 AM, I wrote:
> On Wed, Feb 6, 2019 at 2:04 AM, Tsunakawa, Takayuki wrote:
> > Can you compare the performance of auto and force_custom_plan again
> > with the attached patch?  It uses PGPROC's LOCALLOCK list instead of
> > the hash table.
> 
> Thanks for the patch, but it seems to have some problems.

I had simply made a mistake when compiling.

The performance degradation I saw before is now gone! The results are below.

[v20 + faster-locallock-scan.patch]
auto:   9,069 TPS
custom: 9,015 TPS

[v20]
auto:   8,037 TPS
custom: 8,933 TPS

As David and I mentioned, this patch should be discussed on another thread, so, 
Tsunakawa-san, could you please start that thread?

Thanks
--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-02-07 Thread Imai, Yoshikazu
Amit-san,

On Thu, Feb 7, 2019 at 10:22 AM, Amit Langote wrote:
> Rebased over bdd9a99aac.

I did code review of 0001 and I have some suggestions. Could you check them?

1.
0001: line 418
+* add_inherit_target_child_root() would've added only those 
that are

add_inherit_target_child_root() doesn't exist now, so the above comment needs to 
be modified.

2.
0001: line 508-510

In set_inherit_target_rel_pathlists():
+   /* Nothing to do if all the children were excluded. */
+   if (IS_DUMMY_REL(rel))
+   return;

This code isn't needed, or can be replaced by an Assert, because 
set_inherit_target_rel_pathlists is only called from set_rel_pathlist, which 
checks IS_DUMMY_REL(rel) before calling set_inherit_target_rel_pathlists, as 
follows.

set_rel_pathlist(...)
{
...
if (IS_DUMMY_REL(rel))
{
/* We already proved the relation empty, so nothing more to do */
}
else if (rte->inh)
{
/*
 * If it's a target relation, set the pathlists of children instead.
 * Otherwise, we'll append the outputs of children, so process it as
 * an "append relation".
 */
if (root->inherited_update && root->parse->resultRelation == rti) 
{
inherited_update = true;
set_inherit_target_rel_pathlists(root, rel, rti, rte);

3.
0001: line 1919-1920

-   case CONSTRAINT_EXCLUSION_ON:
-   break;  /* always try to exclude */

CONSTRAINT_EXCLUSION_ON is no longer used, so should we also remove it from the 
GUC parameters?

4.
0001:

Can we remove enum InheritanceKind which is no longer used?


I also looked at the patch from the perspective of separating out code from 0001 
that is not essential to the overhaul of inherited update/delete planning. 
Although almost all of the code is interrelated, I found the two things below 
that could be moved to another patch.

---
0001: line 550-608

This section seems to be just a refinement of set_append_rel_size().
So can we separate this from 0001 into another patch?

---
0001: line 812-841, 940-947, 1525-1536, 1938-1947 

This code is related to the removal of InheritanceKind from 
relation_excluded_by_constraints(), so I think it is something like a cleanup of 
unneeded code. Can we separate it into a patch such as 
some-code-cleanups-of-0001.patch? Of course, we can do that only if removing 
this code from 0001 does not break "make check" for 0001.
I also think that what I pointed out in 3. and 4. above could be included in 
some-code-cleanups-of-0001.patch.

What do you think?


--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-02-07 Thread Imai, Yoshikazu
Tsunakawa-san,

On Wed, Feb 6, 2019 at 2:04 AM, Tsunakawa, Takayuki wrote:
> Can you compare the performance of auto and force_custom_plan again with
> the attached patch?  It uses PGPROC's LOCALLOCK list instead of the hash
> table.

Thanks for the patch, but it seems to have some problems.
When I executed create/drop/select commands against tables with many partitions, 
like more than 512, the backend died unexpectedly. Since I could see a 
difference in the performance of auto and force_custom_plan when the partition 
count is large, the patch needs to be fixed so that we can check whether 
performance is improved.

Thanks
--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-01-29 Thread Imai, Yoshikazu
On Wed, Jan 30, 2019 at 3:26 PM, Alvaro Herrera wrote:
> On 2019-Jan-30, Imai, Yoshikazu wrote:
> 
> > Why I did these tests is that I wanted to confirm that even if we
> > apply each patch one by one, there's no performance problem. Because
> > patches are quite large, I just felt it might be difficult to commit
> > these patches all at once and I thought committing patch one by one
> > would be another option to commit these patches. I don't know there
> is
> > the rule in the community how patches should be committed, and if
> > there, my thoughts above may be bad.
> 
> There are no absolute rules, but if I was committing it, I would certainly
> commit each separately, mostly because reviewing the whole series at once
> looks daunting ... and given the proposed commit messages, I'd guess that
> writing a combined commit message would also be very difficult.

Ah, I see.
 
> So thanks for doing these tests.

I'm glad to hear that!


Thanks
--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-01-29 Thread Imai, Yoshikazu
Amit-san,

On Tue, Jan 29, 2019 at 10:11 AM, Amit Langote wrote:
> On 2019/01/28 10:44, Imai, Yoshikazu wrote:
> > On Thu, Jan 24, 2019 at 6:10 AM, Imai, Yoshikazu wrote:
> >> updating partkey case:
> >>
> >> part-num  master 0001 0002 0003 0004
> >> 18215.34  7924.99  7931.15  8407.40  8475.65
> >> 27137.49  7026.45  7128.84  7583.08  7593.73
> >> 45880.54  5896.47  6014.82  6405.33  6398.71
> >> 84222.96  4446.40  4518.54  4802.43  4785.82
> >> 16   2634.91  2891.51  2946.99  3085.81  3087.91
> >> 32935.12  1125.28  1169.17  1199.44  1202.04
> >> 64352.37   405.27   417.09   425.78   424.53
> >> 128   236.26   310.01   307.70   315.29   312.81
> >> 25665.3686.8487.6784.3989.27
> >> 51218.3424.8423.5523.9123.91
> >> 10244.83 6.93 6.51 6.45 6.49
> >
> > I also tested with non-partitioned table case.
> >
> > updating partkey case:
> >
> > part-num  master 0001 0002 0003 0004
> > 010956.7  10370.5  10472.6  10571.0  10581.5
> > 18215.34  7924.99  7931.15  8407.40  8475.65
> > ...
> > 10244.83 6.93 6.51 6.45 6.49
> >
> >
> > In my performance results, it seems update performance degrades in
> non-partitioned case with v17-patch applied.
> > But it seems this degrades did not happen at v2-patch.
> >
> > On Thu, Aug 30, 2018 at 1:45 AM, Amit, Langote wrote:
> >> UPDATE:
> >>
> >> nparts  master00010002   0003
> >> ==  ==   
> >> 0 285628932862   2816
> >
> > Does this degradation only occur in my tests? Or if this result is correct,
> what may causes the degradation?
> 
> I re-ran tests with v18 using the following setup [1]:
> 
> create table ht (a int, b int) partition by hash (b); create table ht_0
> partition of ht for values with (modulus N, remainder 0); ...
> $ cat update-noprune.sql
> update ht set a = 0;
> 
> pgbench -n -T 60 -f update-noprune,sql
> 
> TPS:
> 
> nparts  master00010002   0003   0004
> ==  ==      
> 0   4408  43354423   4379   4314
> 1   3883  38733679   3856   4007
> 2   3495  34763477   3500   3627
> 
> I can see some degradation for small number of partitions, but maybe it's
> just noise?  At least, I don't yet have a systematic explanation for that
> happening.

Thanks for testing.

I also re-ran the tests with v18 using almost the same settings as before, but 
this time I ran pgbench for 60 seconds instead of the 30 seconds of the previous 
test.

TPS:

 nparts  master   0001   0002   0003   0004
 ======  ======  =====  =====  =====  =====
 0        10794  11018  10761  10552  11066
 1         7574   7625   7558   8071   8219
 2         6745   6778   6746   7281   7344

I see no degradation, so I also think the performance degradation in my previous 
test and in your test was just noise.


The reason I did these tests is that I wanted to confirm there's no performance 
problem even if we apply the patches one by one. Because the patches are quite 
large, I felt it might be difficult to commit them all at once, and committing 
them one by one would be another option. I don't know whether the community has 
a rule for how patches should be committed, and if there is one, my thoughts 
above may be misguided.

Anyway, I'll restart code reviewing :)

--
Yoshikazu Imai


RE: speeding up planning with partitions

2019-01-27 Thread Imai, Yoshikazu
On Thu, Jan 24, 2019 at 6:10 AM, Imai, Yoshikazu wrote:
> updating partkey case:
> 
part-num    master     0001     0002     0003     0004
1          8215.34  7924.99  7931.15  8407.40  8475.65
2          7137.49  7026.45  7128.84  7583.08  7593.73
4          5880.54  5896.47  6014.82  6405.33  6398.71
8          4222.96  4446.40  4518.54  4802.43  4785.82
16         2634.91  2891.51  2946.99  3085.81  3087.91
32          935.12  1125.28  1169.17  1199.44  1202.04
64          352.37   405.27   417.09   425.78   424.53
128         236.26   310.01   307.70   315.29   312.81
256          65.36    86.84    87.67    84.39    89.27
512          18.34    24.84    23.55    23.91    23.91
1024          4.83     6.93     6.51     6.45     6.49

I also tested with non-partitioned table case.

updating partkey case:

part-num    master     0001     0002     0003     0004
0          10956.7  10370.5  10472.6  10571.0  10581.5
1          8215.34  7924.99  7931.15  8407.40  8475.65
...
1024          4.83     6.93     6.51     6.45     6.49


In my performance results, it seems update performance degrades in the 
non-partitioned case with the v17 patch applied.
But it seems this degradation did not happen with the v2 patch.

On Thu, Aug 30, 2018 at 1:45 AM, Amit, Langote wrote:
> UPDATE:
> 
> nparts  master00010002   0003
> ==  ==   
> 0 285628932862   2816

Does this degradation only occur in my tests? Or, if this result is correct, 
what might cause the degradation?

--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-01-23 Thread Imai, Yoshikazu
On Wed, Jan 23, 2019 at 1:35 AM, Amit Langote wrote:
> Rebased due to the heap_open/close() -> table_open/close() change.

Maybe there are not many things I can point out by reviewing the patch, so I ran 
performance tests against the v17 patches instead of reviewing the code.
There are already a lot of tests for the partition pruning case, and we 
confirmed performance improves in those cases. This time, I tested the case of 
accessing all partitions.

I tested with master, master + 0001, master + 0001 + 0002, ..., master + 0001 + 
0002 + 0003 + 0004.
I ran pgbench 3 times for each test case, and the results below are the averages.

[postgresql.conf]
max_parallel_workers = 0
max_parallel_workers_per_gather = 0

[partition table definitions(8192 partitions case)]
create table rt (a int, b int, c int) partition by range (a);
create table rt_1 partition of rt for values from (1) to (2);
...
create table rt_8192 partition of rt for values from (8191) to (8192);

[pgbench commands]
pgbench -n -f update.sql -T 30 postgres

[update.sql(updating partkey case)]
update rt set a = 1;

[update.sql(updating non-partkey case)]
update rt set b = 1;

[results]
updating partkey case:

part-num    master     0001     0002     0003     0004
1          8215.34  7924.99  7931.15  8407.40  8475.65
2          7137.49  7026.45  7128.84  7583.08  7593.73
4          5880.54  5896.47  6014.82  6405.33  6398.71
8          4222.96  4446.40  4518.54  4802.43  4785.82
16         2634.91  2891.51  2946.99  3085.81  3087.91
32          935.12  1125.28  1169.17  1199.44  1202.04
64          352.37   405.27   417.09   425.78   424.53
128         236.26   310.01   307.70   315.29   312.81
256          65.36    86.84    87.67    84.39    89.27
512          18.34    24.84    23.55    23.91    23.91
1024          4.83     6.93     6.51     6.45     6.49


updating non-partkey case:

part-num    master     0001     0002     0003      0004
1          8862.58  8421.49  8575.35  9843.71  10065.30
2          7715.05  7575.78  7654.28  8800.84   8720.60
4          6249.95  6321.32  6470.26  7278.14   7280.10
8          4514.82  4730.48  4823.37  5382.93   5341.10
16         2815.21  3123.27  3162.51  3422.36   3393.94
32          968.45  1702.47  1722.38  1809.89   1799.88
64          364.17   420.48   432.87   440.20    435.31
128         119.94   148.77   150.47   152.18    143.35
256          45.09    46.35    46.93    48.30     45.85
512           8.74    10.59    10.23    10.27     10.13
1024          2.28     2.60     2.56     2.57      2.51


Looking at the results, if we apply only 0001, or 0001 + 0002, and the number of 
partitions is small, like 1 or 2, performance degrades compared to master (the 
maximum reduction is about 5%, from 8863 to 8421).
In all other cases, performance improves compared to master.

--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-01-21 Thread Imai, Yoshikazu
Tsunakawa-san

Thanks for the information.
I haven't used it yet, but I did use perf to clarify the difference before and 
after the creation of the generic plan, and I noticed that the share of 
hash_seq_search() increases by about 3% in EXECUTE queries after the generic 
plan is created.
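
(For anyone wanting to reproduce this, a typical way to collect such a profile 
is something like the following, where <backend_pid> is the pid of the backend 
serving the pgbench connection:)

perf record -g -p <backend_pid> -- sleep 30
perf report --sort symbol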

What I understand so far is that about 10,000 while-loop iterations in total 
(4098 + 4098 + some extra) are needed in hash_seq_search() per EXECUTE query 
after the creation of the generic plan.
10,000 iterations take about 10 microseconds (of course, we can't estimate the 
exact time), and the difference in latency between the 5th and 7th EXECUTE is 
about 8 microseconds, so I currently think this is what causes the difference.

I don't know whether this problem relates to Amit-san's patch, but I'll continue 
to investigate it.

Yoshikazu Imai



RE: speeding up planning with partitions

2019-01-21 Thread Imai, Yoshikazu
I measured the latency of queries executed before and after creating the generic 
plan with master + the v15 patch.

In this test, the table is partitioned into 4k partitions.
I executed 400,001 queries with pgbench.
I changed the timing of creating the generic plan to the 1st, 10,001st, 
20,001st, 50,001st, ..., 390,001st execution by changing the source code.
I ran the test with both plan_cache_mode = auto and plan_cache_mode = 
force_custom_plan.
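
(For reference, the source change is essentially an adjustment of the 
custom-plan threshold in choose_custom_plan() in plancache.c; the stock check 
looks like the following, and replacing the constant, e.g. with 0 or 10000, 
shifts when the generic plan is first considered:)

/* Generate custom plans until we have done at least 5 (arbitrary) */
if (plansource->num_custom_plans < 5)   /* replace 5 with 0, 10000, ... */
    return true;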

The results are below.
The value in the "before" column shows the average latency, in microseconds, of 
the queries executed before the generic plan is created.
The value in the "after" column shows the average latency, in microseconds, of 
the queries executed after the generic plan is created.

[auto]
time of creating generic plan  | before[usec]  |  after[usec]
  1st 531 142
 10,001st 144 141
 20,001st 141 144
 50,001st 133 140
200,001st 131 143
390,001st 130 138


[force_custom_plan]
time of creating generic plan  | before[usec]  |  after[usec]
  1st
 10,001st 144 129
 20,001st 137 131
 50,001st 133 134
200,001st 131 133
390,001st 132 131

* A generic plan is actually not created with plan_cache_mode = 
force_custom_plan.


Looking at the results for force_custom_plan, the average latency over the 
first 10,000 transactions is 144 microseconds, while over the first 50,000 
transactions it is 133 microseconds.
I think that is because the relcache hash table is not yet hot during the early 
transactions (as David mentioned?).

Comparing the latencies in the "after" column between auto and 
force_custom_plan, the auto ones are about 8% higher than the force_custom_plan 
ones.
That is, it seems creating a generic plan affects the latency of the queries 
executed after the generic plan is created.


On Mon, Jan 21, 2019 at 1:32 AM, David Rowley wrote:
> It would be interesting to see the profiles of having the generic plan
> being built on the 6th execution vs the 400,000th execution.
> 
> I'd thought maybe one difference would be the relcache hash table
> having been expanded greatly after the generic plan was created

Does that mean that, when executing queries after the generic plan was created, 
searching for entries in the relcache hash table becomes slower, which increases 
the latency?

> but I
> see even the generic plan is selecting a random partition, so the
> cache would have ended up with that many items eventually anyway, and
> since we're talking in the odds of 7.8k TPS with 4k partitions, it
> would have only have taken about 2-3 seconds out of the 60 second run
> to hit most or all of those partitions anyway.

And does it mean that even if we execute a lot of custom plans without creating 
a generic plan, the cache would end up at the same size as it is after creating 
the generic plan?


Anyway, I'll check the relcache size.
Since I don't know how to get a profile at exactly the time the generic plan is 
built, I'll use the MemoryContextStats(MemoryContext *) function to check the 
relcache size before/after building the generic plan and after executing a lot 
of custom plans.
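
(Relcache entries live in CacheMemoryContext, so one simple way is to attach a 
debugger to the backend and call the function directly; the statistics are 
written to the server's stderr:)

$ gdb -p <backend_pid>
(gdb) call MemoryContextStats(CacheMemoryContext)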



Yoshikazu Imai



RE: speeding up planning with partitions

2019-01-20 Thread Imai, Yoshikazu
Tsunakawa-san

On Thu, Jan 18, 2019 at 5:29 AM, Amit Langote wrote:
> On 2019/01/18 14:12, Tsunakawa, Takayuki wrote:
...
> > Isn't CheckCachedPlan() (and AcquireExecutorLocks() therein) called
> in every EXECUTE after 6th one due to some unknow issue?
> 
> CheckCachedPlan is only called if choose_custom_plan() returns false
> resulting in generic plan being created/reused.  With plan_cache_mode
> = auto, I see it always returns true, because a custom plan containing
> a single partition to scan is way cheaper than the generic plan.
> 
> > Does choose_custom_plan() always return true after 6th EXECUTE?
> 
> Yes.

Yes.
I also checked that choose_custom_plan() always returns true except on the 6th
EXECUTE, and that CheckCachedPlan() is only called on the 6th EXECUTE.

Yoshikazu Imai 



RE: speeding up planning with partitions

2019-01-20 Thread Imai, Yoshikazu
Hi Amit-san,

On Wed, Jan 17, 2019 at 10:25 AM, Amit Langote wrote:
> Thank you Imai-san for testing.  Sorry it totally slipped my mind to reply to 
> this email.

Thanks for replying, and sorry for my own late reply. I was undertaking 
on-the-job training last week.

> Are you saying that, when using auto mode, all executions of the query
> starting from 7th are slower than the first 5 executions?  That is, the
> latency of creating and using a custom plan increases *after* a generic
> plan is created and discarded on the 6th execution of the query?  If so,
> that is inexplicable to me.

Yes. And it's also inexplicable to me.

I'll check whether this is really correct by measuring the time of the
first 5 queries, before the generic plan is created, and of the other queries,
after the generic plan is created.

Yoshikazu Imai 



RE: speeding up planning with partitions

2019-01-10 Thread Imai, Yoshikazu
On Thu, Jan 10, 2019 at 6:10 PM, Imai, Yoshikazu wrote:
> > Does that not mean that the if (parent->top_parent_relids) will always
> > be false in build_dummy_partition_rel() and it'll only ever get
> > rtekind == RTE_RELATION?
> 
> At least, I checked if (parent->top_parent_relids) can be true if I
> execute below SQL.
> 
> select * from parent p1 inner join parent p2 on p1.a=p2.a where p1.id < 15;

Sorry, I also made a mistake. What I actually executed was:

--
Yoshikazu Imai


> -Original Message-
> From: Imai, Yoshikazu [mailto:imai.yoshik...@jp.fujitsu.com]
> Sent: Friday, January 11, 2019 3:10 PM
> To: 'David Rowley' ; Amit Langote
> 
> Cc: Amit Langote ; Alvaro Herrera
> ; Pg Hackers 
> Subject: RE: speeding up planning with partitions
> 
> Hi David,
> 
> On Thu, Jan 10, 2019 at 4:02 PM, David Rowley wrote:
> > 3. I wonder if there's a better way to handle what
> > build_dummy_partition_rel() does. I see the child's relid to the
> > parent's relid and makes up a fake AppendRelInfo and puts it in the
> > parent's slot.  What's going to happen when the parent is not the
> > top-level parent? It'll already have a AppendRelInfo set.
> ...
> >
> > select * from parent p1 inner join parent p2 on p1.a=p2.a where p1.id
> > < 10;
> 
> I think there is a mistake in the select SQL.
> "p1.id < 10" doesn't prune any partition because tables are partitioned
> by column "a" in your definition. Isn't it?
> 
> > Does that not mean that the if (parent->top_parent_relids) will always
> > be false in build_dummy_partition_rel() and it'll only ever get
> > rtekind == RTE_RELATION?
> 
> At least, I checked if (parent->top_parent_relids) can be true if I
> execute below SQL.
> 
> select * from parent p1 inner join parent p2 on p1.a=p2.a where p1.id
> < 15;
> 
> I couldn't check other points you mentioned, but I also think
> build_dummy_partition_rel() needs more consideration because I felt it
> has complicated logic when I was checking around here.
> 
> 
> Amit,
> I also realized there are some mistakes in the comments around this
> function.
> 
> + * build_dummy_partition_rel
> + *   Build a RelOptInfo and AppendRelInfo for a pruned
> partition
> s/and AppendRelInfo/and an AppendRelInfo/
> 
> +  * Now we'll need a (no-op) AppendRelInfo for parent, because
> we're
> +  * setting the dummy partition's relid to be same as the parent's.
> s/a \(no-op\) AppendRelInfo/an \(no-op\) AppendRelInfo/
> 
> --
> Yoshikazu Imai



RE: speeding up planning with partitions

2019-01-10 Thread Imai, Yoshikazu
Hi David,

On Thu, Jan 10, 2019 at 4:02 PM, David Rowley wrote:
> 3. I wonder if there's a better way to handle what
> build_dummy_partition_rel() does. I see the child's relid to the
> parent's relid and makes up a fake AppendRelInfo and puts it in the
> parent's slot.  What's going to happen when the parent is not the
> top-level parent? It'll already have a AppendRelInfo set.
... 
> 
> select * from parent p1 inner join parent p2 on p1.a=p2.a where p1.id < 10;

I think there is a mistake in the SELECT statement:
"p1.id < 10" doesn't prune any partitions, because the tables are partitioned by
column "a" in your definition. Isn't that right?

> Does that not mean that the if (parent->top_parent_relids) will always
> be false in build_dummy_partition_rel() and it'll only ever get
> rtekind == RTE_RELATION?

At least, I confirmed that if (parent->top_parent_relids) can be true when I
execute the SQL below.

select * from parent p1 inner join parent p2 on p1.a=p2.a where p1.id < 15;

I couldn't check the other points you mentioned, but I also think
build_dummy_partition_rel() needs more consideration, because its logic seemed
complicated to me when I was checking around this area.


Amit,
I also realized there are some mistakes in the comments around this function.

+ * build_dummy_partition_rel
+ * Build a RelOptInfo and AppendRelInfo for a pruned partition 
s/and AppendRelInfo/and an AppendRelInfo/

+* Now we'll need a (no-op) AppendRelInfo for parent, because we're
+* setting the dummy partition's relid to be same as the parent's.
s/a \(no-op\) AppendRelInfo/an \(no-op\) AppendRelInfo/

--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-01-09 Thread Imai, Yoshikazu
Hi Amit,

On Mon, Jan 7, 2019 at 6:30 PM, Amit Langote wrote:
> On 2018/12/27 20:25, Amit Langote wrote:
> > Here's the v10 of the patch.  I didn't get chance to do more changes
> > than those described above and address Imai-san's review comments
> yesterday.
> >
> > Have a wonderful new year!  I'll be back on January 7 to reply on this
> thread.
> 
> Rebased due to changed copyright year in prepunion.c.  Also realized that
> the new files added by patch 0004 still had 2018 in them.

Thank you for new patches.


I also have some comments on 0001, set_inherit_target_rel_sizes().

In set_inherit_target_rel_sizes():

Some code is not placed in the same order as in set_append_rel_size().

0001: at line 325-326,
+   ListCell   *l;
+   bool        has_live_children;

In set_append_rel_size(), "has_live_children" is above of the "ListCell *l";

0001: at line 582-603
+   if (IS_DUMMY_REL(childrel))
+   continue;
+
...
+   Assert(childrel->rows > 0);
+
+   /* We have at least one live child. */
+   has_live_children = true;

In set_append_rel_size(), 
+   /* We have at least one live child. */
+   has_live_children = true;
is directly under
+   if (IS_DUMMY_REL(childrel))
+   continue;

0001: at line 606-622,
+   if (!has_live_children)
+   {
+   /*
+* All children were excluded by constraints, so mark the relation
+* ass dummy.  We must do this in this phase so that the rel's
+* dummy-ness is visible when we generate paths for other rels.
+*/
+   set_dummy_rel_pathlist(rel);
+   }
+   else
+   /*
+* Set a non-zero value here to cope with the caller's requirement
+* that non-dummy relations are actually not empty.  We don't try to
+* be accurate here, because we're not going to create a path that
+* combines the children outputs.
+*/
+   rel->rows = 1;

In set_append_rel_size(), the condition of the if clause is has_live_children, 
not !has_live_children.

I also noticed that the else branch has no block braces while the if branch does.

--
Yoshikazu Imai



RE: speeding up planning with partitions

2019-01-08 Thread Imai, Yoshikazu
Hi,

I ran performance tests for non-prepared queries and for prepared queries with
plan_cache_mode='auto' and plan_cache_mode='force_custom_plan'. I also varied 
the number of partitions between 256 and 4096. I ran the tests on master and on 
the v9 patches.

[settings]
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096

[partitioning table definitions(with 4096 partitions)]
create table rt (a int, b int, c int) partition by range (a);

\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' ||
 (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 4096) x;
\gexec
\o

[pgbench(with 4096 partitions)]
no prepared: pgbench -n -f select4096.sql -T 60
prepared:pgbench -n -f select4096.sql -T 60 -M prepared

[select4096.sql]
\set a random(1, 4096)
select a from rt where a = :a;

[results]
master:
  part_num  no-prepared   auto  force_custom_plan  (1 - auto/force_custom_plan)
       256          604    571                576                          0.01
      4096         17.5   17.5               15.1                         -0.16

patched:
  part_num  no-prepared   auto  force_custom_plan  (1 - auto/force_custom_plan)
       256         8614   9446               9384                        -0.006
      4096         7158   7165               7864                         0.089


There is almost no difference between auto and force_custom_plan with 256
partitions, but there is a difference with 4096 partitions: while auto is faster
than force_custom_plan on master, force_custom_plan is faster than auto on
patched.

I wonder why force_custom_plan is faster than auto after applying the patch.

When we use PREPARE/EXECUTE with plan_cache_mode='auto', a generic plan is
created and used if its cost is cheaper than creating and using a custom plan,
while with plan_cache_mode='force_custom_plan' a custom plan is always created
and used.
So one might think the difference in the above results comes from creating or
using a generic plan.
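
(As a minimal illustration of that rule, using the rt table defined above:)

set plan_cache_mode = 'auto';
prepare q(int) as select a from rt where a = $1;
execute q(1);  -- executions 1-5 always use a custom plan
execute q(2);
execute q(3);
execute q(4);
execute q(5);
execute q(6);  -- from here on, a generic plan is used whenever its estimated
               -- cost is not worse than the average custom-plan cost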

I checked how many times a generic plan is created while running pgbench and
found that, with plan_cache_mode='auto', a generic plan is created only once and
custom plans are created at all other times. I also checked the time spent
creating the generic plan, but it wasn't much (250 ms or so with 4096
partitions). So the time spent creating the generic plan itself does not explain
the performance difference.

Currently, a generic plan is created on the sixth execution of an EXECUTE query.
I changed it to happen much later (e.g., on the 400,000th execution of an
EXECUTE query on master with 4096 partitions, because 7000 TPS x 60 sec =
420,000 transactions are run during the pgbench run), and then there was almost
no difference between auto and force_custom_plan. So I think the creation of a
generic plan affects the execution time of the queries issued after the generic
plan is created.

If my assumption is right, we can expect some additional processing to occur
when executing the queries issued after a generic plan is created, which would
make auto slower than force_custom_plan. But looking at the above results, on
master with 4096 partitions, auto is faster than force_custom_plan.
So now I am confused.

Do you have any idea what affects the performance here?

--
Yoshikazu Imai


RE: speeding up planning with partitions

2018-12-24 Thread Imai, Yoshikazu
Hi, Amit,

On Fri, Dec 7, 2018 at 0:57 AM, Imai, Yoshikazu wrote:
> OK. I will continue the review of 0001 before/after your supplying of
> next patch with keeping those in mind.

Here's the continuation of the review. Almost all of the comments below are
small fixes.

---
0001: line 76-77
In commit message:
  exclusion for target child relation, which is no longer
  is no longer needed.  Constraint exclusion runs during query_planner

s/which is no longer is no longer needed/which is no longer needed/

---
0001: line 464
+   if (IS_DUMMY_REL(find_base_rel(root, resultRelation )))

s/resultRelation )))/resultRelation)))/
(There is an extra space.)

---
0001: line 395-398
+* Reset inh_target_child_roots to not be same as parent root's so that
+* the subroots for this child's own children (if any) don't end up in
+* root parent's list.  We'll eventually merge all entries into one list,
+* but that's now now.

s/that's now now/that's not now/

---
0001: line 794
+* are put into  a list that will be controlled by a single ModifyTable

s/are put into  a list/are put into a list/
(There are two spaces between "into" and "a".)

---
0001: line 241-242, 253-254, 291-294 (In set_append_rel_size())

+   if (appinfo->parent_relid == root->parse->resultRelation)
+   subroot = adjust_inherit_target_child(root, childrel, appinfo);

+   add_child_rel_equivalences(subroot, appinfo, rel, childrel,
+  root != subroot);

+   if (subroot != root)
+   {
+   root->inh_target_child_roots =
+   lappend(root->inh_target_child_roots, subroot);

The boolean value of "appinfo->parent_relid == root->parse->resultRelation" is
the same as that of "subroot != root" (because of lines 241-242), so we can use
a bool variable here, like
child_is_target = (appinfo->parent_relid == root->parse->resultRelation).
The name child_is_target is taken from the arguments of
add_child_rel_equivalences() in your patch.
I attached a little diff as "v9-0001-delta.diff".

---
0001: line 313-431

adjust_inherit_target_child() is in allpaths.c in your patch, but it contains
code like the snippets below, which in master (without the patch) live in
planner.c or planmain.c, so could it be placed in planner.c (or planmain.c)?
This point is less important; I was just wondering whether
adjust_inherit_target_child() should be in allpaths.c, planner.c, or planmain.c.

+   /* Translate the original query's expressions to this child. */
+   subroot = makeNode(PlannerInfo);
+   memcpy(subroot, root, sizeof(PlannerInfo));

+   root->parse->targetList = root->unexpanded_tlist;
+   subroot->parse = (Query *) adjust_appendrel_attrs(root,
+  (Node *) root->parse,
+  1, &appinfo);

+   tlist = preprocess_targetlist(subroot);
+   subroot->processed_tlist = tlist;
+   build_base_rel_tlists(subroot, tlist);

---
0001: line 57-70

In commit message:
  This removes some existing code in inheritance_planner that dealt
  with any subquery RTEs in the query.  The rationale of that code
  was that the subquery RTEs may change during each iteration of
  planning (that is, for different children), so different iterations
  better use different copies of those RTEs.  
  ...
  Since with the new code
  we perform planning just once, I think we don't need this special
  handling.

0001: line 772-782
-* controlled by a single ModifyTable node.  All the instances share the
-* same rangetable, but each instance must have its own set of subquery
-* RTEs within the finished rangetable because (1) they are likely to get
-* scribbled on during planning, and (2) it's not inconceivable that
-* subqueries could get planned differently in different cases.  We need
-* not create duplicate copies of other RTE kinds, in particular not the
-* target relations, because they don't have either of those issues.  Not
-* having to duplicate the target relations is important because doing so
-* (1) would result in a rangetable of length O(N^2) for N targets, with
-* at least O(N^3) work expended here; and (2) would greatly complicate
-* management of the rowMarks list.

I spent considerable time trying to confirm that there is no need to copy
subquery RTEs in the new code, but so far I couldn't. If copied RTEs are only
used during planning, it might not be necessary to copy RTEs in the

RE: speeding up planning with partitions

2018-12-06 Thread Imai, Yoshikazu
Hi, Amit

Thanks for the quick modification.

On Wed, Dec 5, 2018 at 8:26 PM, Amit Langote wrote:
> > 1.
...
> > 5.
> > 0003: line 1620-1621
> > 
> > + * After creating the RelOptInfo for the given child RT index, it goes on to
> > + * initialize some of its fields base on the parent RelOptInfo.
> > 
> > s/fields base on/fields based on/
> 
> Fixed all of 1-5.

Thanks for fixing.

> > 6.
> > parsenodes.h
> >  906  *inh is true for relation references that should be expanded to include
> >  907  *inheritance children, if the rel has any.  This *must* be false for
> >  908  *RTEs other than RTE_RELATION entries.
> > 
> > I think inh can now become true even if the RTE's kind equals RTE_SUBQUERY,
> > so the latter sentence needs to be modified.
> 
> 
> 
> Seems like an existing comment bug.  Why don't you send a patch as you
> discovered it? :)

Thanks, I am pleased with your proposal. I'll post it as a small fix to the
comment.

> > 7.
> > 0005: line 109-115
... 
> Un-pruned partitions may become dummy due to contradictory constraints or
> constraint exclusion using normal CHECK constraints later and whether it's
> dummy is checked properly by functions that iterate over live_parts.

Ah, I understand: partitions are eliminated by contradictory constraints or by
constraint exclusion, both of which use constraints.

> Attached updated patches.  I have a few other changes in mind to make to
> 0001 such that the range table in each child's version of Query contains
> only that child table in place of the original target relation, instead of
> *all* child tables which is the current behavior.  The current behavior
> makes range_table_mutator a big bottleneck when the number of un-pruned
> target children is large.  But I'm saving it for the next week so that I

OK. I will continue the review of 0001 before/after your supplying of next
patch with keeping those in mind.

> can prepare for the PGConf.ASIA that's starting on Monday next week.  See
> you there. :)

Yeah, see you there. :)


--
Yoshikazu Imai



RE: speeding up planning with partitions

2018-12-04 Thread Imai, Yoshikazu
Hi Amit,

On Tue, Nov 20, 2018 at 10:24 PM, Amit Langote wrote:
> Attached v8 patches.

Thanks for the patch. I took a look 0003, 0005, 0006 of v8 patch.

1.
0003: line 267-268
+* Child relation may have be marked dummy if build_append_child_rel
+* found self-contradictory quals.

s/have be marked/have been marked/

2.
0003: around line 1077
In append.c(or prepunion.c)
228  * Check that there's at least one descendant, else treat as no-child
229  * case.  This could happen despite above has_subclass() check, if table
230  * once had a child but no longer does.

The has_subclass() check was moved from above this code into subquery_planner,
so the comment needs to be modified like below.

s/above has_subclass() check/has_subclass() check in subquery_planner/

3.
0003: line 1241-1244
0006: line ?

In comments of expand_partitioned_rtentry:
+ * Note: even though only the unpruned partitions will be added to the
+ * resulting plan, this still locks *all* partitions via find_all_inheritors
+ * in order to avoid partitions being locked in a different order than other
+ * places in the backend that may lock partitions.

This comment is no longer needed once the 0006 patch is applied, because
find_all_inheritors is removed in that patch.

4.
0003: line 1541-1544

+* Add the RelOptInfo.  Even though we may not really scan this relation
+* for reasons such as contradictory quals, we still need need to create
+* one, because for every RTE in the query's range table, there must be an
+* accompanying RelOptInfo.

s/need need/need/

5.
0003: line 1620-1621

+ * After creating the RelOptInfo for the given child RT index, it goes on to
+ * initialize some of its fields base on the parent RelOptInfo.

s/fields base on/fields based on/

6.
parsenodes.h
 906  *inh is true for relation references that should be expanded to include
 907  *inheritance children, if the rel has any.  This *must* be false for
 908  *RTEs other than RTE_RELATION entries.

I think inh can now become true even if the RTE's kind equals RTE_SUBQUERY, so
the latter sentence needs to be modified.

7.
0005: line 109-115
+   /*
+* If partition is excluded by constraints, remove it from
+* live_parts, too.
+*/
+   if (IS_DUMMY_REL(childrel))
+   parentrel->live_parts = bms_del_member(parentrel->live_parts, i);
+

When I read this comment, I imagined that relation_excluded_by_constraints()
would be called before this code. In fact, childrel is marked dummy if
build_append_child_rel found self-contradictory quals, so the comment could be
made more accurate, matching other comments in your patch, as below.

In 0003: line 267-271
+* Child relation may have be marked dummy if build_append_child_rel
+* found self-contradictory quals.
+*/
+   if (IS_DUMMY_REL(childrel))
+   continue;

8.
0003: line 705-711
+* Query planning may have added some columns to the top-level tlist,
+* which happens when there are row marks applied to inheritance
+* parent relations (additional junk columns needed for applying row
+* marks are added after expanding inheritance.)
+*/
+   if (list_length(tlist) < list_length(root->processed_tlist))
+   tlist = root->processed_tlist;

In grouping_planner():

  if (planned_rel == NULL)
  {
  ...
  root->processed_tlist = tlist;
  }
  else
  tlist = root->processed_tlist;
  ...
  if (current_rel == NULL)
  current_rel = query_planner(root, tlist,
  standard_qp_callback, &qp_extra);
  ...
  /*
   * Query planning may have added some columns to the top-level tlist,
   * which happens when there are row marks applied to inheritance
   * parent relations (additional junk columns needed for applying row
   * marks are added after expanding inheritance.)
   */
  if (list_length(tlist) < list_length(root->processed_tlist))
  tlist = root->processed_tlist;


Is there any case where tlist points to an address different from
root->processed_tlist after calling query_planner? Junk columns may be added to
root->processed_tlist after expanding inheritance, but that adding process
doesn't change root->processed_tlist's pointer address.
I checked that adding "Assert(tlist == root->processed_tlist)" after the call
to query_planner passes "make check".

9.
0003: line 1722-1763
In build_append_child_rel():

+   /*
+* In addition to the quals inherited from the parent, we might
+* have securityQuals associated with this particular child node.
+* (Currently this can only happen in appendrels originating from
+* UNION ALL; inheritance child tables don't have their own
+* securityQuals.)  Pull any such securityQuals up into the
...
+  

RE: speeding up planning with partitions

2018-11-14 Thread Imai, Yoshikazu
Hi Amit,

On Tue, Nov 13, 2018 at 10:29 PM, Amit Langote wrote:
> On 2018/11/12 13:35, Imai, Yoshikazu wrote:
> > adjust_appendrel_attrs_multilevel for leaf1: root -> sub1 -> leaf1
> > adjust_appendrel_attrs_multilevel for leaf2: root -> sub1 -> leaf2
> 
> Ah, I see what you mean.
> 
> The root -> sub1 translation will be repeated for each leaf partition
> if done via adjust_appendrel_attrs_multilevel.  On the other hand, if
> we could do the root to sub1 translation once and pass it to the recursive
> call using sub1 as the parent.
> 
> I've changed the patch use adjust_appendrel_attrs.
> 
> > Since it is difficult to explain my thoughts with words, I will show
> > the performance degradation case.
> >
> > Partition tables are below two sets.
> 
> [ ... ]
> 
> > Create a generic plan of updation or deletion.
> >
> > [create a delete generic plan]
> > set plan_cache_mode = 'force_generic_plan'; prepare delete_stmt(int)
> > as delete from rt where b = $1; execute delete_stmt(1);
> 
> [ ... ]
> 
> > How amount of memory is used with above tests is...
> >
> > without v5 patches, Set1: 242MB
> > without v5 patches, Set2: 247MB
> > with v5 patches, Set1: 420MB
> > with v5 patches, Set2: 820MB
> 
> Although I didn't aim to fix planning for the generic plan case where
> no pruning occurs, the above result is not acceptable.  That is, the new
> implementation of inheritance update/delete planning shouldn't consume
> more memory than the previous.  In fact, it should've consumed less,
> because the patch claims that it gets rid of redundant processing per
> partition.
> 
> I understood why update/delete planning consumed more memory with the
> patch.  It was due to a problem with the patch that modifies inheritance
> update/delete planning.  The exact problem was that the query tree would
> be translated (hence copied) *twice* for every partition!  First during
> query planning where the query tree would be translated to figure out
> a targetlist for partitions and then again before calling
> grouping_planner.
> Also, the adjust_appendrel_attrs_multilevel made it worse for
> multi-level partitioning case, because of repeated copying for root to
> intermediate partitioned tables, as Imai-san pointed out.
> 
> I've fixed that making sure that query tree is translated only once and
> saved for later steps to use.  Imai-san, please check the memory
> consumption with the latest patch.

Thanks for fixing!
Now, memory consumption is lower than the previous.

with v7 patches, Set1: 223MB
with v7 patches, Set2: 226MB

Thanks,

--
Yoshikazu Imai


RE: speeding up planning with partitions

2018-11-11 Thread Imai, Yoshikazu
Hi Amit,

On Thu, Nov 8, 2018 at 8:26 PM, Amit Langote wrote:
> On 2018/11/07 10:00, Imai, Yoshikazu wrote:
> > About inheritance_make_rel_from_joinlist(), I considered how it processes
> > joins for sub-partitioned-table.
> > 
> > sub-partitioned-table image:
> > part
> >   sub1
> >     leaf1
> >     leaf2
> > 
> > inheritance_make_rel_from_joinlist() translates join_list and join_info_list
> > for each leafs(leaf1, leaf2 in above image). To translate those two lists 
> > for
> > leaf1, inheritance_make_rel_from_joinlist() translates lists from part to 
> > sub1
> > and nextly from sub1 to leaf1. For leaf2, 
> > inheritance_make_rel_from_joinlist() 
> > translates lists from part to sub1 and from sub1 to leaf2. There are same
> > translation from part to sub1, and I think it is better if we can eliminate 
> > it.
> > I attached 0002-delta.patch.
> > 
> > In the patch, inheritance_make_rel_from_joinlist() translates lists not 
> > only for
> > leafs but for mid-nodes, in a depth-first manner, so it can use lists of
> > mid-nodes for translating lists from mid-node to leafs, which eliminates 
> > same
> > translation.
> 
> I don't think the translation happens twice for the same leaf partitions.
> 
> Applying adjust_appendrel_attrs_*multilevel* for only leaf nodes, as
> happens with the current code, is same as first translating using
> adjust_appendrel_attrs from part to sub1 and then from sub1 to leaf1 and
> leaf2 during recursion with sub1 as the parent.

Thanks for replying.
I interpreted your thoughts about translation as below.

adjust_appendrel_attrs_multilevel for leaf1: root -> sub1 -> leaf1
adjust_appendrel_attrs_multilevel for leaf2: sub1(produced at above) -> leaf2

But I wonder whether the translation for leaf2 actually reuses the result for
sub1 produced during the leaf1 translation. ISTM the translations for leaf1 and
leaf2 are executed as below.

adjust_appendrel_attrs_multilevel for leaf1: root -> sub1 -> leaf1
adjust_appendrel_attrs_multilevel for leaf2: root -> sub1 -> leaf2


> > I think it might be better if we can apply same logic to 
> > inheritance_planner(),
> > which once implemented the same logic as below. 
> > 
> > 
> > foreach(lc, root->append_rel_list)
> > {
> > AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
> > ...
> > /*
> >  * expand_inherited_rtentry() always processes a parent before any 
> > of
> >  * that parent's children, so the parent_root for this relation 
> > should
> >  * already be available.
> >  */
> > parent_root = parent_roots[appinfo->parent_relid];
> > Assert(parent_root != NULL);
> > parent_parse = parent_root->parse;
> > ...
> > subroot->parse = (Query *)
> > adjust_appendrel_attrs(parent_root,
> >(Node *) parent_parse,
> >1, &appinfo);
> 
> Actually, inheritance_planner is also using
> adjust_appendrel_attrs_multilevel.  I'm not sure if you're looking at the
> latest (10/26) patch.

Sorry for my poor explanation. I quoted the above code as the old code, i.e.,
without the patch applied.


Since it is difficult to explain my thoughts with words, I will show the
performance degradation case.

Partition tables are below two sets.

Set1:
[create 800 partitions directly under root]
CREATE TABLE rt (a int, b int) PARTITION BY RANGE (a);
\o /dev/null
SELECT 'CREATE TABLE leaf' || x::text || ' PARTITION OF rt FOR VALUES FROM ('
|| (x)::text || ') TO (' || (x+1)::text || ');' FROM generate_series(0, 800) x;
\gexec
\o

Set2:
[create 800 partitions under a partitioned table which is under root]
CREATE TABLE rt (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE sub1 PARTITION OF rt FOR VALUES FROM (0) TO (100) PARTITION BY
RANGE (b);
\o /dev/null
SELECT 'CREATE TABLE leaf' || x::text || ' PARTITION OF sub1 FOR VALUES FROM ('
|| (x)::text || ') TO (' || (x+1)::text || ');' FROM generate_series(0, 800) x;
\gexec
\o


Create a generic plan of updation or deletion.

[create a delete generic plan]
set plan_cache_mode = 'force_generic_plan';
prepare delete_stmt(int) as delete from rt where b = $1;
execute delete_stmt(1);


When creating a generic plan, paths/plans for all partitions are created,
because we don't know which plan will be used until an "EXECUTE" command
happens. When creating paths in inheritance_planner(),
adjust_appendrel_attrs()/adjust_appendrel_attrs_multilevel() is executed many
times and it allocates a

RE: Small performance tweak to run-time partition pruning

2018-11-07 Thread Imai, Yoshikazu
On Mon, Oct 22, 2018 at 8:31 PM, David Rowley wrote:
> On 18 October 2018 at 16:13, Imai, Yoshikazu
>  wrote:
> > The patch improves the performance about 1.3% which is less than
> > David's result, but it seems still improves the performance.
> 
> Thanks for doing these benchmarks.
> 
> The speedup is small, but it becomes much more significant once other
> bottlenecks are removed. More partitions may show a larger increase, but
> more partitions also means that a larger range table array gets built
> during ExecInitRangeTable(), which is also slow.

Since I have done enough tests and ISTM the patch is in good shape,
I changed the status of this patch to "Ready for Committer".

--
Yoshikazu Imai



RE: speeding up planning with partitions

2018-11-06 Thread Imai, Yoshikazu
About inheritance_make_rel_from_joinlist(), I considered how it processes
joins for sub-partitioned tables.

sub-partitioned-table image:
part
  sub1
    leaf1
    leaf2
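
For concreteness, a tree of this shape can be created as below (the bounds here
are arbitrary choices of mine):

CREATE TABLE part (a int, b int) PARTITION BY RANGE (a);
CREATE TABLE sub1 PARTITION OF part FOR VALUES FROM (0) TO (100) PARTITION BY RANGE (b);
CREATE TABLE leaf1 PARTITION OF sub1 FOR VALUES FROM (0) TO (50);
CREATE TABLE leaf2 PARTITION OF sub1 FOR VALUES FROM (50) TO (100);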

inheritance_make_rel_from_joinlist() translates join_list and join_info_list
for each leaf (leaf1 and leaf2 in the above image). To translate those two
lists for leaf1, inheritance_make_rel_from_joinlist() translates the lists from
part to sub1 and then from sub1 to leaf1. For leaf2, it translates the lists
from part to sub1 and from sub1 to leaf2. The translation from part to sub1 is
thus performed twice, and I think it is better if we can eliminate that.
I attached 0002-delta.patch.

In the patch, inheritance_make_rel_from_joinlist() translates the lists not
only for leaves but also for mid-level nodes, in a depth-first manner, so it
can use the lists of a mid-level node when translating lists from that node to
its leaves, which eliminates the duplicated translation.

I think it might be better if we can apply the same logic to
inheritance_planner(), which once implemented similar logic, shown below.


foreach(lc, root->append_rel_list)
{
AppendRelInfo *appinfo = lfirst_node(AppendRelInfo, lc);
...
/*
 * expand_inherited_rtentry() always processes a parent before any of
 * that parent's children, so the parent_root for this relation should
 * already be available.
 */
parent_root = parent_roots[appinfo->parent_relid];
Assert(parent_root != NULL);
parent_parse = parent_root->parse;
...
subroot->parse = (Query *)
adjust_appendrel_attrs(parent_root,
   (Node *) parent_parse,
   1, &appinfo);


--
Yoshikazu Imai



0002-delta.patch
Description: 0002-delta.patch


RE: speeding up planning with partitions

2018-11-05 Thread Imai, Yoshikazu
Hi,

On Thu, Oct 25, 2018 at 10:38 PM, Amit Langote wrote:
> And here is the latest set of patches.  Sorry it took me a while.

Thanks for revising the patch!

> I didn't get time today to repeat the performance tests, but I'm planning
> to next week.  In the meantime, please feel free to test them and provide
> comments on the code changes.

Since it takes me a lot of time to go through all of the patches, I will post
comments little by little, starting with easy parts like typo checks.


1.
0002:
+ * inheritance_make_rel_from_joinlist
+ * Perform join planning for all non-dummy leaf inheritance chilren
+ * in their role as query's target relation

"inheritance chilren" -> "inheritance children"


2.
0002:
+   /*
+* Sub-partitions have to be processed recursively, because
+* AppendRelInfoss link sub-partitions to their immediate 
parents, not
+* the root partitioned table.
+*/

AppendRelInfoss -> AppendRelInfos (?)


3.
0002:
+   /* Reset interal join planning data structures. */

interal -> internal


4.
0004:
-static void expand_inherited_rtentry(PlannerInfo *root, RangeTblEntry *rte,
-Index rti);

Comments referring to expand_inherited_rtentry() are left over.

backend/optimizer/plan/planner.c:1310:
   * Because of the way expand_inherited_rtentry works, that should be
backend/optimizer/plan/planner.c:1317:
   * Instead the duplicate child RTE created by expand_inherited_rtentry
backend/optimizer/util/plancat.c:118:
* the rewriter or when expand_inherited_rtentry() added it to the query's
backend/optimizer/util/plancat.c:640:
* the rewriter or when expand_inherited_rtentry() added it to the query's

About the last two comments in the above,
"the rewriter or when expand_inherited_rtentry() added it to the query's"
would be
"the rewriter or when add_inheritance_child_rel() added it to the query's".

I don't know how to correct the first two comments.


5.
0004:
-static void expand_partitioned_rtentry(PlannerInfo *root,
-  RangeTblEntry *parentrte,
-  Index parentRTindex, Relation parentrel,
-  PlanRowMark *top_parentrc, LOCKMODE lockmode,
-  List **appinfos);

Comments referring to expand_partitioned_rtentry() are also left over.

backend/executor/execPartition.c:941:
 /*
  * get_partition_dispatch_recurse
  *  Recursively expand partition tree rooted at rel
  *
  * As the partition tree is expanded in a depth-first manner, we maintain two
  * global lists: of PartitionDispatch objects corresponding to partitioned
  * tables in *pds and of the leaf partition OIDs in *leaf_part_oids.
  *
  * Note that the order of OIDs of leaf partitions in leaf_part_oids matches
  * the order in which the planner's expand_partitioned_rtentry() processes
  * them.  It's not necessarily the case that the offsets match up exactly,
  * because constraint exclusion might prune away some partitions on the
  * planner side, whereas we'll always have the complete list; but unpruned
  * partitions will appear in the same order in the plan as they are returned
  * here.
  */

I think the second paragraph of the comment is no longer correct.
expand_partitioned_rtentry() expands the partition tree in a depth-first
manner, whereas expand_append_rel() expands it neither in a depth-first nor in
a breadth-first manner, as below.

partitioned table tree image:
pt
  sub1
    sub1_1
      leaf0
    leaf1
  sub2
    leaf2
  leaf3

append_rel_list(expanded by expand_partitioned_rtentry):
  [sub1, sub1_1, leaf0, leaf1, sub2, leaf2, leaf3]

append_rel_list(expanded by expand_append_rel):
  [sub1, sub2, leaf3, sub1_1, leaf1, leaf0, leaf2]


6.
0004
+   /*
+* A partitioned child table with 0 children is a dummy rel, so don't
+* bother creating planner objects for it.
+*/
+   if (childrel->rd_rel->relkind == RELKIND_PARTITIONED_TABLE)
+   {
+   PartitionDesc partdesc = RelationGetPartitionDesc(childrel);
+
+   Assert(!RELATION_IS_OTHER_TEMP(childrel));
+   if (partdesc->nparts == 0)
+   {
+   heap_close(childrel, NoLock);
+   return NULL;
+   }
+   }
+
+   /*
+* If childrel doesn't belong to this session, skip it, also relinquishing
+* the lock.
+*/
+   if (RELATION_IS_OTHER_TEMP(childrel))
+   {
+   heap_close(childrel, lockmode);
+   return NULL;
+   }

If we process the latter if block before the former one, the Assert can be
removed from the code.



It might be difficult for me to examine whether there is any logical wrongness
in the code along with such significant planner changes, but I will t

RE: Small performance tweak to run-time partition pruning

2018-10-17 Thread Imai, Yoshikazu
Sorry for the delay in replying.

On Wed, Oct 10, 2018 at 4:05 PM, David Rowley wrote:
> > It seems to me that there is no problem in this patch as far.
> > Is there another thing I have to do for the review?
> 
> There's a checklist in [1]. Perhaps there's something mentioned there
> that you've missed.
> 
> [1] https://wiki.postgresql.org/wiki/Reviewing_a_Patch

Thanks for the URL.


I did the performance test, which is almost the same as yours, except that I
changed "\set p_id 1" in select.sql for another performance test whose result I
will send in the next mail. That probably doesn't affect the performance, so it
should be OK.

I tested with master(28d750c) + v2patch.

[Unpatched]
ave 3669 TPS

tps = 3700.319242 (excluding connections establishing)
tps = 3642.287089 (excluding connections establishing)
tps = 3668.243399 (excluding connections establishing)
tps = 3689.457722 (excluding connections establishing)
tps = 3714.309178 (excluding connections establishing)
tps = 3697.488958 (excluding connections establishing)
tps = 3573.372327 (excluding connections establishing)
tps = 3620.473191 (excluding connections establishing)
tps = 3689.794860 (excluding connections establishing)
tps = 3692.317099 (excluding connections establishing)

[Patched]
ave 3718 TPS

tps = 3751.639616 (excluding connections establishing)
tps = 3736.482071 (excluding connections establishing)
tps = 3747.613223 (excluding connections establishing)
tps = 3745.578446 (excluding connections establishing)
tps = 3662.612013 (excluding connections establishing)
tps = 3715.271028 (excluding connections establishing)
tps = 3718.671552 (excluding connections establishing)
tps = 3698.766946 (excluding connections establishing)
tps = 3639.026099 (excluding connections establishing)
tps = 3760.507508 (excluding connections establishing)

The patch improves the performance by about 1.3%, which is less than David's
result, but it still seems to improve the performance.


The above performance test doesn't execute run-time pruning, so we can't check
the performance of the loop code the patch modified. I therefore also did the
test below, which does execute run-time pruning.

setup:
CREATE TABLE partbench (id BIGINT NOT NULL, i1 INT NOT NULL, i2 INT
NOT NULL, i3 INT NOT NULL, i4 INT NOT NULL, i5 INT NOT NULL) PARTITION
BY RANGE (id);

\o /dev/null
select 'CREATE TABLE partbench' || x::text || ' PARTITION OF partbench
FOR VALUES FROM (' || (x*10)::text || ') TO (' ||
((x+1)*10)::text || ') PARTITION BY RANGE (i1);'
from generate_Series(0,299) x;
\gexec
\o

\o /dev/null
select 'CREATE TABLE partbench' || x::text ||
'_i1a PARTITION OF partbench' || x::text ||
' FOR VALUES FROM (0) TO (1);' from generate_Series(0,299) x;
\gexec
\o

\o /dev/null
select 'CREATE TABLE partbench' || x::text ||
'_i1b PARTITION OF partbench' || x::text ||
' FOR VALUES FROM (1) TO (2);' from generate_Series(0,299) x;
\gexec
\o


select-exec-init.sql:
\set p_id 1
select * from partbench where id = :p_id and i1 = (select 0);
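
To confirm that this query prunes at execution time rather than at plan time, I
look at EXPLAIN ANALYZE output (a sketch reusing the setup above; only the
"(never executed)" markers matter):

explain (analyze, costs off, timing off)
select * from partbench where id = 1 and i1 = (select 0);
-- subplans for pruned partitions are reported as "(never executed)",
-- because the value of (select 0) is only known at execution time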


[Unpatched]
ave  1060 TPS

tps = 1067.286500 (excluding connections establishing)
tps = 1052.705136 (excluding connections establishing)
tps = 1056.684966 (excluding connections establishing)
tps = 1059.803865 (excluding connections establishing)
tps = 1053.418776 (excluding connections establishing)
tps = 1053.383518 (excluding connections establishing)
tps = 1058.542617 (excluding connections establishing)
tps = 1071.875455 (excluding connections establishing)
tps = 1058.064092 (excluding connections establishing)
tps = 1066.869393 (excluding connections establishing)

[Patched]
ave 1069 TPS

tps = 1071.621247 (excluding connections establishing)
tps = 1067.881709 (excluding connections establishing)
tps = 1073.274357 (excluding connections establishing)
tps = 1058.648528 (excluding connections establishing)
tps = 1068.490598 (excluding connections establishing)
tps = 1064.739885 (excluding connections establishing)
tps = 1069.189778 (excluding connections establishing)
tps = 1070.253092 (excluding connections establishing)
tps = 1070.395411 (excluding connections establishing)
tps = 1071.003647 (excluding connections establishing)

The patch also improves the performance by about 0.85% in this case.

--
Yoshikazu Imai



RE: Small performance tweak to run-time partition pruning

2018-10-10 Thread Imai, Yoshikazu
On Tue, Oct 9, 2018 at 1:24 AM, I wrote:
> On Mon, Oct 8, 2018 at 1:00 AM, David Rowley wrote:
> > I've attached an updated patch which skips the re-sequence work when
> > doing that is not required for anything.
> 
> 
> 
> I saw the patch and it seems good to me about the codes.
> I still couldn't check additional test cases in the patch.

I checked an additional test which is:

On Thu, Sept 6, 2018 at 7:30 PM, David Rowley wrote:
> I've also included an additional test to ensure the other_subplans
> gets updated correctly. The other tests for this seem to only perform
> run-time pruning during init plan and do no further pruning, so don't
> fully test that other_subplans gets updated correctly.

I executed the SQL in this test under gdb and confirmed that it tests that
other_subplans gets updated correctly. It performs exec run-time pruning and
actually goes through the code in the patch which updates other_subplans.

I also did "make check-world" at the latest master e9edc1ba0b and all tests
passed successfully.

It seems to me that there is no problem in this patch so far.
Is there anything else I have to do for the review?


--
Yoshikazu Imai


RE: Why we allow CHECK constraint contradiction?

2018-10-10 Thread Imai, Yoshikazu
On Tue, Oct 9, 2018 at 6:01 PM, Amit Langote wrote:
> On 2018/10/10 14:25, Imai, Yoshikazu wrote:
> > Hi, all.
> >
> > I have a wonder about the behaviour of creating table which has a
> > constraint contradiction.
> >
> > I created below table.
> >
> > bugtest=# create table ct (a int, CHECK(a is not null and a >= 0 and a < 100 and a >= 200 and a < 300));
> > bugtest=# \d+ ct
> >                                     Table "public.ct"
> >  Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
> > --------+---------+-----------+----------+---------+---------+--------------+-------------
> >  a      | integer |           |          |         | plain   |              |
> > Check constraints:
> >     "ct_a_check" CHECK (a IS NOT NULL AND a >= 0 AND a < 100 AND a >= 200 AND a < 300)
> >
> > Are there any rows which can satisfy the ct's CHECK constraint? If
> > not, why we allow creating table when check constraint itself is
> contradicted?
> >
> >
> > I originally noticed this while creating partitioned range table as
> below.
> >
> > bugtest=# create table rt (a int) partition by range (a);
> > bugtest=# create table rt_sub1 partition of rt for values from (0) to (100) partition by range (a);
> > bugtest=# create table rt_sub2 partition of rt for values from (100) to (200) partition by range (a);
> > bugtest=# create table rt150 partition of rt_sub1 for values from (150) to (151);
> > bugtest=# \d+ rt_sub1
> >                                   Table "public.rt_sub1"
> >  Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
> > --------+---------+-----------+----------+---------+---------+--------------+-------------
> >  a      | integer |           |          |         | plain   |              |
> > Partition of: rt FOR VALUES FROM (0) TO (100)
> > Partition constraint: ((a IS NOT NULL) AND (a >= 0) AND (a < 100))
> > Partition key: RANGE (a)
> > Partitions: rt150 FOR VALUES FROM (150) TO (151)
> >
> > bugtest=# \d+ rt150
> >                                    Table "public.rt150"
> >  Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
> > --------+---------+-----------+----------+---------+---------+--------------+-------------
> >  a      | integer |           |          |         | plain   |              |
> > Partition of: rt_sub1 FOR VALUES FROM (150) TO (151)
> > Partition constraint: ((a IS NOT NULL) AND (a >= 0) AND (a < 100) AND (a IS NOT NULL) AND (a >= 150) AND (a < 151))
> >
> >
> > Any rows are not routed to rt150 through rt nor we can't insert any
> > rows to
> > rt150 directly because of its constraints. If we add check whether
> > constraint is contradicted, it prevent us from accidentally creating
> > useless table like above rt150 which would not contain any rows.
> >
> > I thought there might be a discussion or documentation about this, but
> > I couldn't find it. If there is, please also tell me that.
> 
> I had wondered about it when developing the partitioning feature about
> a couple of years ago and this is the response I'd gotten:
> 
> https://www.postgresql.org/message-id/CA+TgmoaQABrsLQK4ms_4NiyavyJGS-b6zfkzbbnc+-p5djj...@mail.gmail.com

Thanks for telling me about one of the discussions on this.

> To summarize, the answer I got was that it's pointless to create defenses
> against it inside the database.  It's on the users to create the
> constraints (or specify bounds) that are non-contradicting.

I just thought it would be kind to tell users when they mistakenly specify
such bounds.


> Interesting quotes from the above email:
>
> "If we allow partitioning on expressions, then it quickly becomes
> altogether impossible to deduce anything useful - unless you can solve
> the halting problem."
> 
> "... This patch is supposed to be implementing partitioning, not
> artificial intelligence."

It will take me a little more time to completely understand these interesting
quotes, but I guess I see the point.


Thanks again!

--
Yoshikazu Imai




RE: Why we allow CHECK constraint contradiction?

2018-10-09 Thread Imai, Yoshikazu
Thanks for replying!

On Tue, Oct 9, 2018 at 5:58 PM, Corey Huinker wrote:
> On Wed, Oct 10, 2018 at 1:44 AM David G. Johnston
> <david.g.johns...@gmail.com> wrote:
> 
>   On Tuesday, October 9, 2018, Imai, Yoshikazu
>   <imai.yoshik...@jp.fujitsu.com> wrote:
> 
>   Are there any rows which can satisfy the ct's CHECK
> constraint? If not, why we
>   allow creating table when check constraint itself is
> contradicted?
> 
> 
> 
>   I'd bet on it being a combination of complexity and insufficient
> expected benefit.  Time is better spent elsewhere.  Mathematically
> proving a contradiction in software is harder than reasoning about it
> mentally.
> 
> 
> I've actually used that as a feature, in postgresql and other databases,
> where assertions were unavailable, or procedural code was unavailable
> or against policy.
> 
> Consider the following:
> 
>   CREATE TABLE wanted_values ( x integer );
>   INSERT INTO wanted_values VALUES (1), (2), (3);
> 
>   CREATE TABLE found_values ( x integer );
>   INSERT INTO found_values VALUES (1), (3);
> 
>   CREATE TABLE missing_values (
>       x integer,
>       CONSTRAINT contradiction CHECK (false)
>   );
> 
>   INSERT INTO missing_values
>   SELECT x FROM wanted_values
>   EXCEPT
>   SELECT x FROM found_values;
> 
> gives the error
> 
>   ERROR:  new row for relation "missing_values" violates check constraint "contradiction"
>   DETAIL:  Failing row contains (2).
> 
> Which can be handy when you need to fail a transaction because of bad
> data and don't have branching logic available.

That's an interesting use! So there are useful cases for a table with a
contradictory constraint, beyond the time shortage/difficulty of implementing a
mathematical proof of contradiction.

--
Yoshikazu Imai



Why we allow CHECK constraint contradiction?

2018-10-09 Thread Imai, Yoshikazu
Hi, all.

I wonder about the behaviour of creating a table which has a contradictory
constraint.

I created the table below.

bugtest=# create table ct (a int, CHECK(a is not null and a >= 0 and a < 100 and a >= 200 and a < 300));
bugtest=# \d+ ct
                                    Table "public.ct"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           |          |         | plain   |              |
Check constraints:
    "ct_a_check" CHECK (a IS NOT NULL AND a >= 0 AND a < 100 AND a >= 200 AND a < 300)


Are there any rows which can satisfy ct's CHECK constraint? If not, why do we
allow creating a table whose check constraint is itself contradictory?


I originally noticed this while creating a range-partitioned table as below.

bugtest=# create table rt (a int) partition by range (a);
bugtest=# create table rt_sub1 partition of rt for values from (0) to (100) partition by range (a);
bugtest=# create table rt_sub2 partition of rt for values from (100) to (200) partition by range (a);
bugtest=# create table rt150 partition of rt_sub1 for values from (150) to (151);
bugtest=# \d+ rt_sub1
                                  Table "public.rt_sub1"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           |          |         | plain   |              |
Partition of: rt FOR VALUES FROM (0) TO (100)
Partition constraint: ((a IS NOT NULL) AND (a >= 0) AND (a < 100))
Partition key: RANGE (a)
Partitions: rt150 FOR VALUES FROM (150) TO (151)

bugtest=# \d+ rt150
                                   Table "public.rt150"
 Column |  Type   | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
 a      | integer |           |          |         | plain   |              |
Partition of: rt_sub1 FOR VALUES FROM (150) TO (151)
Partition constraint: ((a IS NOT NULL) AND (a >= 0) AND (a < 100) AND (a IS NOT NULL) AND (a >= 150) AND (a < 151))


No rows are routed to rt150 through rt, nor can we insert any rows into rt150
directly, because of its constraints. If we added a check for contradictory
constraints, it would prevent us from accidentally creating a useless table
like rt150 above, which can never contain any rows.
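
For example, both ways of inserting such a row fail (the exact error wording
below is from my environment and may differ):

INSERT INTO rt VALUES (150);
-- ERROR:  no partition of relation "rt_sub2" found for row

INSERT INTO rt150 VALUES (150);
-- ERROR:  new row for relation "rt150" violates partition constraint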

I thought there might be a discussion or documentation about this, but I
couldn't find any. If there is, please tell me.

Thanks,

--
Yoshikazu Imai





RE: Small performance tweak to run-time partition pruning

2018-10-08 Thread Imai, Yoshikazu
On Tue, Oct 9, 2018 at 2:02 AM, David Rowley wrote:
> Yeah, so subplan_map is just an array that stores the List index of
> the Append or MergeAppend's subplans. When we perform run-time pruning
> during executor initialisation, if we prune away some of these
> subplans then we don't initialise those pruned subplans at all which
> results in missing executor nodes for those plans. Instead of
> maintaining an array to store these with a bunch of NULLs in them to
> represent the pruned subnodes, for performance reasons, we make a
> gapless array and store them in there. This means that for the
> run-time pruning that we could do running actual execution
> (ExecFindMatchingSubPlans), the old subplan_map would be out of date,
> as it would contain the indexes of the subplans as if we'd not pruned
> any.  We can simply not bother adjusting the subplan_map if no further
> pruning is required. This could leave the map pointing to subplans
> that don't exist, but we only need to care about that when the map is
> actually going to be used for something. The good news is, we know in
> advance if the map will be used again.

Thanks for the detailed explanation! I fully got it.

> > I saw the patch and it seems good to me about the codes.
> > I still couldn't check additional test cases in the patch.
> >
> >
> > BTW, when I looking the codes, I thought there is further improvements in
> > ExecInitAppend which is described below.
> >
> > j = i = 0;
> > firstvalid = nplans;
> > foreach(lc, node->appendplans)
> > {
> > if (bms_is_member(i, validsubplans))
> > {
> > Plan   *initNode = (Plan *) lfirst(lc);
> >
> > /*
> >  * Record the lowest appendplans index which is a 
> > valid partial
> >  * plan.
> >  */
> > if (i >= node->first_partial_plan && j < firstvalid)
> > firstvalid = j;
> >
> > appendplanstates[j++] = ExecInitNode(initNode, 
> > estate, eflags);
> > }
> > i++;
> > }
> >
> > If number of valid subplans is few compared to node->appendplans, it is a 
> > waste to check
> > bms_is_member(i, validsubplans) for all node->appendplans.
> > Of course, node->appendplans is List struct and we have to loop it to find 
> > valid plan,
> > that we can't simply use while(bms_next_member(i, validsubplans)) loop.
> > I don't have any good idea for it now, but can we improve it?
> 
> 
> 
> I do have other ideas for that but not ready to share code yet, but it
> basically requires reimplementing List to use arrays as their
> underlying implementation. This will allow the bms_next_member() type
> loop and list_nth() will be O(1) instead of O(N).

Great!
So there might be little gain memory-wise, but we get a large performance
improvement.

> It might be possible to squeeze a bit more performance out of this
> code with the current List implementation, but I it might actually
> slow performance in some cases, for example, if the only surviving
> partition was one of the last ones in the List.  Getting the final
> element with list_nth() is optimized, but if you consider a
> time-based, say monthly, RANGE partition, a DBA might maintain a
> partition ready for the next month of data, so it might be very common
> to access the 2nd last element in the list for all queries looking at
> "this months" data. In that case, list_nth(), in its current form, is
> as slow as can be.  

I understand that accessing the 2nd-last element is slow in the current List
implementation, but I don't know whether your new implementation is also slow
in that case.
I may have misunderstood the "squeeze ~ out of ~ with ~" sentence.


> You'd also need to do a bms_num_members() or
> bms_get_singleton_member(), in order to decide if the alternative
> method is going to be any faster. That call is not going to be free.

Yeah, I would need to use bms_num_members() or bms_get_singleton_member() to
check that the number of valid subplans is sufficiently smaller than the number
of all append plans, to decide whether the alternative method is faster.

> It might also be possible to form the loop so that it calls
> bms_next_member() then store the result and loop until we reach that
> number. That would only save the bms_is_member call per loop, but I'd
> much rather do the array idea as it should also speed up plenty of
> other things.

Actually, it is possible to refactor that loop with the method you described,
but it would be complex and hacky code only to save the wasted loop iterations.

So, I also like your array idea.

--
Yoshikazu Imai



RE: Small performance tweak to run-time partition pruning

2018-10-08 Thread Imai, Yoshikazu
Hi, David.
Thanks for the patch!

On Mon, Oct 8, 2018 at 1:00 AM, David Rowley wrote:
> I was looking at this again and I realised that we can completely skip
> the re-sequence of the subplan map when we're not going to perform any
> further pruning during execution.

I checked the code and I think so too.

To confirm my understanding, and for others' information:

The subplan map is used when we call ExecFindInitialMatchingSubPlans or
ExecFindMatchingSubPlans.
ExecFindInitialMatchingSubPlans is called by ExecInit(Merge)Append.
ExecFindMatchingSubPlans is called by the four functions below, which run after
ExecInit(Merge)Append, and only when as_valid_subplans (or ms_valid_subplans)
is NULL.

1. choose_next_subplan_locally(AppendState *node)
2. choose_next_subplan_for_leader(AppendState *node)
3. choose_next_subplan_for_worker(AppendState *node)
4. ExecMergeAppend(PlanState *pstate)

as_valid_subplans (or ms_valid_subplans) is set in ExecInit(Merge)Append if no
pruning during execution is required.

That's why we can completely skip the re-sequence of the subplan map 
when we're not going to perform any further pruning during execution.
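
As a concrete case where all pruning finishes at executor initialisation (so
the re-sequence can be skipped), I use a sketch like the following; the table
names are made up, and "Subplans Removed: 1" in the EXPLAIN output shows that
init pruning already happened and no exec-time pruning remains:

CREATE TABLE p (a int) PARTITION BY LIST (a);
CREATE TABLE p1 PARTITION OF p FOR VALUES IN (1);
CREATE TABLE p2 PARTITION OF p FOR VALUES IN (2);
SET plan_cache_mode = 'force_generic_plan';
PREPARE q(int) AS SELECT * FROM p WHERE a = $1;
-- the generic plan keeps both subplans, but executor startup prunes one
EXPLAIN (COSTS OFF) EXECUTE q(1);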


> I've attached an updated patch which skips the re-sequence work when
> doing that is not required for anything.

I looked at the patch and the code seems good to me.
I still haven't checked the additional test cases in the patch.


BTW, while looking at the code, I thought of a further possible improvement in
ExecInitAppend, described below.

j = i = 0;
firstvalid = nplans;
foreach(lc, node->appendplans)
{
if (bms_is_member(i, validsubplans))
{
Plan   *initNode = (Plan *) lfirst(lc);

/*
 * Record the lowest appendplans index which is a valid partial
 * plan.
 */
if (i >= node->first_partial_plan && j < firstvalid)
firstvalid = j;

appendplanstates[j++] = ExecInitNode(initNode, estate, eflags);
}
i++;
}

If the number of valid subplans is small compared to node->appendplans, it is a
waste to check bms_is_member(i, validsubplans) for all of node->appendplans.
Of course, node->appendplans is a List, and we would have to loop over it to
find each valid plan, so we can't simply use a while (bms_next_member(i,
validsubplans)) loop.
I don't have any good idea for it now, but can we improve it?


--
Yoshikazu Imai



RE: speeding up planning with partitions

2018-10-04 Thread Imai, Yoshikazu
Hi, Amit!

On Thu, Sept 13, 2018 at 10:29 PM, Amit Langote wrote:
> Attached is what I have at the moment.

I also did a code review of the patch.
I could only see v3-0001.patch so far, so everything below is about v3-0001.patch.

I am new to the inheritance/partitioning code and to code review, so my review
below might contain some mistakes. If it does, please point them out kindly :)


v3-0001:
1. Is there any reason inheritance_make_rel_from_joinlist returns the "parent"
that is passed to it? I think the caller can refer to parent even if
inheritance_make_rel_from_joinlist returns nothing.

+static RelOptInfo *
+inheritance_make_rel_from_joinlist(PlannerInfo *root,
+  RelOptInfo *parent,
+  List *joinlist)
+{
...
+   return parent;
+}

2. Is there a possible case where IS_DUMMY_REL(child_joinrel) is true in
inheritance_adjust_scanjoin_target()?

+inheritance_adjust_scanjoin_target(PlannerInfo *root,
...
+{
...
+   foreach(lc, root->inh_target_child_rels)
+   {
...
+   /*
+* If the child was pruned, its corresponding joinrel will be 
empty,
+* which we can ignore safely.
+*/
+   if (IS_DUMMY_REL(child_joinrel))
+   continue;

I think inheritance_make_rel_from_joinlist() doesn't make a dummy joinrel for
the child that was pruned.

+inheritance_make_rel_from_joinlist(PlannerInfo *root,
...
+{
...
+   foreach(lc, root->append_rel_list)
+   {
+   RelOptInfo *childrel;
...
+   /* Ignore excluded/pruned children. */
+   if (IS_DUMMY_REL(childrel))
+   continue;
...
+   /*
+* Save child joinrel to be processed later in
+* inheritance_adjust_scanjoin_target, which adjusts its paths to
+* be able to emit expressions in query's top-level target list.
+*/
+   root->inh_target_child_rels = lappend(root->inh_target_child_rels,
+     childrel);
+   }
+}

3.
Is the "root->parse->commandType != CMD_INSERT" required in:

@@ -2018,13 +1514,45 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
...
+   /*
+* For UPDATE/DELETE on an inheritance parent, we must generate and
+* apply scanjoin target based on targetlist computed using each
+* of the child relations.
+*/
+   if (parse->commandType != CMD_INSERT &&
+   current_rel->reloptkind == RELOPT_BASEREL &&
+   current_rel->relid == root->parse->resultRelation &&
+   root->simple_rte_array[current_rel->relid]->inh)
...

@@ -2137,92 +1665,123 @@ grouping_planner(PlannerInfo *root, bool inheritance_update,
final_rel->fdwroutine = current_rel->fdwroutine;
 
...
-   foreach(lc, current_rel->pathlist)
+   if (current_rel->reloptkind == RELOPT_BASEREL &&
+   current_rel->relid == root->parse->resultRelation &&
+   !IS_DUMMY_REL(current_rel) &&
+   root->simple_rte_array[current_rel->relid]->inh &&
+   parse->commandType != CMD_INSERT)


I think if the condition were "current_rel->relid ==
root->parse->resultRelation && parse->commandType != CMD_INSERT" in the above
if clause, elog() would be called in query_planner and the planner wouldn't
reach the above if clause.
Of course, there is the case where query_planner returns a dummy joinrel and
elog is not called, but in that case current_rel->relid may be 0 (current_rel
is a dummy joinrel) while root->parse->resultRelation may be non-zero (the
query is an INSERT).

4. Can't we use a defined value (IS_PARTITIONED, or something like
IS_INHERITANCE?) to identify inheritance and partitioned tables in the code
below? It was a little confusing to me which code is related to
inheritance/partitioned tables when reading the code.

In make_one_rel():
+   if (root->parse->resultRelation &&
+   root->simple_rte_array[root->parse->resultRelation]->inh)
+   {
...
+   rel = inheritance_make_rel_from_joinlist(root, targetrel, 
joinlist);

In inheritance_make_rel_from_joinlist():
+   if (childrel->part_scheme != NULL)
+   childrel =
+   inheritance_make_rel_from_joinlist(root, 
childrel,
+   
   translated_joinlist);


I can't review inheritance_adjust_scanjoin_target() deeply, because it is
difficult for me to fully understand the code related to join processing.

--
Yoshikazu Imai



RE: Locking B-tree leafs immediately in exclusive mode

2018-07-26 Thread Imai, Yoshikazu
On Wed, July 25, 2018 at 0:30 AM, Alexander Korotkov wrote:
> > The remaining task in my review is running the regression tests.
> 
> Cool, thank you for review!
> 

I did "make check-world" in patched version and all tests were passed 
successfully!


Yoshikazu Imai





RE: Locking B-tree leafs immediately in exclusive mode

2018-07-26 Thread Imai, Yoshikazu
Hi.
On Wed, July 25, 2018 at 0:28 AM, Alexander Korotkov wrote:
> Hi!
> 
> On Tue, Jul 10, 2018 at 4:39 PM 今井 良一  wrote:
> > On 2018/07/10 20:36, Alexander Korotkov wrote:
> > > Thank you for the experiments!  It seems that there is real regression
> > > here...  BTW, which script were you using in this benchmark:
> > > script_unordered.sql or script_duplicated.sql?
> >
> > Sorry, I forgot to write that. I used script_unordered.sql.
> 
> I've reread the thread.  And I found that in my initial letter [1] I
> forget to include index definition.
> CREATE INDEX unordered_i_idx ON unordered (i);
> 
> So, when you did experiments [2], did you define any index on
> unordered table?  If not, then it's curious why there is any
> performance difference at all.
> 
> 1. 
> https://www.postgresql.org/message-id/CAPpHfduAMDFMNYTCN7VMBsFg_hsf0GqiqXnt%2BbSeaJworwFoig%40mail.gmail.com
> 2. 
> https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A51189451%40g01jpexmbkw24

My experiments [1][2] were totally wrong in that I didn't create an index in
the unordered and duplicated cases.

On Mon, July 9, 2018 at 3:19 AM, I wrote:
> # script_duplicated.sql
> INSERT INTO unordered VALUES (1, 'abcdefghijklmnoprsqtuvwxyz');
I also made a mistake in duplicated.sql: I used the unordered table in the
duplicated experiment...

I re-ran the experiments.
This time, I created indexes on the unordered and duplicated tables, and
confirmed the indexes were created on both master and patched using \d
commands in psql.
I used AWS c5.18xlarge, and executed pgbench with 36 clients.

# results with indexes created
master,  ordered,Ave 252796 TPS (253122,252469) 
patched, ordered,Ave 314541 TPS (314011,315070)
master,  unordered,  Ave 409591 TPS (413901,412010,404912,404607,416494,405619) 
patched, unordered,  Ave 412765 TPS (409626,408578,416900,415955)
master,  duplicated, Ave 44811 TPS (45139,44482) 
patched, duplicated, Ave 202325 TPS (201519,203130)

The TPS of "master, ordered" and "patched, ordered" is similar to the TPS of 
"master, key1 with 1 values"
in previous experiment[3] (contention is concentrated), the TPS of "master, 
unordered" and "patched, unordered"
is similar to the TPS of "master, key1 with 100 values" in previous 
experiment[3] (contention is dispersed).
Therefore this experiment and previous experiment[3] asked for from Simon is 
correct I think.

The TPS of "patched, duplicated" was almost 4.5 times bigger than "master, 
duplicated" one.


About the small performance regression in the previous experiments[1], I think
it is because of the machine's performance. I don't have a rational reason or
evidence, so I only describe what I thought and did (TL;DR:).

In these experiments, in order to increase the machine's performance, I did
something like a "machine warm-up" with the batch below.

## DDL
CREATE UNLOGGED TABLE improve_performance (i integer not null, value text not null);
CREATE INDEX improve_i_idx  on improve_performance (i);

## improve_performance.sql
\set i random(1, 100)
INSERT INTO improve_performance VALUES (:i, 'abcdefghijklmnopqrstuvwxyz');

## test batch
port=$1
client=$2
j=$3
tablenames_all=(ordered unordered duplicated ordered2)
tablenames=(unordered)

function cleaning() {
  for tablename_ in ${tablenames_all[@]}; do
psql -p ${port} -c "truncate table ${tablename_};"
  done

  psql -p ${port} -c "vacuum full analyze;"
}


pgbench -p ${port} -T 60 -P 1 -M prepared -f improve_performance.sql -c 72 -j 36 postgres # run for warm-up
cleaning

for tablename in ${tablenames[@]}; do
  for i in `seq 0 1`; do
    pgbench -p ${port} -T 60 -P 1 -M prepared -f script_${tablename}.sql -c ${client} -j ${j} postgres
cleaning
  done
done


When I created an index on the improve_performance table and executed the test
batch against the unordered table in the patched version three times, the
results were as below.
1. 379644,381409
2. 391089,391741
3. 383141,367851

These results are slightly lower than "patched, key1 with 100 values: 405115"
(unordered) in my previous experiments[3].

I thought that if I run the warm-up pgbench against a table with an index
created, the warm-up may not be completely done, because in experiments[3] I
executed the warm-up pgbench against a table without an index.

So I dropped the index from the improve_performance table and executed the test
batch against the patched version twice; the results are below.
1. 409626,408578,
2. 416900,415955,

Even within the same test, performance differs largely, so a small regression
can be ignored, I think.
Of course, I do wonder why there were regressions in all cases in the
experiments[2].


[1] 
https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A51186E9D%40g01jpexmbkw24
[2] 
https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A51189451@g01jpexmbkw24
[3] 
https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A5118C7C3%40g01jpexmbkw24


Yoshikazu Imai




RE: Locking B-tree leafs immediately in exclusive mode

2018-07-24 Thread Imai, Yoshikazu

On Mon, July 9, 2018 at 3:19 AM, Imai, Yoshikazu wrote:
> I'm planning to do code review and send it in the next mail.

Sorry for the delay in the code review.

I did the code review, and I think there is nothing logically wrong with the
B-Tree handling.

I tested the integrity of the B-Tree with amcheck just in case.
I executed this test on a 16-core machine with 64GB of memory.

I ran B-Tree insertion and deletion simultaneously and executed
'bt_index_parent_check' periodically.
I also ran B-Tree insertion and update simultaneously and executed
'bt_index_parent_check' periodically.

##  DDL
create table mytab (val1 int, val2 int);
create index mytabidx on mytab (val1, val2);

## script_integrity_check_insert.sql
\set i random(1, 100)
\set j random(1, 100)
insert into mytab values (:i, :j);

## script_integrity_check_delete.sql
\set i random(1, 100)
\set j random(1, 100)
delete from mytab where val1 = :i and val2 = :j

## script_integrity_check_update.sql
\set i random(1, 100)
\set j random(1, 100)
\set k random(1, 100)
\set l random(1, 100)
update mytab set val1 = :k, val2 = :l where val1 = :i and val2 = :j

## commands(insert, delete, simultaneously executed)
pgbench -T 60 -P 1 -M prepared -f script_integrity_check_insert.sql -c 2 postgres
pgbench -T 60 -P 1 -M prepared -f script_integrity_check_delete.sql -c 2 postgres

## commands(insert, update, simultaneously executed)
pgbench -T 60 -P 1 -M prepared -f script_integrity_check_insert.sql -c 2 postgres
pgbench -T 60 -P 1 -M prepared -f script_integrity_check_update.sql -c 2 postgres

## commands(executed during above insert, delete, update)
(psql) select bt_index_parent_check('mytabidx');
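
For completeness, this assumes the amcheck extension was installed first (one
extra psql command in my environment):

(psql) create extension if not exists amcheck;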


Finally, I confirmed there are no integrity problems during B-Tree operations.

The remaining task in my review is running the regression tests.


Yoshikazu Imai



RE: Locking B-tree leafs immediately in exclusive mode

2018-07-12 Thread Imai, Yoshikazu
On Mon, July 9, 2018 at 5:25 PM, Simon Riggs wrote:
> Please can you check insertion with the index on 2 keys
> 1st key has 10,000 values
> 2nd key has monotonically increasing value from last 1st key value
> 
> So each session picks one 1st key value
> Then each new INSERTion is a higher value of 2nd key
> so 1,1, then 1,2 then 1,3 etc
> 
> Probably easier to do this with a table like this
> 
> CREATE UNLOGGED TABLE ordered2 (id integer, logdate timestamp default now(), value text not null, primary key (id, logdate));
> 
> # script_ordered2.sql
> \set i random(1, 10000)
> INSERT INTO ordered2 (id, value) VALUES (:i, 'abcdefghijklmnoprsqtuvwxyz');
> 
> Thanks
I tried to do this, but I might have mistaken your intention, so please say so
if I am wrong.

While script_ordered.sql supposes that there is one contention point on the
rightmost leaf node, script_ordered2.sql supposes that there are several
contention points on some leaf nodes; is that right?
I experimented with key1 having 10,000 values, but there was no difference in
the results compared to the unordered.sql ones, so I experimented with key1
having 1, 2, 3, 5, 10, and 100 values.
Also, if I created a primary key, "ERROR:  duplicate key value violates unique
constraint "ordered2_pkey"" happened, so I created a non-unique key.

#DDL
CREATE UNLOGGED TABLE ordered2 (id integer, logdate timestamp default now(), value text not null);
CREATE INDEX ordered2_key ON ordered2 (id, logdate);

# script_ordered2.sql
\set i random(1, 100)  #second value is 1, 2, 3, 5, 10, or 100
INSERT INTO ordered2 (id, value) VALUES (:i, 'abcdefghijklmnoprsqtuvwxyz');

# ordered2 results, key1 having 1, 2, 3, 5, 10, and 100 values
master,  key1 with 1 values:  236428
master,  key1 with 2 values:  292248
master,  key1 with 3 values:  340980
master,  key1 with 5 values:  362808
master,  key1 with 10 values: 379525
master,  key1 with 100 values: 405265

patched, key1 with 1 values:  295862
patched, key1 with 2 values:  339538
patched, key1 with 3 values:  355793
patched, key1 with 5 values:  371332
patched, key1 with 10 values: 387731
patched, key1 with 100 values: 405115


From the attached graph ("some_contention_points_on_leaf_nodes.png"), as the
contention points are dispersed, we can see that TPS increases and the TPS
difference between the master and patched versions becomes smaller.


Yoshikazu Imai


RE: Locking B-tree leafs immediately in exclusive mode

2018-07-10 Thread Imai, Yoshikazu
On Mon, July 9, 2018 at 4:44 PM, Alexander Korotkov wrote:
> > Firstly, I did performance tests on 72-cores machines(AWS c5.18xlarge) same 
> > as you did.
> 
> OK.  But not that c5.18xlarge is 72-VCPU machine, which AFAIK is
> close to the performance of 36 physical cores.

Thanks for pointing that out. I referred to /proc/cpuinfo and confirmed there
are 36 physical cores.

> In this case it also looks like we observed 1% regression.  Despite 1%
> may seem to be very small, I think we should clarify whether it really
> exists.  I have at least two hypothesis about this.
> 
> 1) There is no real regression, observed difference of TPS is less
> than error of measurements.  In order to check that we need to retry
> the experiment multiple times.  Also, if you run benchmark on master
> before patched version (or vice versa) you should also try to swap the
> order to make sure there is no influence of the order of benchmarks.
> 2) If we consider relation between TPS and number of clients, TPS is
> typically growing with increasing number of clients until reach some
> saturation value.  After the saturation value, there is some
> degradation of TPS.  If patch makes some latency lower, that may cause
> saturation to happen earlier.  In order to check that, we need run
> benchmarks with various number of clients and draw a graph: TPS
> depending on clients.
> 
> So, may I ask you to make more experiments in order to clarify the
> observed regression?

For hypothesis 2), I ran the experiment varying the number of clients over 18, 
36, 54, and 72.
While doing this, I noticed that pgbench results with 36 clients improve after 
a run with 72 clients.
I don't know why this happens, but to keep conditions uniform I always executed 
pgbench with 72 clients first, before the other runs (e.g. -c 72, then -c 18, 
-c 36, -c 54, -c 72).
I also ran the master and patched tests in a mixed order (e.g. master, patched, 
patched, master, master, patched, patched, master) to rule out ordering 
effects; a sketch of one measurement cycle follows.
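Each measurement cycle looked roughly like this (a sketch: the script name and
the 60-second duration are assumptions carried over from the earlier commands,
and the first 72-client run is a warm-up whose result I discarded):

# warm-up with 72 clients, then one timed run per client count
pgbench -T 60 -P 1 -M prepared -f script_ordered.sql -c 72 -j 72 postgres  # warm-up
for c in 18 36 54 72; do
    pgbench -T 60 -P 1 -M prepared -f script_ordered.sql -c $c -j $c postgres
done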

# results of changing clients(18, 36, 54, 72 clients)
master, -c 18 -j 18:  Ave 400410 TPS (407615,393942,401845,398241) 
master, -c 36 -j 36:  Ave 415616 TPS (411939,400742,424855,424926)
master, -c 54 -j 54:  Ave 378734 TPS (401646,354084,408044,351163)
master, -c 72 -j 72:  Ave 360864 TPS (367718,360029,366432,349277)
patched, -c 18 -j 18: Ave 393115 TPS (382854,396396,395530,397678)
patched, -c 36 -j 36: Ave 390328 TPS (376100,404873,383498,396840)
patched, -c 54 -j 54: Ave 364894 TPS (365533,373064,354250,366727)
patched, -c 72 -j 72: Ave 353982 TPS (355237,357601,345536,357553)

The saturation point appears to lie between 18 and 36 clients, so I also 
experimented with 27 clients.

# results of changing clients(27 clients)
master, -c 27 -j 27:  Ave 416756 TPS (423512,424241,399241,420030)
patched, -c 27 -j 27: Ave 413568 TPS (410187,404291,420152,419640)

I created a graph and attached it to this mail ("detecting saturation.png").
Looking at the graph, the patched version saturates earlier than master, as 
you expected.
Even so, the patched version's near-saturation TPS still shows a small 
regression compared to master, I think.

Are there any other experiments I should run to clarify this?


Yoshikazu Imai




RE: Locking B-tree leafs immediately in exclusive mode

2018-07-08 Thread Imai, Yoshikazu
On Mon, June 11, 2018 at 4:31 PM, Alexander Korotkov wrote:
> On Mon, Jun 11, 2018 at 1:06 PM Simon Riggs  wrote:
> > On 5 June 2018 at 14:45, Alexander Korotkov 
> > wrote:
> > > Currently _bt_search() always locks leaf buffer in shared mode (aka
> > > BT_READ), while caller can relock it later.  However, I don't see
> > > what prevents
> > > _bt_search()
> > > from locking leaf immediately in exclusive mode (aka BT_WRITE) when
> > > required.
> > > When traversing downlink from non-leaf page of level 1 (lowest level
> > > of non-leaf pages just prior to leaf pages), we know that next page
> > > is going to be leaf.  In this case, we can immediately lock next page
> > > in exclusive mode if required.
> > > For sure, it might happen that we didn't manage to exclusively lock
> > > leaf in this way when _bt_getroot() points us to leaf page.  But
> > > this case could be handled in _bt_search() by relock.  Please, find
> > > implementation of this approach in the attached patch.
> >
> > It's a good idea. How does it perform with many duplicate entries?
> 
> Our B-tree is currently maintaining duplicates unordered.  So, during
> insertion we can traverse rightlinks in order to find page, which would
> fit new index tuple.
> However, in this case we're traversing pages in exclusive lock mode, and
> that happens already after re-lock.  _bt_search() doesn't do anything
> with that.
> So, I assume there shouldn't be any degradation in the case of many
> duplicate entries.
> 
> But this case definitely worth testing, and I'm planning to do it.
> 


Hi, I'm reviewing this.

First, I ran the same performance tests you did, on a 72-core machine (AWS 
c5.18xlarge).

> # postgresql.conf
> max_connections = 300
> shared_buffers = 32GB
> fsync = off
> synchronous_commit = off
> 
> 
> 
> #  DDL:
> CREATE UNLOGGED TABLE ordered (id serial primary key, value text not null);
> CREATE UNLOGGED TABLE unordered (i integer not null, value text not null);
> 
> 
> 
> # script_ordered.sql
> INSERT INTO ordered (value) VALUES ('abcdefghijklmnoprsqtuvwxyz');
> 
> 
> 
> # script_unordered.sql
> \set i random(1, 100)
> INSERT INTO unordered VALUES (:i, 'abcdefghijklmnoprsqtuvwxyz');
> 
> 
> 
> # commands
> pgbench -T 60 -P 1 -M prepared -f script_ordered.sql -c 150 -j 150 postgres
> pgbench -T 60 -P 1 -M prepared -f script_unordered.sql -c 150 -j 150 postgres
> 
> 
> 
> # results
> ordered, master: 157473 TPS
> ordered, patched 231374 TPS
> unordered, master: 232372 TPS
> unordered, patched: 232535 TPS

# my results
ordered, master:     186045 TPS
ordered, patched:    265842 TPS
unordered, master:   313011 TPS
unordered, patched:  309636 TPS

I confirmed that this patch improves ordered insertion.


In addition to your tests, I ran the following test with many duplicate entries:

#  DDL
CREATE UNLOGGED TABLE duplicated (i integer not null, value text not null);

# script_duplicated.sql
INSERT INTO duplicated VALUES (1, 'abcdefghijklmnoprsqtuvwxyz');

# commands
pgbench -T 60 -P 1 -M prepared -f script_duplicated.sql -c 150 -j 150 postgres

# results
duplicated, master:  315450 TPS
duplicated, patched: 311584 TPS

It seems there is almost no performance degradation in the case of many 
duplicated entries.


I plan to review the code and send my comments in the next mail.

Yoshikazu Imai




Re:Re: some question about _bt_getbuf

2018-06-26 Thread Imai, Yoshikazu
Hi,

> At 2018-05-15 01:49:41, "Tom Lane"  wrote:
> >自己  writes:
> >> I ran a test using PG 10.0 on my machine, and the program crashed in 
> >> _bt_getbuf.
> >> I found the following code: the routine _bt_page_recyclable says the page 
> >> may be an all-zero page; if so, the code then runs 
> >> (BTPageOpaque) PageGetSpecialPointer(page);
> >> which fails because it accesses invalid memory.
> >> I don't know whether that is so. Looking forward to your reply, thanks.
> >
> >This code's clearly broken, as was discussed before:
> >
> >https://www.postgresql.org/message-id/flat/2628.1474272158%40localhost
> >
> >but nothing was done about it, perhaps partly because we didn't have a
> >reproducible test case.  Do you have one?
> >
> > regards, tom lane
> 
> Unfortunately, I don't have a complete test case.

I recently looked into this code and the previous discussion, and tried to 
cause the crash.
I will describe how to cause the crash at the end of this mail, though I am 
not sure how useful it is: I used gdb to force the crash, so it is not really 
a reproducible test case.

As was discussed before, this crash happens when recycling an all-zeroes page 
in an index.
As the code comment below explains, an all-zeroes page is created when a 
backend goes down during a page split, after extending the index's relation by 
one page but before writing the WAL entry for that page.

bool
_bt_page_recyclable(Page page)
{
	BTPageOpaque opaque;

	/*
	 * It's possible to find an all-zeroes page in an index --- for example,
	 * a backend might successfully extend the relation one page and then
	 * crash before it is able to make a WAL entry for adding the page.  If
	 * we find a zeroed page then reclaim it.
	 */
	if (PageIsNew(page))
		return true;
	...
}

If the backend goes down at that point, the newly extended page is left 
uninitialized: crash recovery does nothing for it because no WAL entry for the 
new page exists, and the page becomes recyclable once VACUUM runs.
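In an --enable-cassert build the crash is then the sanity check inside
PageGetSpecialPointer(): an all-zeroes page has pd_special == 0, which is
below SizeOfPageHeaderData. Paraphrased as a sketch (my illustration, not the
exact bufpage.h source):

	/* on an all-zeroes page pd_special is 0, so this assertion fires */
	Assert(((PageHeader) page)->pd_special >= SizeOfPageHeaderData);

So any code path that gets "true" from _bt_page_recyclable() for a new page
and then touches the page's special space can fail here.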


Considering the above conditions, I reproduced the crash as described below.
I tested current master (11beta1), compiled with --enable-cassert and 
--enable-debug, with hot standby enabled.
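For reference, the build was the usual sequence (the prefix path is mine):

[bash] ./configure --prefix=$HOME/pgsql-master --enable-cassert --enable-debug
[bash] make -j4 && make install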

<Preparing an all-zeroes page>
(psql) CREATE TABLE mytab (id int, val int);
(psql) CREATE INDEX idx_val ON mytab(val);
(gdb) b nbtinsert.c:1467   (at XLogBeginInsert(); in _bt_split())
(gdb) c
while(breakpoint is not hit){
(psql) INSERT INTO mytab SELECT t, t FROM generate_series(1, 3000) t;
}
[bash] kill -s SIGKILL (backend pid)
(psql) VACUUM;
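As an optional sanity check before the VACUUM, pageinspect can confirm the
page really is all zeroes (this step is my addition, and the block number 42
is a placeholder that has to be found by inspection):

(psql) CREATE EXTENSION IF NOT EXISTS pageinspect;
(psql) SELECT * FROM page_header(get_raw_page('idx_val', 42));  -- expect all-zero fields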

<Causing the crash>
while(crash has not occurred){
(psql) INSERT INTO mytab SELECT t, t FROM generate_series(1, 3000) t;
}


Yoshikazu Imai