RE: How to make partitioning scale better for larger numbers of partitions

2018-07-16 Thread Kato, Sho
On 2018/07/16 13:16, Tsunakawa-san wrote:
>Thanks.  And what does the comparison look like between the unpartitioned case
>and various partition counts?  What are the performance characteristics in terms
>of latency and partition count?  I thought that's what you tried to reveal first?

For the unpartitioned table, the latency of SELECT and UPDATE statements is close
to O(n), where n is the number of records.
The latency of INSERT statements is close to O(1).

For the partitioned table, up to 400 partitions, the latency of SELECT and INSERT
statements is close to O(log n), where n is the number of partitions.
Between 400 and 6400 partitions, it is close to O(n).
Up to 400 partitions, UPDATE latency is close to O(n).
Between 400 and 6400 partitions, UPDATE latency seems to be O(n^2).
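
As a rough check against the numbers below: going from 3200 to 6400 partitions
(2x) takes UPDATE latency from 136.136 ms to 640.879 ms (about 4.7x), which is
consistent with a roughly quadratic trend, while SELECT latency goes from
4.21 ms to 7.631 ms (about 1.8x), i.e. roughly linear.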

Details are as follows.

unpartitioned table result (prepared mode)
------------------------------------------

 scale | latency_avg |   tps_ex    | update_latency | select_latency | insert_latency
-------+-------------+-------------+----------------+----------------+----------------
   100 |        0.24 | 4174.395738 |          0.056 |          0.051 |           0.04
   200 |       0.258 | 3873.099014 |          0.065 |          0.059 |           0.04
   400 |        0.29 | 3453.171112 |          0.081 |          0.072 |          0.041
   800 |       0.355 | 2814.936942 |          0.112 |          0.105 |          0.041
  1600 |       0.493 | 2027.702689 |           0.18 |          0.174 |          0.042
  3200 |       0.761 | 1313.532458 |          0.314 |          0.307 |          0.043
  6400 |       1.214 |  824.001431 |           0.54 |          0.531 |          0.043


partitioned table result (prepared mode)
----------------------------------------

 num_part | latency_avg |   tps_ex    | update_latency | select_latency | insert_latency
----------+-------------+-------------+----------------+----------------+----------------
      100 |       0.937 | 1067.473258 |          0.557 |          0.087 |          0.123
      200 |        1.65 |  606.244552 |          1.115 |          0.121 |          0.188
      400 |       3.295 |  303.491681 |          2.446 |           0.19 |          0.322
      800 |       8.102 |  123.422456 |          6.553 |          0.337 |          0.637
     1600 |      36.996 |   27.030027 |         31.528 |          1.621 |          2.455
     3200 |     147.998 |    6.756922 |        136.136 |           4.21 |           4.94
     6400 |     666.947 |    1.499383 |        640.879 |          7.631 |          9.642
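
The partitioned tables behind these numbers can be generated with something like
the following sketch (assumed schema and partition boundaries; the actual DDL
used for the benchmark is not shown here):

-- Range-partition test.accounts into num_part partitions, one aid value per
-- partition, matching the "one record per leaf partition" setup.
CREATE SCHEMA IF NOT EXISTS test;
CREATE TABLE test.accounts (aid int, abalance int) PARTITION BY RANGE (aid);

DO $$
DECLARE
  num_part int := 400;  -- vary this for each row of the table above
BEGIN
  FOR i IN 1..num_part LOOP
    EXECUTE format(
      'CREATE TABLE test.accounts_%s PARTITION OF test.accounts
         FOR VALUES FROM (%s) TO (%s)', i, i, i + 1);
    EXECUTE format('INSERT INTO test.accounts VALUES (%s, 0)', i);
  END LOOP;
END $$;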

regards,
-Original Message-
From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com] 
Sent: Monday, July 16, 2018 1:16 PM
To: Kato, Sho/加藤 翔 
Cc: 'Amit Langote' ; PostgreSQL mailing lists 

Subject: RE: How to make partitioning scale better for larger numbers of 
partitions

From: Kato, Sho [mailto:kato-...@jp.fujitsu.com]
> I did pgbench -M prepared and perf record.
> 
> UPDATE latency in prepared mode is 95% shorter than in simple mode.
> SELECT latency in prepared mode is 54% shorter than in simple mode.
> INSERT latency in prepared mode is 8% shorter than in simple mode.

Thanks.  And what does the comparison look like between the unpartitioned case
and various partition counts?  What are the performance characteristics in terms
of latency and partition count?  I thought that's what you tried to reveal first?

Regards
Takayuki Tsunakawa








Re: How to make partitioning scale better for larger numbers of partitions

2018-07-16 Thread Amit Langote
On 2018/07/17 12:14, Ashutosh Bapat wrote:
> On Tue, Jul 17, 2018 at 8:31 AM, Kato, Sho  wrote:
>> On 2018/07/17 10:49, Amit Langote wrote:
>>> Perhaps, Kato-san only intended to report that the time that planner spends 
>>> for a partitioned table with 1100 partitions is just too high compared to 
>>> the time it spends on a non-partitioned table.
>>
>> Yes, it is included for the purposes of this comparison.
>>
>> The purpose of this comparison is to find where the partitioning bottleneck is.
>> Using the bottleneck as a hint for improvement, I'd like to bring the
>> performance of a partitioned table as close as possible to that of an
>> unpartitioned table.
>>
> 
> That's a good thing, but may not turn out to be realistic. We should
> compare performance where partitioning matters and try to improve in
> those contexts. Else we might improve performance in scenarios which
> are never used.
> 
> In this case, even if we improve the planning time by 100%, it hardly
> matters, since planning time is negligible compared to the execution
> time on the huge data sets where partitioning is actually useful.

While I agree that it's a good idea to tell users to use partitioning only
if the overhead of having the partitioning in the first place is bearable,
especially the planner overhead, this benchmark shows us that even for
what I assume might be fairly commonly occurring queries (select .. from /
update .. / delete from partitioned_table where partkey = ?), the planner
spends way too many redundant cycles.  Some amount of that overhead will
always remain, and planning with partitioning will always be a bit slower
than without partitioning, but it's *too* slow right now for a non-trivial
number of partitions.
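
For instance, with this benchmark's schema (test.accounts, assuming aid is the
partition key), the planner-side cost of such single-partition statements shows
up directly in the planning time that EXPLAIN ANALYZE reports (just a sketch;
exact numbers will of course vary):

EXPLAIN ANALYZE SELECT abalance FROM test.accounts WHERE aid = 42;
EXPLAIN ANALYZE UPDATE test.accounts SET abalance = abalance + 1 WHERE aid = 42;
EXPLAIN ANALYZE DELETE FROM test.accounts WHERE aid = 42;
-- The reported planning time grows with the number of partitions for all three
-- statements, even though only one partition is ultimately touched.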

Thanks,
Amit




Re: How to make partitioning scale better for larger numbers of partitions

2018-07-16 Thread Ashutosh Bapat
On Tue, Jul 17, 2018 at 8:31 AM, Kato, Sho  wrote:
> On 2018/07/17 10:49, Amit Langote wrote:
>>Perhaps, Kato-san only intended to report that the time that planner spends 
>>for a partitioned table with 1100 partitions is just too high compared to the 
>>time it spends on a non-partitioned table.
>
> Yes, it is included for the purposes of this comparison.
>
> The purpose of this comparison is to find where the partitioning bottleneck is.
> Using the bottleneck as a hint for improvement, I'd like to bring the
> performance of a partitioned table as close as possible to that of an
> unpartitioned table.
>

That's a good thing, but may not turn out to be realistic. We should
compare performance where partitioning matters and try to improve in
those contexts. Else we might improve performance in scenarios which
are never used.

In this case, even if we improve the planning time by 100%, it hardly
matters, since planning time is negligible compared to the execution
time on the huge data sets where partitioning is actually useful.

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



RE: How to make partitioning scale better for larger numbers of partitions

2018-07-16 Thread Kato, Sho
On 2018/07/17 10:49, Amit Langote wrote:
>Perhaps, Kato-san only intended to report that the time that planner spends 
>for a partitioned table with 1100 partitions is just too high compared to the 
>time it spends on a non-partitioned table.

Yes, it is included for the purposes of this comparison.

The purpose of this comparison is to find where the partitioning bottleneck is.
Using the bottleneck as a hint for improvement, I'd like to bring the
performance of a partitioned table as close as possible to that of an
unpartitioned table.

-Original Message-
From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp] 
Sent: Tuesday, July 17, 2018 10:49 AM
To: Ashutosh Bapat ; Kato, Sho/加藤 翔 

Cc: Justin Pryzby ; PostgreSQL mailing lists 

Subject: Re: How to make partitioning scale better for larger numbers of 
partitions

On 2018/07/13 22:10, Ashutosh Bapat wrote:
> On Fri, Jul 13, 2018 at 9:23 AM, Kato, Sho  wrote:
>>> I wondered if you compared to PG10 or to inheritance-partitioning (parent
>>> with relkind='r' and either trigger or rule or INSERT/UPDATE directly into
>>> child)?
>>
>> Thank you for your reply.
>>
>> I compared to PG11beta2 with a non-partitioned table.
>>
>> The non-partitioned table has 1100 records in one table.
>> The partitioned table has one record in each leaf partition.
>>
> 
> I don't think partitioning should be employed this way even for the
> sake of comparison. Depending upon the size of each tuple, 1100 tuples
> inserted into a single table will probably occupy a few hundred pages.
> In a partitioned table with one tuple per partition they will occupy at
> least 1100 pages. There are also other space and locking overheads to
> maintain 1100 tables. I think the right way to compare is to have really
> large data which really requires 1100 partitions and then compare
> performance by putting that data in 1100 partitions and in an
> unpartitioned table. Even with that kind of data, you will see some
> difference in performance, but it won't be as dramatic as you report.
> 
> I might be missing something though.

Perhaps, Kato-san only intended to report that the time that the planner spends
for a partitioned table with 1100 partitions is just too high compared to the
time it spends on a non-partitioned table.  It is and has been clear that that's
because the planning time explodes as the number of partitions increases.

If there's lots of data in it, then the result will look completely different,
as you say, because scanning a single partition (of the 1100 total) will take
less time than scanning a non-partitioned table containing 1100 partitions'
worth of data.  But the planning time would still be higher for the partitioned
table, which seems to be the point of this benchmark.

Thanks,
Amit





Re: How to make partitioning scale better for larger numbers of partitions

2018-07-16 Thread Amit Langote
On 2018/07/13 22:10, Ashutosh Bapat wrote:
> On Fri, Jul 13, 2018 at 9:23 AM, Kato, Sho  wrote:
>>> I wondered if you compared to PG10 or to inheritance-partitioning (parent
>>> with relkind='r' and either trigger or rule or INSERT/UPDATE directly into
>>> child)?
>>
>> Thank you for your reply.
>>
>> I compared to PG11beta2 with a non-partitioned table.
>>
>> The non-partitioned table has 1100 records in one table.
>> The partitioned table has one record in each leaf partition.
>>
> 
> I don't think partitioning should be employed this way even for the
> sake of comparison. Depending upon the size of each tuple, 1100 tuples
> inserted into a single table will probably occupy a few hundred pages.
> In a partitioned table with one tuple per partition they will occupy at
> least 1100 pages. There are also other space and locking overheads to
> maintain 1100 tables. I think the right way to compare is to have really
> large data which really requires 1100 partitions and then compare
> performance by putting that data in 1100 partitions and in an
> unpartitioned table. Even with that kind of data, you will see some
> difference in performance, but it won't be as dramatic as you report.
> 
> I might be missing something though.

Perhaps, Kato-san only intended to report that the time that the planner spends
for a partitioned table with 1100 partitions is just too high compared to the
time it spends on a non-partitioned table.  It is and has been clear that that's
because the planning time explodes as the number of partitions increases.

If there's lots of data in it, then the result will look completely different,
as you say, because scanning a single partition (of the 1100 total) will take
less time than scanning a non-partitioned table containing 1100 partitions'
worth of data.  But the planning time would still be higher for the partitioned
table, which seems to be the point of this benchmark.

Thanks,
Amit




RE: How to make partitioning scale better for larger numbers of partitions

2018-07-15 Thread Tsunakawa, Takayuki
From: Kato, Sho [mailto:kato-...@jp.fujitsu.com]
> I did pgbench -M prepared and perf record.
> 
> UPDATE latency in prepared mode is 95% shorter than in simple mode.
> SELECT latency in prepared mode is 54% shorter than in simple mode.
> INSERT latency in prepared mode is 8% shorter than in simple mode.

Thanks.  And what does the comparison look like between the unpartitioned case
and various partition counts?  What are the performance characteristics in terms
of latency and partition count?  I thought that's what you tried to reveal first?

Regards
Takayuki Tsunakawa





Re: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread Ashutosh Bapat
On Fri, Jul 13, 2018 at 9:23 AM, Kato, Sho  wrote:
>>I wondered if you compared to PG10 or to inheritance-partitioning (parent
>>with relkind='r' and either trigger or rule or INSERT/UPDATE directly into
>>child)?
>
> Thank you for your reply.
>
> I compared to PG11beta2 with a non-partitioned table.
>
> The non-partitioned table has 1100 records in one table.
> The partitioned table has one record in each leaf partition.
>

I don't think partitioning should be employed this way even for the
sake of comparison. Depending upon the size of each tuple, 1100 tuples
inserted into a single table will probably occupy a few hundred pages.
In a partitioned table with one tuple per partition they will occupy at
least 1100 pages. There are also other space and locking overheads to
maintain 1100 tables. I think the right way to compare is to have really
large data which really requires 1100 partitions and then compare
performance by putting that data in 1100 partitions and in an
unpartitioned table. Even with that kind of data, you will see some
difference in performance, but it won't be as dramatic as you report.

I might be missing something though.
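
To put a concrete shape on that (assumed table names and sizes, only to
illustrate the kind of comparison I mean):

-- accounts_plain / accounts_part are illustrative names.  Load the same
-- 110M-row data set once unpartitioned and once into 1100 range partitions
-- of 100,000 aids each, then run the identical single-row workload against both.
CREATE TABLE accounts_plain (aid int, abalance int);
INSERT INTO accounts_plain SELECT g, 0 FROM generate_series(1, 110000000) g;

CREATE TABLE accounts_part (aid int, abalance int) PARTITION BY RANGE (aid);
SELECT format('CREATE TABLE accounts_part_%s PARTITION OF accounts_part
                 FOR VALUES FROM (%s) TO (%s)',
              i, (i - 1) * 100000 + 1, i * 100000 + 1)
FROM generate_series(1, 1100) i \gexec
INSERT INTO accounts_part SELECT g, 0 FROM generate_series(1, 110000000) g;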

-- 
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company



RE: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread Kato, Sho
Tsunakawa-san

>Kato-san, could you try pgbench -M prepared?

I did pgbench -M prepared and perf record.

UPDATE latency in prepared mode is 95% shorter than in simple mode. 
SELECT latency in prepared mode is 54% shorter than in simple mode.
INSERT latency in prepared mode is 8% shorter than in simple mode.

In the perf reports, AllocSetAlloc, hash_search_with_hash_value and hash_any
appear for all three statements.

Details are as follows.

pgbench results
--

transaction type: update.sql
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 12295
latency average = 14.641 ms
tps = 68.302806 (including connections establishing)
tps = 68.303430 (excluding connections establishing)
statement latencies in milliseconds:
 0.009  \set aid random(1, 1100 * 1)
 0.004  \set delta random(-5000, 5000)
 0.026  BEGIN;
14.089  UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
 0.513  END;

transaction type: select.sql
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 45145
latency average = 3.996 ms
tps = 250.272094 (including connections establishing)
tps = 250.274404 (excluding connections establishing)
statement latencies in milliseconds:
 0.750  \set aid random(1, 1100 * 1)
 0.714  \set delta random(-5000, 5000)
 0.023  BEGIN;
 2.262  SELECT abalance FROM test.accounts WHERE aid = :aid;
 0.247  END;

transaction type: insert.sql
scaling factor: 1
query mode: prepared
number of clients: 1
number of threads: 1
duration: 180 s
number of transactions actually processed: 34894
latency average = 5.158 ms
tps = 193.855216 (including connections establishing)
tps = 193.857020 (excluding connections establishing)
statement latencies in milliseconds:
 1.007  \set aid random(1, 1100 * 1)
 0.802  \set delta random(-5000, 5000)
 0.025  BEGIN;
 2.963  INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
 0.360  END;
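
For reference, the custom script files behind the numbers above are essentially
the statements listed in each "statement latencies" section; reconstructed, they
look like this (pieced together from the output, so the actual files may differ
in minor details):

-- update.sql
\set aid random(1, 1100 * 1)
\set delta random(-5000, 5000)
BEGIN;
UPDATE test.accounts SET abalance = abalance + :delta WHERE aid = :aid;
END;

-- select.sql
\set aid random(1, 1100 * 1)
\set delta random(-5000, 5000)
BEGIN;
SELECT abalance FROM test.accounts WHERE aid = :aid;
END;

-- insert.sql
\set aid random(1, 1100 * 1)
\set delta random(-5000, 5000)
BEGIN;
INSERT INTO test.accounts_history (aid, delta, mtime) VALUES (:aid, :delta, CURRENT_TIMESTAMP);
END;

-- Each file was run with something like the following (assumed invocation,
-- matching the header fields above):
--   pgbench -M prepared -n -f update.sql -c 1 -j 1 -T 180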

Top 10 of perf report
-
UPDATE:
 11.86%  postgres  postgres   [.] list_nth
 10.23%  postgres  postgres   [.] ExecOpenScanRelation
  7.47%  postgres  postgres   [.] list_member_int
  4.78%  postgres  postgres   [.] AllocSetAlloc
  3.61%  postgres  postgres   [.] palloc0
  3.09%  postgres  postgres   [.] hash_search_with_hash_value
  2.33%  postgres  postgres   [.] ResourceArrayAdd
  1.99%  postgres  postgres   [.] hash_any
  1.83%  postgres  postgres   [.] copyObjectImpl
  1.64%  postgres  postgres   [.] SearchCatCache1

SELECT:
 33.60%  postgres  postgres   [.] hash_search_with_hash_value
 13.02%  postgres  postgres   [.] hash_any
  5.30%  postgres  postgres   [.] LockAcquireExtended
  5.16%  postgres  postgres   [.] LockReleaseAll
  4.00%  postgres  postgres   [.] hash_seq_search
  3.84%  postgres  postgres   [.] LWLockRelease
  3.81%  postgres  postgres   [.] AllocSetAlloc
  3.23%  postgres  postgres   [.] LWLockAcquire
  2.55%  postgres  postgres   [.] SetupLockInTable
  1.90%  postgres  postgres   [.] AcquireExecutorLocks

INSERT:
 21.86%  postgres  postgres   [.] hash_search_with_hash_value
  6.36%  postgres  postgres   [.] hash_any
  4.95%  postgres  postgres   [.] AllocSetAlloc
  4.08%  postgres  postgres   [.] LWLockRelease
  3.83%  postgres  postgres   [.] LWLockAcquire
  3.26%  postgres  postgres   [.] SearchCatCache
  3.15%  postgres  postgres   [.] oid_cmp
  2.78%  postgres  postgres   [.] LockReleaseAll
  2.76%  postgres  postgres   [.] pg_qsort
  2.41%  postgres  postgres   [.] SearchCatCache1


-Original Message-
From: Tsunakawa, Takayuki/綱川 貴之 
Sent: Friday, July 13, 2018 2:49 PM
To: 'Amit Langote' ; Kato, Sho/加藤 翔 
; PostgreSQL mailing lists 

Subject: RE: How to make partitioning scale better for larger numbers of 
partitions

From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
> is pretty significant and gets worse as the number of partitions grows.
> I had intended to fix that in PG 11, but we could only manage to get part
> of that work into PG 11, with significant help from David and others.  So,
> while PG 11's overhead of partitioning during planning is less than PG
> 10's due to improved pruning algorithm, it's still pretty far from ideal,
> because it isn't just the pruning algorithm that had overheads.  In fact,
> PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
> carry the overhead that was in PG 10.

David has submitted multiple patches for PG 12, one of which speeds up pruning
of UPDATE/DELETE (I couldn't find it in the current CF, though.)

Re: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread Amit Langote
On 2018/07/13 16:29, Kato, Sho wrote:
> I also benchmarked PG10.
> Actually, SELECT latency on PG11beta2 + patch1 is lower than on PG10.
> 
> SELECT latency with 800 leaf partitions
> ---------------------------------------
> PG10  5.62 ms
> PG11  3.869 ms
> 
> But even on PG11, a SELECT statement takes 21.102 ms in the benchmark with
> 1600 leaf partitions.
> It takes a long time even though PG11's partition pruning algorithm is a
> binary search.

Yeah, the pruning algorithm change evidently removed only a small percentage
of the overhead.

>> The overheads I mention stem from the fact that for partitioning we still
>> rely on the old planner code that's used to perform inheritance planning,
>> which requires locking and opening *all* partitions.
> 
> I debugged UPDATE statement execution on a partitioned table.
> range_table_mutator seems to process all leaf partitions.

That's one of the symptoms of it, yes.

With the existing code for UPDATE/DELETE planning of partitioned tables,
the whole Query structure is modified to translate the parent's attribute
numbers to partition attribute numbers and planned freshly *for every
partition*.  range_table_mutator() is invoked sometime during that
translation process and itself loops over all partitions, so I'd think it
is susceptible to being prominently visible in perf profiles.

Thanks,
Amit




RE: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread Tsunakawa, Takayuki
From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> The immediate one I think is to refactor the planner such that the new
> pruning code, that we were able to utilize for SELECT in PG 11, can also
> be used for UPDATE/DELETE.  Refactoring needed to replace the pruning
> algorithm was minimal for SELECT, but not so much for UPDATE/DELETE.
> 
> Once we're able to utilize the new pruning algorithm for all the cases, we
> can move on to refactoring to avoid locking and opening of all partitions.
> As long as we're relying on constraint exclusion for partition pruning,
> which we still do for UPDATE/DELETE, we cannot do that because constraint
> exclusion has to look at each partition individually.
> 
> The UPDATE/DELETE planning for partitioning using huge memory and CPU is a
> pretty big issue and refactoring the planner to avoid that may be what's
> hardest of all the problems to be solved here.

Thank you.  There seem to be many challenges to address...  As a user and PG 
developer, I'd be happy to see some wiki page that describes the current 
performance characteristics in terms of # of partitions, the ideal and 
reachable performance, and the challenges to overcome to reach that ideal goal.


> If the benchmark contains queries that will need to access just one
> partition, then yes, the planning part is the biggest overhead.
> 
> Execution-time overhead is limited to having an extra, possibly needless,
> Append node, but I know David has a patch for that too.

That's good news, thanks.  Our user will perform around a hundred single-row 
INSERT/SELECT/UPDATE/DELETE statements in each transaction, and those are 
PREPAREd.  I hope PG 11 (with David's patches) will work well for that 
workload.  I'll wait for Kato-san's pgbench -M prepared result.

Regards
Takayuki Tsunakawa







Re: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread David Rowley
On 13 July 2018 at 18:53, Tsunakawa, Takayuki
 wrote:
> By the way, what do you think is the "ideal and should-be-feasible" goal and 
> the "realistic" goal we can reach in the near future (e.g. PG 12)?  Say,

Depends. Patches don't move that fast without review, and nothing gets
committed without a committer.

> * Planning and execution time is O(log n), where n is the number of partitions
> * Planning time is O(log n), execution time is O(1)
> * Planning and execution time is O(1), where n is the number of partitions

It's going to depend on how many partitions are pruned. We still need
to generate paths for all non-pruned partitions, which is going to be
slow when there are many partitions.

I think we can get pretty close to the non-partitioned planning
performance with SELECT/UPDATE/DELETE when all but 1 partition
survives pruning. There is always going to be some additional
overhead we can't get rid of, but hopefully it will be small.
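
For example, with a hypothetical range-partitioned table "accounts" keyed on
"aid", the first of these statements can be pruned to a single partition at
plan time, while the second has no partition-key qual, so paths still get
generated for every partition:

-- "accounts" / "aid" are illustrative names
SELECT abalance FROM accounts WHERE aid = 42;
SELECT sum(abalance) FROM accounts WHERE abalance < 0;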

Please feel free to review what I have in the July 'fest.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



RE: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread Tsunakawa, Takayuki
From: David Rowley [mailto:david.row...@2ndquadrant.com]
> > David has submitted multiple patches for PG 12, one of which speeds up
> > pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)
> > What challenges are there for future versions, and which of them are being
> > addressed by patches in progress for PG 12, and which issues are untouched?
> 
> I've not submitted that for PG12 yet. I had other ideas about just
> getting rid of the inheritance planner altogether, but so far don't
> have a patch for that. Still uncertain if there are any huge blockers
> to that either.

Sorry, I seem to have misunderstood something.

By the way, what do you think is the "ideal and should-be-feasible" goal and 
the "realistic" goal we can reach in the near future (e.g. PG 12)?  Say,

* Planning and execution time is O(log n), where n is the number of partitions
* Planning time is O(log n), execution time is O(1)
* Planning and execution time is O(1), where n is the number of partitions

Regards
Takayuki Tsunakawa




Re: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread Amit Langote
On 2018/07/13 14:49, Tsunakawa, Takayuki wrote:
> From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
>> For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
>> is pretty significant and gets worse as the number of partitions grows.
>> I had intended to fix that in PG 11, but we could only manage to get part
>> of that work into PG 11, with significant help from David and others.  So,
>> while PG 11's overhead of partitioning during planning is less than PG
>> 10's due to improved pruning algorithm, it's still pretty far from ideal,
>> because it isn't just the pruning algorithm that had overheads.  In fact,
>> PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
>> carry the overhead that was in PG 10.
> 
> David has submitted multiple patches for PG 12, one of which speeds up 
> pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)

I don't think he has posted a new patch for improving the speed of
UPDATE/DELETE planning yet, although I remember he had posted a PoC patch
back in February or March.

> What challenges are there for future versions, and which of them are being 
> addressed by patches in progress for PG 12, and which issues are untouched?

The immediate one I think is to refactor the planner such that the new
pruning code, that we were able to utilize for SELECT in PG 11, can also
be used for UPDATE/DELETE.  Refactoring needed to replace the pruning
algorithm was minimal for SELECT, but not so much for UPDATE/DELETE.

Once we're able to utilize the new pruning algorithm for all the cases, we
can move on to refactoring to avoid locking and opening of all partitions.
As long as we're relying on constraint exclusion for partition pruning,
which we still do for UPDATE/DELETE, we cannot do that because constraint
exclusion has to look at each partition individually.
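
One way to see that locking and opening from the outside (a sketch, assuming the
test.accounts table from the benchmark in this thread) is to count the relation
locks held right after planning a single-row statement:

BEGIN;
EXPLAIN UPDATE test.accounts SET abalance = 0 WHERE aid = 42;
SELECT count(*) FROM pg_locks
WHERE locktype = 'relation' AND pid = pg_backend_pid();
ROLLBACK;
-- The count is roughly the number of leaf partitions, even though only one
-- partition could ever be affected by this UPDATE.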

The UPDATE/DELETE planning for partitioning using huge memory and CPU is a
pretty big issue and refactoring the planner to avoid that may be what's
hardest of all the problems to be solved here.

>> The overheads I mention stem from
>> the fact that for partitioning we still rely on the old planner code
>> that's used to perform inheritance planning, which requires locking and
>> opening *all* partitions.  We have so far been able to refactor just enough
>> to use the new code for partition pruning, but there is much refactoring
>> work left to avoid needlessly locking and opening all partitions.
> 
> I remember the issue of opening and locking all partitions was discussed 
> before.

Quite a few times I suppose.

> Does this relate only to the planning phase?

If the benchmark contains queries that will need to access just one
partition, then yes, the planning part is the biggest overhead.

Execution-time overhead is limited to having an extra, possibly needless,
Append node, but I know David has a patch for that too.

Thanks,
Amit




Re: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread David Rowley
On 13 July 2018 at 14:58, Kato, Sho  wrote:
> Of course I'm sure table partitioning works well with up to a hundred
> partitions, as stated in the PostgreSQL documentation.
>
> But my customer will use a partitioned table with 1.1k leaf partitions.
>
> So, we need to improve performance.
>
> Any ideas?

It would be really good if you could review and test my partitioning
patches in the current commitfest. These are the ones authored by me
with the word "partition" in the title.

These 4 patches don't solve all the problems, but they do make some
good gains in some areas. I've still more work to do, so the earlier I
can be finished with the 4 that are there now, the more I can focus on
the rest.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread David Rowley
On 13 July 2018 at 17:49, Tsunakawa, Takayuki
 wrote:
> David has submitted multiple patches for PG 12, one of which speeds up 
> pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)  
> What challenges are there for future versions, and which of them are being 
> addressed by patches in progress for PG 12, and which issues are untouched?

I've not submitted that for PG12 yet. I had other ideas about just
getting rid of the inheritance planner altogether, but so far don't
have a patch for that. Still uncertain if there are any huge blockers
to that either. Join search needs to be performed multiple times, but a good
advantage will be that we can take advantage of partition pruning to only
join-search the non-pruned partitions.

The reason I changed my mind about the patch you're talking about is
that it meant that RangeTblEntries needed to keep the same relation id
in the inheritance planner as they did in the grouping planner, and
another patch I have semi-baked delays building both RelOptInfo and
RangeTblEntry records for partitions until after pruning. Without the
RangeTblEntry it was difficult to ensure the relids were in lock-step
between the two planners.  There are ways to do it, it just didn't
look pretty.

Hoping to have something later in the cycle.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: How to make partitioning scale better for larger numbers of partitions

2018-07-13 Thread Justin Pryzby
On Fri, Jul 13, 2018 at 05:49:20AM +, Tsunakawa, Takayuki wrote:
> David has submitted multiple patches for PG 12, one of which speeds up 
> pruning of UPDATE/DELETE (I couldn't find it in the current CF, though.)  
> What challenges are there for future versions, and which of them are being 
> addressed by patches in progress for PG 12, and which issues are untouched?

Here's a known issue which I'm not sure is on anybody's list:
https://www.postgresql.org/message-id/flat/4F989CD8.8020804%40strategicdata.com.au#4f989cd8.8020...@strategicdata.com.au
=> planning of UPDATE/DELETE (but not SELECT) takes a huge amount of RAM when
run on a parent with many children.

We don't typically have UPDATEs, and those we do have are against the child
tables; but we ran into this last month with a manual query on the parent
causing an OOM.  I worked around it, but keep meaning to revisit it to see if
this has changed at all in v11 (very brief testing suggests it has not).
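
For what it's worth, the problematic shape is simply an UPDATE (or DELETE)
issued against the parent (hypothetical names below); per the discussion above,
the planner replans the query once per child, so memory use grows with the
child count even when few rows match:

-- parent_log is a hypothetical parent with many children
UPDATE parent_log SET processed = true
 WHERE event_time < now() - interval '30 days';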

Justin



RE: How to make partitioning scale better for larger numbers of partitions

2018-07-12 Thread Tsunakawa, Takayuki
From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp]
> For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
> is pretty significant and gets worse as the number of partitions grows.
> I had intended to fix that in PG 11, but we could only manage to get part
> of that work into PG 11, with significant help from David and others.  So,
> while PG 11's overhead of partitioning during planning is less than PG
> 10's due to improved pruning algorithm, it's still pretty far from ideal,
> because it isn't just the pruning algorithm that had overheads.  In fact,
> PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
> carry the overhead that was in PG 10.

David has submitted multiple patches for PG 12, one of which speeds up pruning 
of UPDATE/DELETE (I couldn't find it in the current CF, though.)  What 
challenges are there for future versions, and which of them are being addressed 
by patches in progress for PG 12, and which issues are untouched?

> The overheads I mention stem from
> the fact that for partitioning we still rely on the old planner code
> that's used to perform inheritance planning, which requires locking and
> opening *all* partitions.  We have so far been able to refactor just enough
> to use the new code for partition pruning, but there is much refactoring
> work left to avoid needlessly locking and opening all partitions.

I remember the issue of opening and locking all partitions was discussed 
before.  Does this relate only to the planning phase?

Kato-san, could you try pgbench -M prepared?


Regards
Takayuki Tsunakawa






Re: How to make partitioning scale better for larger numbers of partitions

2018-07-12 Thread Amit Langote
Kato-san,

On 2018/07/13 11:58, Kato, Sho wrote:
> Hi,
> 
> I benchmarked on a RANGE partitioned table with 1.1k leaf partitions and no 
> sub-partitioned tables.

Thanks for sharing the results.

> But statement latencies on a partitioned table are much higher than on a
> non-partitioned table.
> 
> UPDATE latency is 210 times higher than on a non-partitioned table.
> SELECT latency is 36 times higher than on a non-partitioned table.
> Surprisingly, INSERT latency is almost the same.

Yes, INSERT comes out ahead because there is no overhead of partitioning
in the planning phase.  As David Rowley reported on the nearby thread
("Speeding up INSERTs and UPDATEs to partitioned tables"), there is still
significant overhead during its execution, so we're still a fair bit away
from the best possible performance.

For SELECT/UPDATE/DELETE, overhead of partitioning in the planning phase
is pretty significant and gets worse as the number of partitions grows.  I
had intended to fix that in PG 11, but we could only manage to get part of
that work into PG 11, with significant help from David and others.  So,
while PG 11's overhead of partitioning during planning is less than PG
10's due to improved pruning algorithm, it's still pretty far from ideal,
because it isn't just the pruning algorithm that had overheads.  In fact,
PG 11 only removes the pruning overhead for SELECT, so UPDATE/DELETE still
carry the overhead that was in PG 10.  The overheads I mention stem from
the fact that for partitioning we still rely on the old planner code
that's used to perform inheritance planning, which requires locking and
opening *all* partitions.  We have so far been able to refactor just enough
to use the new code for partition pruning, but there is much refactoring
work left to avoid needlessly locking and opening all partitions.
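
The INSERT case can be seen directly in its plan: routing to the correct leaf
partition happens at execution time, so the plan is a single Insert node
regardless of the partition count (using the benchmark's test.accounts table;
only the shape of the plan matters here):

EXPLAIN INSERT INTO test.accounts (aid, abalance) VALUES (42, 0);
-- roughly:
--   Insert on accounts
--     ->  Result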

Thanks,
Amit