Re: Thousands of partitions performance questions

2019-04-29 Thread David Rowley
On Mon, 29 Apr 2019 at 19:20, Shai Cantor  wrote:
> Some notes I haven't shared or were not clear enough on the previous post.
>
> Data is inserted using the copy command only and in an offline manner. 
> Meaning, no user action creates or updates the data; an offline job 
> runs it. The number of inserts can reach up to 1500 a day.
> Queries are only on a single partition
>
> Can that help performance?

Yes, COPY will be more efficient. In PG11 it still means locking all
partitions, but that locking overhead is better amortized the more
tuples are inserted at once.  The same goes for INSERT with multiple
rows in the VALUES clause.
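As a rough sketch (table, column, and file names here are hypothetical,
not from the thread), both approaches amortize the per-statement and
locking overhead over many tuples:

```sql
-- COPY loads an entire build's rows in one statement:
COPY code_elements FROM '/tmp/build_1234_elements.csv' WITH (FORMAT csv);

-- A multi-row INSERT gets a similar benefit compared to issuing
-- one single-row INSERT per tuple:
INSERT INTO code_elements (build_id, element_type, file_id)
VALUES (1234, 'method', 10),
       (1234, 'line',   10),
       (1234, 'branch', 11);
```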

> Will querying directly the partition tables help?

Yes. If you're able to determine which partition to query from within
the application and write that in the query instead, then there will be
much less planner overhead in PG11. It'll be pretty much the same as
querying a normal table.
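For example (names hypothetical), assuming build 1234's rows live in a
partition named code_elements_build_1234:

```sql
-- Via the parent, the PG11 planner still loads metadata for every
-- partition before pruning down to one:
SELECT element_type, count(*)
FROM code_elements
WHERE build_id = 1234
GROUP BY element_type;

-- Hitting the partition directly plans like a normal table:
SELECT element_type, count(*)
FROM code_elements_build_1234
GROUP BY element_type;
```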

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Thousands of partitions performance questions

2019-04-29 Thread Shai Cantor
Thanks a lot for your feedback.
We're going to use PG 11 on AWS RDS.
I will do some workload simulations.
Some notes I haven't shared or were not clear enough on the previous post.

   1. *Data is inserted using the copy command only* and in an offline
   manner. Meaning, no user action creates or updates the data; an offline
   job runs it. The number of inserts can reach up to 1500 a day.
   2. *Queries are only on a single partition*



   - Can that help performance?
   - Will querying directly the partition tables help?


On Mon, Apr 29, 2019 at 8:46 AM David Rowley 
wrote:

> On Mon, 29 Apr 2019 at 17:13, Shai Cantor  wrote:
> > Will the db hold 135000 (45000 * 3 months) partitions under the
> assumption that I query only 1 partition?
> > Should I model it differently in terms of schema, partitions etc.?
>
> Which PG version?
>
> Before PG11 the query planner pruned unneeded partitions by looking at
> each partition and determining whether the partition constraint could
> not match the base quals of the query.  PG11 improved this by adding a
> smarter and faster algorithm to get rid of non-matching partitions;
> however, this only speeds things up a little, as it only allows the
> planner to skip generating paths for those partitions. Certain metadata
> is still loaded, which is not really slow per partition, but it is slow
> if you have thousands of partitions.
>
> PG11 is also pretty bad at planning UPDATEs/DELETEs on partitioned
> tables with a large number of partitions. You'll at best suffer high
> planning times for these queries, and at worst hit out-of-memory
> errors with that many partitions.
>
> The yet-to-be-released PG12 improves both of these deficiencies,
> providing you can eliminate the majority of partitions during query
> planning. PG12 also improves the performance of INSERT into a
> partitioned table: where PG11 obtained a lock on every partition,
> PG12 only grabs a lock on a partition the first time the INSERT
> command inserts a row into it. If you're just INSERTing 1 row per
> command into a partitioned table with many partitions, this makes a
> pretty big difference.
>
> Depending on the types of query you're running, it's likely not a good
> idea to go above 100 or so partitions with PG11. You might get away
> with more if you're running a more data-warehouse type load, i.e. fewer
> but longer-running queries, but for a more OLTP-type workload, with
> more queries and lower latencies, you may struggle to cope with even a
> dozen.
>
> I'd recommend you do workload simulations with whatever number you
> choose and ensure performance is to the level you require before
> getting in too deep with your design.
>
> If your go-live date is near the end of the year or beyond, then it
> might be a good idea to start testing with PG12 right away. The
> release date for that will likely be around the middle of October.
>
> --
>  David Rowley   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
>


Re: Thousands of partitions performance questions

2019-04-28 Thread David Rowley
On Mon, 29 Apr 2019 at 17:13, Shai Cantor  wrote:
> Will the db hold 135000 (45000 * 3 months) partitions under the assumption 
> that I query only 1 partition?
> Should I model it differently in terms of schema, partitions etc.?

Which PG version?

Before PG11 the query planner pruned unneeded partitions by looking at
each partition and determining whether the partition constraint could
not match the base quals of the query.  PG11 improved this by adding a
smarter and faster algorithm to get rid of non-matching partitions;
however, this only speeds things up a little, as it only allows the
planner to skip generating paths for those partitions. Certain metadata
is still loaded, which is not really slow per partition, but it is slow
if you have thousands of partitions.
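One way to see how much pruning the planner manages is to run EXPLAIN
against the parent table (table and column names here are hypothetical
sketches, not from the thread):

```sql
-- If pruning works, only the matching partition appears in the plan;
-- if every partition is listed, the qual did not usefully constrain
-- the partition key:
EXPLAIN (COSTS OFF)
SELECT *
FROM code_elements
WHERE build_id = 1234;
```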

PG11 is also pretty bad at planning UPDATEs/DELETEs on partitioned
tables with a large number of partitions. You'll at best suffer high
planning times for these queries, and at worst hit out-of-memory
errors with that many partitions.

The yet-to-be-released PG12 improves both of these deficiencies,
providing you can eliminate the majority of partitions during query
planning. PG12 also improves the performance of INSERT into a
partitioned table: where PG11 obtained a lock on every partition,
PG12 only grabs a lock on a partition the first time the INSERT
command inserts a row into it. If you're just INSERTing 1 row per
command into a partitioned table with many partitions, this makes a
pretty big difference.

Depending on the types of query you're running, it's likely not a good
idea to go above 100 or so partitions with PG11. You might get away
with more if you're running a more data-warehouse type load, i.e. fewer
but longer-running queries, but for a more OLTP-type workload, with
more queries and lower latencies, you may struggle to cope with even a
dozen.

I'd recommend you do workload simulations with whatever number you
choose and ensure performance is to the level you require before
getting in too deep with your design.

If your go-live date is near the end of the year or beyond, then it
might be a good idea to start testing with PG12 right away. The
release date for that will likely be around the middle of October.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Thousands of partitions performance questions

2019-04-28 Thread Shai Cantor
   - *General*
  - Our system gathers data from CI builds regarding a customer's code
  base
   - *Data includes*
  - type
 - methods/lines/branches A.K.A - code elements
 - files
  - *Queries*
  - The data is queried for a specific build only
  - Queries are aggregations on the code elements and files with some
  filtering and sorting
   - *Volume*
  - I expect to have about 1500 builds a day, 45000 builds a month
  - A build can have around 30 code elements and around 3 files
   - *Retention*
  - Thought about keeping 90 days of builds as retention
   - *Plan*
  - 2 tables
 - code elements
 - files
  - create a partition for each build
  - each day delete partitions older than 90 days
  - create 2 schemas for each client
 -  schema which holds the parent tables
 - _partitions schema that will hold the partitions
  - upon a new build
 - create a partition for the code elements table and for the files
 table in the "_partitions" schema
  - *Questions*
  - Will the db hold 135000 (45000 * 3 months) partitions under the
  assumption that *I query only 1 partition?*
  - Should I model it differently in terms of schema, partitions etc.?
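The plan above might be sketched as follows (all schema, table, and
column names are hypothetical, and LIST-partitioning on the build id is
just one possible layout):

```sql
-- Parent table in the client's schema, partitioned by build id:
CREATE TABLE client_a.code_elements (
    build_id     bigint NOT NULL,
    element_type text   NOT NULL,
    file_id      bigint NOT NULL
) PARTITION BY LIST (build_id);

-- Upon a new build, create its partition in the "_partitions" schema:
CREATE TABLE client_a_partitions.code_elements_b1234
    PARTITION OF client_a.code_elements FOR VALUES IN (1234);

-- Daily retention job: detach, then drop, partitions older than 90 days.
ALTER TABLE client_a.code_elements
    DETACH PARTITION client_a_partitions.code_elements_b1234;
DROP TABLE client_a_partitions.code_elements_b1234;
```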