Re: [HACKERS] Declarative partitioning optimization for large amount of partitions

2017-03-24 Thread Amit Langote
On Fri, Mar 24, 2017 at 11:06 PM, Simon Riggs  wrote:
> On 1 March 2017 at 01:36, Amit Langote  wrote:
>
>> I don't know which way you're thinking of fixing this, but a planner patch
>> to implement faster partition-pruning will have taken care of this, I
>> think.  As you may know, even declarative partitioned tables currently
>> depend on constraint exclusion for partition-pruning and planner's current
>> approach of handling inheritance requires to open all the child tables
>> (partitions), whereas the new approach hopefully shouldn't need to do
>> that.  I am not sure if looking for a more localized fix for this would be
>> worthwhile, although I may be wrong.
>
> What "new approach" are we discussing?
>
> Is there a patch or design discussion?

Neither at the moment.  As Aleksander said in his reply, I was
referring to Dmitry Ivanov's plan to port pg_pathman's planner
functionality to core, which he mentioned on the declarative
partitioning thread back in December [1].

Thanks,
Amit

[1] https://www.postgresql.org/message-id/426b2b01-61e0-43aa-bd84-c6fcf516f1c3%40postgrespro.ru




Re: [HACKERS] Declarative partitioning optimization for large amount of partitions

2017-03-24 Thread Aleksander Alekseev
Hi Simon,

> > I don't know which way you're thinking of fixing this, but a planner patch
> > to implement faster partition-pruning will have taken care of this, I
> > think.  As you may know, even declarative partitioned tables currently
> > depend on constraint exclusion for partition-pruning and planner's current
> > approach of handling inheritance requires to open all the child tables
> > (partitions), whereas the new approach hopefully shouldn't need to do
> > that.  I am not sure if looking for a more localized fix for this would be
> > worthwhile, although I may be wrong.
> 
> What "new approach" are we discussing?
> Is there a patch or design discussion?

I think what was meant were my colleague Dmitry Ivanov's plans to
implement partition-pruning. I've just spoken with Dmitry about this
matter. Unless someone else is already working on this optimization,
we would like to work on it together. Unfortunately, there is no patch
or design discussion of partition-pruning in this commitfest yet.

-- 
Best regards,
Aleksander Alekseev




Re: [HACKERS] Declarative partitioning optimization for large amount of partitions

2017-03-24 Thread Simon Riggs
On 1 March 2017 at 01:36, Amit Langote  wrote:

> I don't know which way you're thinking of fixing this, but a planner patch
> to implement faster partition-pruning will have taken care of this, I
> think.  As you may know, even declarative partitioned tables currently
> depend on constraint exclusion for partition-pruning and planner's current
> approach of handling inheritance requires to open all the child tables
> (partitions), whereas the new approach hopefully shouldn't need to do
> that.  I am not sure if looking for a more localized fix for this would be
> worthwhile, although I may be wrong.

What "new approach" are we discussing?

Is there a patch or design discussion?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: [HACKERS] Declarative partitioning optimization for large amount of partitions

2017-03-10 Thread Aleksander Alekseev
Hi Tels,

Thanks a lot for the review!

> "corresponding"

Fixed.

> Also a question: Some one-line comments are
> 
>  /* Comment. */
> 
> while others are
> 
>  /*
>   * Comment.
>   */
> 
> Why the difference?

I'm trying to follow the code style used in the code I'm modifying. In
this case I got the impression that the second style of one-line
comments is used more often, and I tried to follow it, but apparently I
didn't succeed :) This is fixed as well.

On Thu, Mar 09, 2017 at 06:40:39PM -0500, Tels wrote:
> Hi Aleksander,
> 
> noticed this in your patch:
> 
> > * Add a corespinding entry to pgStatTabHash.
> 
> "corresponding"
> 
> Also a question: Some one-line comments are
> 
>  /* Comment. */
> 
> while others are
> 
>  /*
>   * Comment.
>   */
> 
> Why the difference?
> 
> Hope this helps,
> 
> Tels

-- 
Best regards,
Aleksander Alekseev
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 7cacb1e9b2..a22a5a37c8 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -161,6 +161,20 @@ typedef struct TabStatusArray
 static TabStatusArray *pgStatTabList = NULL;
 
 /*
+ * pgStatTabHash entry
+ */
+typedef struct TabStatHashEntry
+{
+	Oid t_id;
+	PgStat_TableStatus* tsa_entry;
+} TabStatHashEntry;
+
+/*
+ * Hash table for O(1) t_id -> tsa_entry lookup
+ */
+static HTAB *pgStatTabHash = NULL;
+
+/*
  * Backends store per-function info that's waiting to be sent to the collector
  * in this hash table (indexed by function OID).
  */
@@ -815,7 +829,13 @@ pgstat_report_stat(bool force)
 pgstat_send_tabstat(this_msg);
 this_msg->m_nentries = 0;
 			}
+
+			/*
+			 * Entry will be zeroed soon so we need to remove it from the lookup table.
+			 */
+			hash_search(pgStatTabHash, &entry->t_id, HASH_REMOVE, NULL);
 		}
+
 		/* zero out TableStatus structs after use */
 		MemSet(tsa->tsa_entries, 0,
 			   tsa->tsa_used * sizeof(PgStat_TableStatus));
@@ -1667,59 +1687,77 @@ pgstat_initstats(Relation rel)
 }
 
 /*
+ * Make sure pgStatTabList and pgStatTabHash are initialized.
+ */
+static void
+make_sure_stat_tab_initialized()
+{
+	HASHCTL ctl;
+
+	if (pgStatTabList)
+		return;
+
+	Assert(!pgStatTabHash);
+
+	memset(&ctl, 0, sizeof(ctl));
+	ctl.keysize = sizeof(Oid);
+	ctl.entrysize = sizeof(TabStatHashEntry);
+	ctl.hcxt = TopMemoryContext;
+
+	pgStatTabHash = hash_create("pgstat t_id to tsa_entry lookup table",
+		TABSTAT_QUANTUM, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
+
+	pgStatTabList = (TabStatusArray *) MemoryContextAllocZero(TopMemoryContext,
+		sizeof(TabStatusArray));
+}
+
+/*
  * get_tabstat_entry - find or create a PgStat_TableStatus entry for rel
  */
 static PgStat_TableStatus *
 get_tabstat_entry(Oid rel_id, bool isshared)
 {
+	TabStatHashEntry* hash_entry;
 	PgStat_TableStatus *entry;
 	TabStatusArray *tsa;
-	TabStatusArray *prev_tsa;
-	int			i;
+	bool found;
+
+	make_sure_stat_tab_initialized();
+
+	/*
+	 * Find an entry or create a new one.
+	 */
+	hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
+	if(found)
+		return hash_entry->tsa_entry;
 
 	/*
-	 * Search the already-used tabstat slots for this relation.
+	 * `hash_entry` was just created and now we have to fill it.
+	 * First make sure there is a free space in a last element of pgStatTabList.
 	 */
-	prev_tsa = NULL;
-	for (tsa = pgStatTabList; tsa != NULL; prev_tsa = tsa, tsa = tsa->tsa_next)
+	tsa = pgStatTabList;
+	while(tsa->tsa_used == TABSTAT_QUANTUM)
 	{
-		for (i = 0; i < tsa->tsa_used; i++)
+		if(tsa->tsa_next == NULL)
 		{
-			entry = &tsa->tsa_entries[i];
-			if (entry->t_id == rel_id)
-				return entry;
+			tsa->tsa_next = (TabStatusArray *) MemoryContextAllocZero(TopMemoryContext,
+		sizeof(TabStatusArray));
 		}
 
-		if (tsa->tsa_used < TABSTAT_QUANTUM)
-		{
-			/*
-			 * It must not be present, but we found a free slot instead. Fine,
-			 * let's use this one.  We assume the entry was already zeroed,
-			 * either at creation or after last use.
-			 */
-			entry = &tsa->tsa_entries[tsa->tsa_used++];
-			entry->t_id = rel_id;
-			entry->t_shared = isshared;
-			return entry;
-		}
+		tsa = tsa->tsa_next;
 	}
 
 	/*
-	 * We ran out of tabstat slots, so allocate more.  Be sure they're zeroed.
-	 */
-	tsa = (TabStatusArray *) MemoryContextAllocZero(TopMemoryContext,
-	sizeof(TabStatusArray));
-	if (prev_tsa)
-		prev_tsa->tsa_next = tsa;
-	else
-		pgStatTabList = tsa;
-
-	/*
-	 * Use the first entry of the new TabStatusArray.
+	 * Add an entry.
 	 */
 	entry = &tsa->tsa_entries[tsa->tsa_used++];
 	entry->t_id = rel_id;
 	entry->t_shared = isshared;
+
+	/*
+	 * Add a corresponding entry to pgStatTabHash.
+	 */
+	hash_entry->tsa_entry = entry;
 	return entry;
 }
 
@@ -1731,22 +1769,19 @@ get_tabstat_entry(Oid rel_id, bool isshared)
 PgStat_TableStatus *
 find_tabstat_entry(Oid rel_id)
 {
-	PgStat_TableStatus *entry;
-	TabStatusArray *tsa;
-	int			i;
+	TabStatHashEntry* hash_entry;
 
-	for (tsa = pgStatTabList; tsa != NULL; tsa = tsa->tsa_next)
-	{
-		

Re: [HACKERS] Declarative partitioning optimization for large amount of partitions

2017-03-09 Thread Tels
Hi Aleksander,

noticed this in your patch:

> * Add a corespinding entry to pgStatTabHash.

"corresponding"

Also a question: Some one-line comments are

 /* Comment. */

while others are

 /*
  * Comment.
  */

Why the difference?

Hope this helps,

Tels

PS: Sorry if this appears twice, I fumbled the From: ...





Re: [HACKERS] Declarative partitioning optimization for large amount of partitions

2017-03-09 Thread Tels
Hi Aleksander,

noticed this in your patch:

> * Add a corespinding entry to pgStatTabHash.

"corresponding"

Also a question: Some one-line comments are

 /* Comment. */

while others are

 /*
  * Comment.
  */

Why the difference?

Hope this helps,

Tels




Re: [HACKERS] Declarative partitioning optimization for large amount of partitions

2017-03-09 Thread Aleksander Alekseev
Hi Amit,

> Sorry, I didn't mean to dissuade you from trying those
> micro-optimizations.  If general inheritance cases can benefit from it
> (which, until we have a different method, will be used by partitioned
> tables as well), I think we should try it.

OK, I'll see what could be done here as well then.

On Tue, Mar 07, 2017 at 10:55:12AM +0900, Amit Langote wrote:
> Hi Aleksander,
> 
> On 2017/03/07 0:22, Aleksander Alekseev wrote:
> > Hello.
> > 
> > OK, here is a patch.
> > 
> > Benchmark, before:
> > 
> > ```
> > number of transactions actually processed: 1823
> > latency average = 1153.495 ms
> > latency stddev = 154.366 ms
> > tps = 6.061104 (including connections establishing)
> > tps = 6.061211 (excluding connections establishing)
> > ```
> > 
> > Benchmark, after:
> > 
> > ```
> > number of transactions actually processed: 2462
> > latency average = 853.862 ms
> > latency stddev = 112.270 ms
> > tps = 8.191861 (including connections establishing)
> > tps = 8.192028 (excluding connections establishing)
> > ```
> > 
> > +35% TPS, just as expected. Feel free to run your own benchmarks on
> > different datasets and workloads. `perf top` shows that the first bottleneck
> > is completely eliminated.
> 
> That seems like a good gain.
> 
> > I did nothing about the second bottleneck
> > since, as Amit mentioned, partition-pruning should solve it anyway and
> > any micro-optimizations there don't seem worth the effort.
> 
> Sorry, I didn't mean to dissuade you from trying those
> micro-optimizations.  If general inheritance cases can benefit from it
> (which, until we have a different method, will be used by partitioned
> tables as well), I think we should try it.
> 
> Thanks,
> Amit
> 
> 

-- 
Best regards,
Aleksander Alekseev




Re: [HACKERS] Declarative partitioning optimization for large amount of partitions

2017-03-09 Thread Aleksander Alekseev
Hi, Andres

Thanks a lot for the review!

> Why are we keeping that list / the "batch" allocation pattern? It
> doesn't actually seem useful to me after these changes.  Given that we
> don't shrink hash-tables, we could just use the hash-table memory for
> this, no?

I don't think we can do that. According to comments:

```
 * NOTE: once allocated, TabStatusArray structures are never moved or deleted
 [...]
 * This allows relcache pgstat_info pointers to be treated as long-lived data,
 * avoiding repeated searches in pgstat_initstats() when a relation is
 * repeatedly opened during a transaction.
```

It is my understanding that dynahash can't give the same guarantees.
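
That is why the hash table only stores pointers into those arrays rather
than the entries themselves. Roughly this layout (a sketch; the real
definitions are in the attached patch):

```
/*
 * Sketch: the dynahash entry is just an index record.  The
 * PgStat_TableStatus structs stay in TabStatusArray and never move,
 * so long-lived relcache pgstat_info pointers remain valid.
 */
typedef struct TabStatHashEntry
{
	Oid			t_id;			/* hash key: relation OID */
	PgStat_TableStatus *tsa_entry;	/* points at a TabStatusArray slot */
} TabStatHashEntry;
```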

> 'faster ... lookup' doesn't strike me as a useful comment, because it's
> only useful in relation of the current code to the new code.

Agree, fixed to 'O(1) ... lookup'.

> It's not really freed...

Right, it's actually zeroed. Fixed.

> Arguably it'd be better to HASH_ENTER directly here, instead of doing
> two lookups.

Good point. Fixed.
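
I.e. the lookup and the insert now happen in a single hash_search() call.
A rough sketch of the pattern (see the attached patch for the real code):

```
/*
 * HASH_ENTER finds an existing entry or creates a missing one;
 * "found" tells us which case we hit.
 */
bool		found;
TabStatHashEntry *hash_entry;

hash_entry = hash_search(pgStatTabHash, &rel_id, HASH_ENTER, &found);
if (found)
	return hash_entry->tsa_entry;

/* ... otherwise allocate a TabStatusArray slot and fill hash_entry->tsa_entry ... */
```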

> Think it'd be better to move the hash creation into its own function.

Agree, fixed.

On Mon, Mar 06, 2017 at 12:01:50PM -0800, Andres Freund wrote:
> Hi,
> 
> This issue has bothered me in non-partitioned use-cases recently, so
> thanks for taking it up.
> 
> 
> On 2017-03-06 18:22:17 +0300, Aleksander Alekseev wrote:
> > diff --git a/src/backend/postmaster/pgstat.c 
> > b/src/backend/postmaster/pgstat.c
> > index 2fb9a8bf58..fa906e7930 100644
> > --- a/src/backend/postmaster/pgstat.c
> > +++ b/src/backend/postmaster/pgstat.c
> > @@ -160,6 +160,16 @@ typedef struct TabStatusArray
> >  
> >  static TabStatusArray *pgStatTabList = NULL;
> 
> Why are we keeping that list / the "batch" allocation pattern? It
> doesn't actually seem useful to me after these changes.  Given that we
> don't shrink hash-tables, we could just use the hash-table memory for
> this, no?
> 
> I think a separate list that only contains things that are "active" in
> the current transaction makes sense, because the current logic requires
> iterating over a potentially very large array at transaction commit.
> 
> 
> > +/* pgStatTabHash entry */
> > +typedef struct TabStatHashEntry
> > +{
> > +   Oid t_id;
> > +   PgStat_TableStatus* tsa_entry;
> > +} TabStatHashEntry;
> > +
> > +/* Hash table for faster t_id -> tsa_entry lookup */
> > +static HTAB *pgStatTabHash = NULL;
> > +
> 
> 'faster ... lookup' doesn't strike me as a useful comment, because it's
> only useful in relation of the current code to the new code.
> 
> 
> 
> >  /*
> >   * Backends store per-function info that's waiting to be sent to the 
> > collector
> >   * in this hash table (indexed by function OID).
> > @@ -815,7 +825,13 @@ pgstat_report_stat(bool force)
> > pgstat_send_tabstat(this_msg);
> > this_msg->m_nentries = 0;
> > }
> > +
> > +   /*
> > +* Entry will be freed soon so we need to remove it 
> > from the lookup table.
> > +*/
> > +   hash_search(pgStatTabHash, &entry->t_id, HASH_REMOVE, NULL);
> > }
> 
> It's not really freed...
> 
> 
> >  static PgStat_TableStatus *
> >  get_tabstat_entry(Oid rel_id, bool isshared)
> >  {
> > +   TabStatHashEntry* hash_entry;
> > PgStat_TableStatus *entry;
> > TabStatusArray *tsa;
> > -   TabStatusArray *prev_tsa;
> > -   int i;
> > +
> > +   /* Try to find an entry */
> > +   entry = find_tabstat_entry(rel_id);
> > +   if(entry != NULL)
> > +   return entry;
> 
> Arguably it'd be better to HASH_ENTER directly here, instead of doing
> two lookups.
> 
> 
> > /*
> > -* Search the already-used tabstat slots for this relation.
> > +* Entry doesn't exist - creating one.
> > +* First make sure there is a free space in a last element of 
> > pgStatTabList.
> >  */
> > -   prev_tsa = NULL;
> > -   for (tsa = pgStatTabList; tsa != NULL; prev_tsa = tsa, tsa = 
> > tsa->tsa_next)
> > +   if (!pgStatTabList)
> > {
> > -   for (i = 0; i < tsa->tsa_used; i++)
> > -   {
> > -   entry = &tsa->tsa_entries[i];
> > -   if (entry->t_id == rel_id)
> > -   return entry;
> > -   }
> > +   HASHCTL ctl;
> >  
> > -   if (tsa->tsa_used < TABSTAT_QUANTUM)
> > +   Assert(!pgStatTabHash);
> > +
> > +   memset(&ctl, 0, sizeof(ctl));
> > +   ctl.keysize = sizeof(Oid);
> > +   ctl.entrysize = sizeof(TabStatHashEntry);
> > +   ctl.hcxt = TopMemoryContext;
> > +
> > +   pgStatTabHash = hash_create("pgstat t_id to tsa_entry lookup 
> > table",
> > +   TABSTAT_QUANTUM, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
> 
> Think it'd be better to move the hash creation into its own function.
> 
> 
> - Andres

-- 
Best regards,
Aleksander Alekseev
diff --git a/src/backend/postmaster/pgstat.c 

Re: [HACKERS] Declarative partitioning optimization for large amount of partitions

2017-03-06 Thread Amit Langote
Hi Aleksander,

On 2017/03/07 0:22, Aleksander Alekseev wrote:
> Hello.
> 
> OK, here is a patch.
> 
> Benchmark, before:
> 
> ```
> number of transactions actually processed: 1823
> latency average = 1153.495 ms
> latency stddev = 154.366 ms
> tps = 6.061104 (including connections establishing)
> tps = 6.061211 (excluding connections establishing)
> ```
> 
> Benchmark, after:
> 
> ```
> number of transactions actually processed: 2462
> latency average = 853.862 ms
> latency stddev = 112.270 ms
> tps = 8.191861 (including connections establishing)
> tps = 8.192028 (excluding connections establishing)
> ```
> 
> +35% TPS, just as expected. Feel free to run your own benchmarks on
> different datasets and workloads. `perf top` shows that the first bottleneck
> is completely eliminated.

That seems like a good gain.

> I did nothing about the second bottleneck
> since, as Amit mentioned, partition-pruning should solve it anyway and
> any micro-optimizations there don't seem worth the effort.

Sorry, I didn't mean to dissuade you from trying those
micro-optimizations.  If general inheritance cases can benefit from it
(which, until we have a different method, will be used by partitioned
tables as well), I think we should try it.

Thanks,
Amit






Re: [HACKERS] Declarative partitioning optimization for large amount of partitions

2017-03-06 Thread Andres Freund
Hi,

This issue has bothered me in non-partitioned use-cases recently, so
thanks for taking it up.


On 2017-03-06 18:22:17 +0300, Aleksander Alekseev wrote:
> diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
> index 2fb9a8bf58..fa906e7930 100644
> --- a/src/backend/postmaster/pgstat.c
> +++ b/src/backend/postmaster/pgstat.c
> @@ -160,6 +160,16 @@ typedef struct TabStatusArray
>  
>  static TabStatusArray *pgStatTabList = NULL;

Why are we keeping that list / the "batch" allocation pattern? It
doesn't actually seem useful to me after these changes.  Given that we
don't shrink hash-tables, we could just use the hash-table memory for
this, no?

I think a separate list that only contains things that are "active" in
the current transaction makes sense, because the current logic requires
iterating over a potentially very large array at transaction commit.
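
Roughly what I have in mind (a completely untested sketch, with made-up
field names):

```
/*
 * The counters live directly in the dynahash entry, and a small list
 * links only the entries touched in the current transaction, so commit
 * doesn't have to walk everything ever allocated.
 */
typedef struct TabStatHashEntry
{
	Oid			t_id;			/* hash key */
	PgStat_TableStatus tstatus;	/* counters stored in the entry itself */
	struct TabStatHashEntry *next_active;	/* touched in current xact */
} TabStatHashEntry;

static HTAB *pgStatTabHash = NULL;
static TabStatHashEntry *pgStatActiveList = NULL;	/* per-xact list head */
```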


> +/* pgStatTabHash entry */
> +typedef struct TabStatHashEntry
> +{
> + Oid t_id;
> + PgStat_TableStatus* tsa_entry;
> +} TabStatHashEntry;
> +
> +/* Hash table for faster t_id -> tsa_entry lookup */
> +static HTAB *pgStatTabHash = NULL;
> +

'faster ... lookup' doesn't strike me as a useful comment, because it's
only useful in relation of the current code to the new code.



>  /*
>   * Backends store per-function info that's waiting to be sent to the 
> collector
>   * in this hash table (indexed by function OID).
> @@ -815,7 +825,13 @@ pgstat_report_stat(bool force)
>   pgstat_send_tabstat(this_msg);
>   this_msg->m_nentries = 0;
>   }
> +
> + /*
> +  * Entry will be freed soon so we need to remove it 
> from the lookup table.
> +  */
> +	hash_search(pgStatTabHash, &entry->t_id, HASH_REMOVE, NULL);
>   }

It's not really freed...


>  static PgStat_TableStatus *
>  get_tabstat_entry(Oid rel_id, bool isshared)
>  {
> + TabStatHashEntry* hash_entry;
>   PgStat_TableStatus *entry;
>   TabStatusArray *tsa;
> - TabStatusArray *prev_tsa;
> - int i;
> +
> + /* Try to find an entry */
> + entry = find_tabstat_entry(rel_id);
> + if(entry != NULL)
> + return entry;

Arguably it'd be better to HASH_ENTER directly here, instead of doing
two lookups.


>   /*
> -  * Search the already-used tabstat slots for this relation.
> +  * Entry doesn't exist - creating one.
> +  * First make sure there is a free space in a last element of 
> pgStatTabList.
>*/
> - prev_tsa = NULL;
> - for (tsa = pgStatTabList; tsa != NULL; prev_tsa = tsa, tsa = 
> tsa->tsa_next)
> + if (!pgStatTabList)
>   {
> - for (i = 0; i < tsa->tsa_used; i++)
> - {
> -	entry = &tsa->tsa_entries[i];
> - if (entry->t_id == rel_id)
> - return entry;
> - }
> + HASHCTL ctl;
>  
> - if (tsa->tsa_used < TABSTAT_QUANTUM)
> + Assert(!pgStatTabHash);
> +
> +	memset(&ctl, 0, sizeof(ctl));
> + ctl.keysize = sizeof(Oid);
> + ctl.entrysize = sizeof(TabStatHashEntry);
> + ctl.hcxt = TopMemoryContext;
> +
> + pgStatTabHash = hash_create("pgstat t_id to tsa_entry lookup 
> table",
> +	TABSTAT_QUANTUM, &ctl, HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);

Think it'd be better to move the hash creation into its own function.


- Andres




Re: [HACKERS] Declarative partitioning optimization for large amount of partitions

2017-03-06 Thread Aleksander Alekseev
Hello.

OK, here is a patch.

Benchmark, before:

```
number of transactions actually processed: 1823
latency average = 1153.495 ms
latency stddev = 154.366 ms
tps = 6.061104 (including connections establishing)
tps = 6.061211 (excluding connections establishing)
```

Benchmark, after:

```
number of transactions actually processed: 2462
latency average = 853.862 ms
latency stddev = 112.270 ms
tps = 8.191861 (including connections establishing)
tps = 8.192028 (excluding connections establishing)
```

+35% TPS, just as expected. Feel free to run your own benchmarks on
different datasets and workloads. `perf top` shows that the first
bottleneck is completely eliminated. I did nothing about the second
bottleneck since, as Amit mentioned, partition-pruning should solve it
anyway, and any micro-optimizations there don't seem worth the effort.

Also I checked test coverage using lcov to make sure that this code is
covered by tests. The exact script I'm using can be found here [1].

[1] https://github.com/afiskon/pgscripts/blob/master/code-coverage.sh

On Wed, Mar 01, 2017 at 10:36:24AM +0900, Amit Langote wrote:
> Hi,
> 
> On 2017/02/28 23:25, Aleksander Alekseev wrote:
> > Hello.
> > 
> > I decided to figure out whether current implementation of declarative
> > partitioning has any bottlenecks when there is a lot of partitions. Here
> > is what I did [1].
> 
> Thanks for sharing.
> 
> > Then:
> > 
> > ```
> > # 2580 is some pk that exists
> > echo 'select * from part_test where pk = 2580;' > t.sql
> > pgbench -j 7 -c 7 -f t.sql -P 1 -T 300 eax
> > ```
> > 
> > `perf top` showed two bottlenecks [2]. A stacktrace for the first one
> > looks like this [3]:
> > 
> > ```
> > 0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') 
> > at pgstat.c:1689
> > 1689if (entry->t_id == rel_id)
> > #0  0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 
> > '\000') at pgstat.c:1689
> > #1  0x007a4275 in pgstat_initstats (rel=0x7f4af3fd41f8) at 
> > pgstat.c:1666
> > #2  0x004c7090 in relation_open (relationId=25696, lockmode=0) at 
> > heapam.c:1137
> > #3  0x004c72c9 in heap_open (relationId=25696, lockmode=0) at 
> > heapam.c:1291
> > (skipped)
> > ```
> > 
> > And here is a stacktrace for the second bottleneck [4]:
> > 
> > ```
> > 0x00584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, 
> > numparents=0x0) at pg_inherits.c:199
> > 199 forboth(lo, rels_list, li, rel_numparents)
> > #0  0x00584fb1 in find_all_inheritors (parentrelId=16393, 
> > lockmode=1, numparents=0x0) at pg_inherits.c:199
> > #1  0x0077fc9f in expand_inherited_rtentry (root=0x1badcb8, 
> > rte=0x1b630b8, rti=1) at prepunion.c:1408
> > #2  0x0077fb67 in expand_inherited_tables (root=0x1badcb8) at 
> > prepunion.c:1335
> > #3  0x00767526 in subquery_planner (glob=0x1b63cc0, 
> > parse=0x1b62fa0, parent_root=0x0, hasRecursion=0 '\000', tuple_fraction=0) 
> > at planner.c:568
> > (skipped)
> > ```
> > 
> > The first one could be easily fixed by introducing a hash table
> > (rel_id -> pgStatList entry). Perhaps hash table should be used only
> > after some threshold. Unless there are any objections I will send a
> > corresponding patch shortly.
> 
> I have never thought about this one really.
> 
> > I haven't explored the second bottleneck closely yet, but at first glance
> > it doesn't look much more complicated.
> 
> I don't know which way you're thinking of fixing this, but a planner patch
> to implement faster partition-pruning will have taken care of this, I
> think.  As you may know, even declarative partitioned tables currently
> depend on constraint exclusion for partition-pruning and planner's current
> approach of handling inheritance requires to open all the child tables
> (partitions), whereas the new approach hopefully shouldn't need to do
> that.  I am not sure if looking for a more localized fix for this would be
> worthwhile, although I may be wrong.
> 
> Thanks,
> Amit
> 
> 

-- 
Best regards,
Aleksander Alekseev
diff --git a/src/backend/postmaster/pgstat.c b/src/backend/postmaster/pgstat.c
index 2fb9a8bf58..fa906e7930 100644
--- a/src/backend/postmaster/pgstat.c
+++ b/src/backend/postmaster/pgstat.c
@@ -160,6 +160,16 @@ typedef struct TabStatusArray
 
 static TabStatusArray *pgStatTabList = NULL;
 
+/* pgStatTabHash entry */
+typedef struct TabStatHashEntry
+{
+	Oid t_id;
+	PgStat_TableStatus* tsa_entry;
+} TabStatHashEntry;
+
+/* Hash table for faster t_id -> tsa_entry lookup */
+static HTAB *pgStatTabHash = NULL;
+
 /*
  * Backends store per-function info that's waiting to be sent to the collector
  * in this hash table (indexed by function OID).
@@ -815,7 +825,13 @@ pgstat_report_stat(bool force)
 pgstat_send_tabstat(this_msg);
 this_msg->m_nentries = 0;
 			}
+
+			/*
+			 * Entry will be freed soon so we need to remove it from the lookup table.
+			 */
+			

Re: [HACKERS] Declarative partitioning optimization for large amount of partitions

2017-02-28 Thread Amit Langote
Hi,

On 2017/02/28 23:25, Aleksander Alekseev wrote:
> Hello.
> 
> I decided to figure out whether current implementation of declarative
> partitioning has any bottlenecks when there is a lot of partitions. Here
> is what I did [1].

Thanks for sharing.

> Then:
> 
> ```
> # 2580 is some pk that exists
> echo 'select * from part_test where pk = 2580;' > t.sql
> pgbench -j 7 -c 7 -f t.sql -P 1 -T 300 eax
> ```
> 
> `perf top` showed two bottlenecks [2]. A stacktrace for the first one
> looks like this [3]:
> 
> ```
> 0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') at 
> pgstat.c:1689
> 1689  if (entry->t_id == rel_id)
> #0  0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') 
> at pgstat.c:1689
> #1  0x007a4275 in pgstat_initstats (rel=0x7f4af3fd41f8) at 
> pgstat.c:1666
> #2  0x004c7090 in relation_open (relationId=25696, lockmode=0) at 
> heapam.c:1137
> #3  0x004c72c9 in heap_open (relationId=25696, lockmode=0) at 
> heapam.c:1291
> (skipped)
> ```
> 
> And here is a stacktrace for the second bottleneck [4]:
> 
> ```
> 0x00584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, 
> numparents=0x0) at pg_inherits.c:199
> 199   forboth(lo, rels_list, li, rel_numparents)
> #0  0x00584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, 
> numparents=0x0) at pg_inherits.c:199
> #1  0x0077fc9f in expand_inherited_rtentry (root=0x1badcb8, 
> rte=0x1b630b8, rti=1) at prepunion.c:1408
> #2  0x0077fb67 in expand_inherited_tables (root=0x1badcb8) at 
> prepunion.c:1335
> #3  0x00767526 in subquery_planner (glob=0x1b63cc0, parse=0x1b62fa0, 
> parent_root=0x0, hasRecursion=0 '\000', tuple_fraction=0) at planner.c:568
> (skipped)
> ```
> 
> The first one could be easily fixed by introducing a hash table
> (rel_id -> pgStatList entry). Perhaps hash table should be used only
> after some threshold. Unless there are any objections I will send a
> corresponding patch shortly.

I have never thought about this one really.

> I haven't explored the second bottleneck closely yet, but at first glance
> it doesn't look much more complicated.

I don't know which way you're thinking of fixing this, but a planner patch
to implement faster partition-pruning will have taken care of this, I
think.  As you may know, even declarative partitioned tables currently
depend on constraint exclusion for partition-pruning and planner's current
approach of handling inheritance requires to open all the child tables
(partitions), whereas the new approach hopefully shouldn't need to do
that.  I am not sure if looking for a more localized fix for this would be
worthwhile, although I may be wrong.

Thanks,
Amit






[HACKERS] Declarative partitioning optimization for large amount of partitions

2017-02-28 Thread Aleksander Alekseev
Hello.

I decided to figure out whether the current implementation of declarative
partitioning has any bottlenecks when there are a lot of partitions. Here
is what I did [1].

```
-- init schema

\timing on

CREATE TABLE part_test (pk int not null, k int, v varchar(128)) PARTITION BY 
RANGE(pk);

do $$
declare
i integer;
begin
for i in 1 .. 1
loop
raise notice 'i = %', i;
execute ('CREATE TABLE part_test_' || i ||
 ' PARTITION OF part_test FOR VALUES FROM (' ||
 (1 + (i-1)*1000) || ') to (' || ( (i * 1000) + 1) || ');'
);
end loop;
end $$;

-- fill tables with some data

do $$
declare
i integer;
begin
for i in 1 .. 100*1000
loop
raise notice 'i = %', i;
execute ('insert into part_test values ( ceil(random()*(1-1)*1000), 
ceil(random()*1*1000),  || ceil(random()*1*1000) );');
end loop;
end $$;
```

Then:

```
# 2580 is some pk that exists
echo 'select * from part_test where pk = 2580;' > t.sql
pgbench -j 7 -c 7 -f t.sql -P 1 -T 300 eax
```

`perf top` showed two bottlenecks [2]. A stacktrace for the first one
looks like this [3]:

```
0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') at 
pgstat.c:1689
1689if (entry->t_id == rel_id)
#0  0x007a42e2 in get_tabstat_entry (rel_id=25696, isshared=0 '\000') 
at pgstat.c:1689
#1  0x007a4275 in pgstat_initstats (rel=0x7f4af3fd41f8) at pgstat.c:1666
#2  0x004c7090 in relation_open (relationId=25696, lockmode=0) at 
heapam.c:1137
#3  0x004c72c9 in heap_open (relationId=25696, lockmode=0) at 
heapam.c:1291
(skipped)
```

And here is a stacktrace for the second bottleneck [4]:

```
0x00584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, 
numparents=0x0) at pg_inherits.c:199
199 forboth(lo, rels_list, li, rel_numparents)
#0  0x00584fb1 in find_all_inheritors (parentrelId=16393, lockmode=1, 
numparents=0x0) at pg_inherits.c:199
#1  0x0077fc9f in expand_inherited_rtentry (root=0x1badcb8, 
rte=0x1b630b8, rti=1) at prepunion.c:1408
#2  0x0077fb67 in expand_inherited_tables (root=0x1badcb8) at 
prepunion.c:1335
#3  0x00767526 in subquery_planner (glob=0x1b63cc0, parse=0x1b62fa0, 
parent_root=0x0, hasRecursion=0 '\000', tuple_fraction=0) at planner.c:568
(skipped)
```

The first one could easily be fixed by introducing a hash table
(rel_id -> pgStatList entry). Perhaps the hash table should be used only
after some threshold is reached. Unless there are any objections, I will
send a corresponding patch shortly.
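
Something along these lines (just a rough sketch to illustrate the idea,
helper name made up; not the actual patch yet):

```
/*
 * Sketch: an auxiliary dynahash that maps a relation OID to its slot in
 * the existing pgStatTabList, so the lookup becomes O(1).
 */
typedef struct TabStatHashEntry
{
	Oid			t_id;			/* relation OID, hash key */
	PgStat_TableStatus *tsa_entry;	/* corresponding pgStatTabList slot */
} TabStatHashEntry;

static HTAB *pgStatTabHash = NULL;

static void
create_tabstat_hash(void)
{
	HASHCTL		ctl;

	memset(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(Oid);
	ctl.entrysize = sizeof(TabStatHashEntry);
	ctl.hcxt = TopMemoryContext;

	pgStatTabHash = hash_create("pgstat rel_id -> entry lookup table",
								TABSTAT_QUANTUM, &ctl,
								HASH_ELEM | HASH_BLOBS | HASH_CONTEXT);
}
```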

I haven't explored the second bottleneck closely yet, but at first glance
it doesn't look much more complicated.

Please don't hesitate to share your thoughts regarding this matter.

[1] http://afiskon.ru/s/e3/5f47af9102_benchmark.txt
[2] http://afiskon.ru/s/00/2008c4ae66_temp.png
[3] http://afiskon.ru/s/23/650f0afc89_stack.txt
[4] http://afiskon.ru/s/03/a7e685a4db_stack2.txt

-- 
Best regards,
Aleksander Alekseev

