Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Thu, 9 Nov 2023 at 21:48, Heikki Linnakangas wrote:
>
> On 18/09/2023 07:08, David Rowley wrote:
> > On Fri, 15 Sept 2023 at 22:37, Heikki Linnakangas wrote:
> >>> I've added a call to LockAssertNoneHeld(false) in there.
> >>
> >> I don't see it in the patch?
> >
> > hmm. I must've git format-patch before committing that part.
> >
> > I'll try that again... see attached.
>
> This needed a rebase after my ResourceOwner refactoring. Attached.
>
> A few quick comments:
>
> - It would be nice to add a test for the issue that you fixed in patch
> v7, i.e. if you prepare a transaction while holding session-level locks.
>
> - GrantLockLocal() now calls MemoryContextAlloc(), which can fail if you
> are out of memory. Is that handled gracefully or is the lock leaked?

CFBot shows that one of the tests has aborted at [1] with:

[20:54:28.535] Core was generated by `postgres: subscriber: logical replication apply worker for subscription 16397 '.
[20:54:28.535] Program terminated with signal SIGABRT, Aborted.
[20:54:28.535] #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
[20:54:28.535] Download failed: Invalid argument. Continuing without source file ./signal/../sysdeps/unix/sysv/linux/raise.c.
[20:54:28.627]
[20:54:28.627] Thread 1 (Thread 0x7f0ea02d1a40 (LWP 50984)):
[20:54:28.627] #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
...
...
[20:54:28.627] #2  0x5618e989d62f in ExceptionalCondition (conditionName=conditionName@entry=0x5618e9b40f70 "dlist_is_empty(&(MyProc->myProcLocks[i]))", fileName=fileName@entry=0x5618e9b40ec0 "../src/backend/storage/lmgr/proc.c", lineNumber=lineNumber@entry=856) at ../src/backend/utils/error/assert.c:66
[20:54:28.627] No locals.
[20:54:28.627] #3  0x5618e95e6847 in ProcKill (code=<optimized out>, arg=<optimized out>) at ../src/backend/storage/lmgr/proc.c:856
[20:54:28.627]         i = <optimized out>
[20:54:28.627]         proc = <optimized out>
[20:54:28.627]         procgloballist = <optimized out>
[20:54:28.627]         __func__ = "ProcKill"
[20:54:28.627] #4  0x5618e959ebcc in shmem_exit (code=code@entry=1) at ../src/backend/storage/ipc/ipc.c:276
[20:54:28.627]         __func__ = "shmem_exit"
[20:54:28.627] #5  0x5618e959ecd0 in proc_exit_prepare (code=code@entry=1) at ../src/backend/storage/ipc/ipc.c:198
[20:54:28.627]         __func__ = "proc_exit_prepare"
[20:54:28.627] #6  0x5618e959ee8e in proc_exit (code=code@entry=1) at ../src/backend/storage/ipc/ipc.c:111
[20:54:28.627]         __func__ = "proc_exit"
[20:54:28.627] #7  0x5618e94aa54d in BackgroundWorkerMain () at ../src/backend/postmaster/bgworker.c:805
[20:54:28.627]         local_sigjmp_buf = {{__jmpbuf = {94665009627112, -3865857745677845768, 0, 0, 140732736634980, 1, 3865354362587970296, 7379258256398875384}, __mask_was_saved = 1, __saved_mask = {__val = {18446744066192964099, 94665025527920, 94665025527920, 94665025527920, 0, 94665025528120, 8192, 1, 94664997686410, 94665009627040, 94664997622076, 94665025527920, 1, 0, 0, 140732736634980
[20:54:28.627]         worker = 0x5618eb37c570
[20:54:28.627]         entrypt = <optimized out>
[20:54:28.627]         __func__ = "BackgroundWorkerMain"
[20:54:28.627] #8  0x5618e94b495c in do_start_bgworker (rw=rw@entry=0x5618eb3b73c8) at ../src/backend/postmaster/postmaster.c:5697
[20:54:28.627]         worker_pid = <optimized out>
[20:54:28.627]         __func__ = "do_start_bgworker"
[20:54:28.627] #9  0x5618e94b4c32 in maybe_start_bgworkers () at ../src/backend/postmaster/postmaster.c:5921
[20:54:28.627]         rw = 0x5618eb3b73c8
[20:54:28.627]         num_launched = 0
[20:54:28.627]         now = 0
[20:54:28.627]         iter = {cur = 0x5618eb3b79a8, next = 0x5618eb382a20, prev = 0x5618ea44a980}
[20:54:28.627] #10 0x5618e94b574a in process_pm_pmsignal () at ../src/backend/postmaster/postmaster.c:5073
[20:54:28.627]         __func__ = "process_pm_pmsignal"
[20:54:28.627] #11 0x5618e94b5f4a in ServerLoop () at ../src/backend/postmaster/postmaster.c:1760

[1] - https://cirrus-ci.com/task/5118173163290624?logs=cores#L51

Regards,
Vignesh
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 18/09/2023 07:08, David Rowley wrote:
> On Fri, 15 Sept 2023 at 22:37, Heikki Linnakangas wrote:
>>> I've added a call to LockAssertNoneHeld(false) in there.
>>
>> I don't see it in the patch?
>
> hmm. I must've git format-patch before committing that part.
>
> I'll try that again... see attached.

This needed a rebase after my ResourceOwner refactoring. Attached.

A few quick comments:

- It would be nice to add a test for the issue that you fixed in patch
v7, i.e. if you prepare a transaction while holding session-level locks.

- GrantLockLocal() now calls MemoryContextAlloc(), which can fail if you
are out of memory. Is that handled gracefully or is the lock leaked?

--
Heikki Linnakangas
Neon (https://neon.tech)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 74ce5f9491c..8c8585be7ab 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2415,6 +2415,7 @@ PrepareTransaction(void)
 	TransactionId xid = GetCurrentTransactionId();
 	GlobalTransaction gxact;
 	TimestampTz prepared_at;
+	HTAB	   *sessionandxactlocks;
 
 	Assert(!IsInParallelMode());
 
@@ -2560,7 +2561,17 @@ PrepareTransaction(void)
 	StartPrepare(gxact);
 
 	AtPrepare_Notify();
-	AtPrepare_Locks();
+
+	/*
+	 * Prepare the locks and save the returned hash table that describes if
+	 * the lock is held at the session and/or transaction level.  We need to
+	 * know if we're dealing with session locks inside PostPrepare_Locks(),
+	 * but we're unable to build the hash table there due to that function
+	 * only discovering if we're dealing with a session lock while we're in a
+	 * critical section, in which case we can't allocate memory for the hash
+	 * table.
+	 */
+	sessionandxactlocks = AtPrepare_Locks();
 	AtPrepare_PredicateLocks();
 	AtPrepare_PgStat();
 	AtPrepare_MultiXact();
@@ -2587,7 +2598,10 @@ PrepareTransaction(void)
 	 * ProcArrayClearTransaction().  Otherwise, a GetLockConflicts() would
 	 * conclude "xact already committed or aborted" for our locks.
 	 */
-	PostPrepare_Locks(xid);
+	PostPrepare_Locks(xid, sessionandxactlocks);
+
+	/* We no longer need this hash table */
+	hash_destroy(sessionandxactlocks);
 
 	/*
 	 * Let others know about no transaction in progress by me. This has to be
diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index 296dc82d2ee..5baf83ac6ce 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -71,7 +71,12 @@ DiscardAll(bool isTopLevel)
 	ResetAllOptions();
 	DropAllPreparedStatements();
 	Async_UnlistenAll();
-	LockReleaseAll(USER_LOCKMETHOD, true);
+	LockReleaseSession(USER_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+	LockAssertNoneHeld(false);
+#endif
+
 	ResetPlanCache();
 	ResetTempTableNamespace();
 	ResetSequenceCaches();
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 501910b4454..835f112f751 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -841,7 +841,11 @@ logicalrep_worker_onexit(int code, Datum arg)
 	 * The locks will be acquired once the worker is initialized.
 	 */
 	if (!InitializingApplyWorker)
-		LockReleaseAll(DEFAULT_LOCKMETHOD, true);
+		LockReleaseSession(DEFAULT_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+	LockAssertNoneHeld(false);
+#endif
 
 	ApplyLauncherWakeup();
 }
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 45de0fd2bd6..e7e0f29347a 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -182,12 +182,6 @@ holdMask -
 subset of the PGPROC object's heldLocks mask (if the PGPROC is
 currently waiting for another lock mode on this lock).
 
-releaseMask -
-A bitmask for the lock modes due to be released during LockReleaseAll.
-This must be a subset of the holdMask.  Note that it is modified without
-taking the partition LWLock, and therefore it is unsafe for any
-backend except the one owning the PROCLOCK to examine/change it.
-
 lockLink -
 List link for shared memory queue of all the PROCLOCK objects for the
 same LOCK.
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b8c57b3e16e..ce54b589b0d 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -22,8 +22,7 @@
  * Interface:
  *
  *	InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(),
- *	LockAcquire(), LockRelease(), LockReleaseAll(),
- *	LockCheckConflicts(), GrantLock()
+ *	LockAcquire(), LockRelease(), LockCheckConflicts(), GrantLock()
 *
 *-
 */
@@ -162,6 +161,16 @@ typedef struct TwoPhaseLockRecord
 	LOCKMODE	lockmode;
 } TwoPhaseLockRecord;
 
+/*
+ * Used by for a hash table entry type in AtPrepare_Locks() to communicate the
+ * session/xact lock status of each held lock for use in PostPrepare_Locks().
+ */
+typedef struct
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Fri, 15 Sept 2023 at 22:37, Heikki Linnakangas wrote:
> > I've added a call to LockAssertNoneHeld(false) in there.
>
> I don't see it in the patch?

hmm. I must've git format-patch before committing that part.

I'll try that again... see attached.

David

v8-0001-wip-resowner-lock-release-all.patch
Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 11/09/2023 15:00, David Rowley wrote:
> On Wed, 5 Jul 2023 at 21:44, Heikki Linnakangas wrote:
>> index 296dc82d2ee..edb8b6026e5 100644
>> --- a/src/backend/commands/discard.c
>> +++ b/src/backend/commands/discard.c
>> @@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
>>  	Async_UnlistenAll();
>> -	LockReleaseAll(USER_LOCKMETHOD, true);
>> +	LockReleaseSession(USER_LOCKMETHOD);
>>  	ResetPlanCache();
>>
>> This assumes that there are no transaction-level advisory locks. I think
>> that's OK. It took me a while to convince myself of that, though. I
>> think we need a high level comment somewhere that explains what
>> assumptions we make on which locks can be held in session mode and which
>> in transaction mode.
>
> Isn't it ok because DISCARD ALL cannot run inside a transaction block,
> so there should be no locks taken apart from possibly session-level locks?

Hmm, sounds valid. I think I convinced myself that it's OK through some
other reasoning, but I don't remember it now.

> I've added a call to LockAssertNoneHeld(false) in there.

I don't see it in the patch?

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Thank you for having a look at this. Apologies for not getting back to
you sooner.

On Wed, 5 Jul 2023 at 21:44, Heikki Linnakangas wrote:
> On 10/02/2023 04:51, David Rowley wrote:
> > I've attached another set of patches. I do need to spend longer
> > looking at this. I'm mainly attaching these as CI seems to be
> > highlighting a problem that I'm unable to recreate locally and I
> > wanted to see if the attached fixes it.
>
> I like this patch's approach.
>
> > index 296dc82d2ee..edb8b6026e5 100644
> > --- a/src/backend/commands/discard.c
> > +++ b/src/backend/commands/discard.c
> > @@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
> >  	Async_UnlistenAll();
> > -	LockReleaseAll(USER_LOCKMETHOD, true);
> > +	LockReleaseSession(USER_LOCKMETHOD);
> >  	ResetPlanCache();
>
> This assumes that there are no transaction-level advisory locks. I think
> that's OK. It took me a while to convince myself of that, though. I
> think we need a high level comment somewhere that explains what
> assumptions we make on which locks can be held in session mode and which
> in transaction mode.

Isn't it ok because DISCARD ALL cannot run inside a transaction block,
so there should be no locks taken apart from possibly session-level locks?

I've added a call to LockAssertNoneHeld(false) in there.

> > @@ -3224,14 +3206,6 @@ PostPrepare_Locks(TransactionId xid)
> >  		Assert(lock->nGranted <= lock->nRequested);
> >  		Assert((proclock->holdMask & ~lock->grantMask) == 0);
> >
> > -		/* Ignore it if nothing to release (must be a session lock) */
> > -		if (proclock->releaseMask == 0)
> > -			continue;
> > -
> > -		/* Else we should be releasing all locks */
> > -		if (proclock->releaseMask != proclock->holdMask)
> > -			elog(PANIC, "we seem to have dropped a bit somewhere");
> > -
> > 		/*
> > 		 * We cannot simply modify proclock->tag.myProc to reassign
> > 		 * ownership of the lock, because that's part of the hash key and
>
> This looks wrong. If you prepare a transaction that is holding any
> session locks, we will now transfer them to the prepared transaction.
> And its locallock entry will be out of sync. To fix, I think we could
> keep around the hash table that CheckForSessionAndXactLocks() builds,
> and use that here.

Good catch. I've modified the patch to keep the hash table built in
CheckForSessionAndXactLocks around for longer so that we can check for
session locks.

I've attached an updated patch, mainly to get CI checking this. I
suspect something is wrong as subscription/015_stream is timing out.
I've not gotten to the bottom of that yet.

David

v7-0001-wip-resowner-lock-release-all.patch
Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 10/02/2023 04:51, David Rowley wrote:
> I've attached another set of patches. I do need to spend longer
> looking at this. I'm mainly attaching these as CI seems to be
> highlighting a problem that I'm unable to recreate locally and I
> wanted to see if the attached fixes it.

I like this patch's approach.

> index 296dc82d2ee..edb8b6026e5 100644
> --- a/src/backend/commands/discard.c
> +++ b/src/backend/commands/discard.c
> @@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
>  	Async_UnlistenAll();
> -	LockReleaseAll(USER_LOCKMETHOD, true);
> +	LockReleaseSession(USER_LOCKMETHOD);
>  	ResetPlanCache();

This assumes that there are no transaction-level advisory locks. I think
that's OK. It took me a while to convince myself of that, though. I
think we need a high level comment somewhere that explains what
assumptions we make on which locks can be held in session mode and which
in transaction mode.

> @@ -3224,14 +3206,6 @@ PostPrepare_Locks(TransactionId xid)
>  		Assert(lock->nGranted <= lock->nRequested);
>  		Assert((proclock->holdMask & ~lock->grantMask) == 0);
>
> -		/* Ignore it if nothing to release (must be a session lock) */
> -		if (proclock->releaseMask == 0)
> -			continue;
> -
> -		/* Else we should be releasing all locks */
> -		if (proclock->releaseMask != proclock->holdMask)
> -			elog(PANIC, "we seem to have dropped a bit somewhere");
> -
> 		/*
> 		 * We cannot simply modify proclock->tag.myProc to reassign
> 		 * ownership of the lock, because that's part of the hash key and

This looks wrong. If you prepare a transaction that is holding any
session locks, we will now transfer them to the prepared transaction.
And its locallock entry will be out of sync. To fix, I think we could
keep around the hash table that CheckForSessionAndXactLocks() builds,
and use that here.

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Thanks for having a look at this.

On Wed, 1 Feb 2023 at 03:07, Amit Langote wrote:
> Maybe you're planning to do it once this patch is post the PoC phase
> (isn't it?), but it would be helpful to have commentary on all the new
> dlist fields.

I've added comments on the new fields. Maybe we can say the patch is "wip".

> This seems to be replacing what is a cache with an upper limit on the
> number of cached locks with something that has no limit on how many
> per-owner locks are remembered. I wonder whether we'd be doing
> additional work in some cases with the new no-limit implementation
> that wasn't being done before (where the owner's locks array is
> overflowed) or maybe not much because the new implementation of
> ResourceOwner{Remember|Forget}Lock() is simple push/delete of a dlist
> node from the owner's dlist?

It's a good question. I don't really have a good test to find out; the
problem is we'd need to benchmark taking fewer than 16 locks. On trying
that, I find that there's just too much variability in the performance
between runs to determine if there's any slowdown.

$ cat 10_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,10) x;

$ pgbench -f 10_locks.sql@1000 -M prepared -T 10 -n postgres | grep -E "(tps)"
tps = 47809.306088 (without initial connection time)
tps = 66859.789072 (without initial connection time)
tps = 37885.924616 (without initial connection time)

On trying with more locks, I see there are good wins from the patched version.

$ cat 100_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,100) x;

$ cat 1k_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,1000) x;

$ cat 10k_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,10000) x;

Test 1: Take 100 locks but periodically take 10k locks to bloat the
local lock table.

master:
$ pgbench -f 100_locks.sql@1000 -f 10k_locks.sql@1 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
transaction type: multiple scripts
tps = 2726.197037 (without initial connection time)
SQL script 1: 100_locks.sql
 - 27219 transactions (99.9% of total, tps = 2722.496227)
SQL script 2: 10k_locks.sql
 - 37 transactions (0.1% of total, tps = 3.700810)

patched:
$ pgbench -f 100_locks.sql@1000 -f 10k_locks.sql@1 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
transaction type: multiple scripts
tps = 34047.297822 (without initial connection time)
SQL script 1: 100_locks.sql
 - 340039 transactions (99.9% of total, tps = 34012.688879)
SQL script 2: 10k_locks.sql
 - 346 transactions (0.1% of total, tps = 34.608943)

patched without slab context:
$ pgbench -f 100_locks.sql@1000 -f 10k_locks.sql@1 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
transaction type: multiple scripts
tps = 34851.770846 (without initial connection time)
SQL script 1: 100_locks.sql
 - 348097 transactions (99.9% of total, tps = 34818.662324)
SQL script 2: 10k_locks.sql
 - 331 transactions (0.1% of total, tps = 33.108522)

Test 2: Always take just 100 locks and don't bloat the local lock table.

master:
$ pgbench -f 100_locks.sql@1000 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
tps = 32682.491548 (without initial connection time)

patched:
$ pgbench -f 100_locks.sql@1000 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
tps = 35637.241815 (without initial connection time)

patched without slab context:
$ pgbench -f 100_locks.sql@1000 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
tps = 36192.185181 (without initial connection time)

The attached 0003 patch is an experiment to see if using a slab memory
context has any advantages for storing the LOCALLOCKOWNER structs. There
seems to be a small performance hit from doing this.

> The following comment is now obsolete:
>
> /*
>  * LockReassignCurrentOwner
>  *		Reassign all locks belonging to CurrentResourceOwner to belong
>  *		to its parent resource owner.
>  *
>  * If the caller knows what those locks are, it can pass them as an array.
>  * That speeds up the call significantly, when a lot of locks are held
>  * (e.g pg_dump with a large schema). Otherwise, pass NULL for locallocks,
>  * and we'll traverse through our hash table to find them.
>  */

I've removed the obsolete part.

I've attached another set of patches. I do need to spend longer looking
at this. I'm mainly attaching these as CI seems to be highlighting a
problem that I'm unable to recreate locally and I wanted to see if the
attached fixes it.

David

From 4a546ad6d33e544dd872b23a94925a262088cd9a Mon Sep 17 00:00:00 2001
From: David Rowley
Date: Tue, 24 Jan 2023 16:00:58 +1300
Subject: [PATCH v6 1/3] wip-resowner-lock-release-all.

---
 src/backend/commands/discard.c             |   2 +-
 src/backend/replication/logical/launcher.c |   2 +-
 src/backend/storage/lmgr/README            |   6 -
 src/backend/storage/lmgr/lock.c            | 650 ++---
 src/backend/storage/lmgr/proc.c            |  17 +-
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi,

On 2023-01-24 16:57:37 +1300, David Rowley wrote:
> I've attached a rebased patch.

Looks like there's some issue causing tests to fail probabilistically:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3501

Several failures are when testing a 32bit build.

> While reading over this again, I wondered if instead of allocating the
> memory for the LOCALLOCKOWNER in TopMemoryContext, maybe we should
> create a Slab context as a child of TopMemoryContext and perform the
> allocations there.

Yes, that does make sense.

> I would like to get this LockReleaseAll problem finally fixed in PG16,
> but I'd feel much better about this patch if it had some review from
> someone who has more in-depth knowledge of the locking code.

I feel my review wouldn't be independent, but I'll give it a shot if
nobody else does.

Greetings,

Andres Freund
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi David,

On Tue, Jan 24, 2023 at 12:58 PM David Rowley wrote:
> On Fri, 20 Jan 2023 at 00:26, vignesh C wrote:
> > CFBot shows some compilation errors as in [1], please post an updated
> > version for the same:
>
> I've attached a rebased patch.

Thanks for the new patch.

Maybe you're planning to do it once this patch is post the PoC phase
(isn't it?), but it would be helpful to have commentary on all the new
dlist fields. Especially, I think the following warrants a bit more
explanation than the others:

-	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
-	int			nlocks;			/* number of owned locks */
-	LOCALLOCK  *locks[MAX_RESOWNER_LOCKS];	/* list of owned locks */
+	dlist_head	locks;			/* dlist of owned locks */

This seems to be replacing what is a cache with an upper limit on the
number of cached locks with something that has no limit on how many
per-owner locks are remembered. I wonder whether we'd be doing
additional work in some cases with the new no-limit implementation that
wasn't being done before (where the owner's locks array is overflowed),
or maybe not much, because the new implementation of
ResourceOwner{Remember|Forget}Lock() is a simple push/delete of a dlist
node from the owner's dlist?

The following comment is now obsolete:

/*
 * LockReassignCurrentOwner
 *		Reassign all locks belonging to CurrentResourceOwner to belong
 *		to its parent resource owner.
 *
 * If the caller knows what those locks are, it can pass them as an array.
 * That speeds up the call significantly, when a lot of locks are held
 * (e.g pg_dump with a large schema). Otherwise, pass NULL for locallocks,
 * and we'll traverse through our hash table to find them.
 */

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Fri, 20 Jan 2023 at 00:26, vignesh C wrote:
> CFBot shows some compilation errors as in [1], please post an updated
> version for the same:

I've attached a rebased patch.

While reading over this again, I wondered if instead of allocating the
memory for the LOCALLOCKOWNER in TopMemoryContext, maybe we should
create a Slab context as a child of TopMemoryContext and perform the
allocations there. I feel like slab might be a better option here as
it'll use slightly less memory due to it not rounding up allocations to
the next power of 2. sizeof(LOCALLOCKOWNER) == 56, so it's not a great
deal of memory, but more than nothing.

The primary reason that I think this might be a good idea is mostly
around better handling of chunk-on-block fragmentation in slab.c than
aset.c. If we have transactions which create a large number of locks
then we may end up growing the TopMemoryContext and never releasing the
AllocBlocks, just leaving a high number of 64-byte chunks on the
freelist that'll maybe never be used again. I'm thinking slab.c might
handle that better as it'll only keep around 10 completely empty
SlabBlocks before it'll start free'ing them. The slab allocator is quite
a bit faster now as a result of d21ded75f.

I would like to get this LockReleaseAll problem finally fixed in PG16,
but I'd feel much better about this patch if it had some review from
someone who has more in-depth knowledge of the locking code.

I've also gone and adjusted all the places that upgraded the
elog(WARNING)s of local lock table corruption to PANIC and put them back
to use WARNING again. While I think it might be a good idea to do that,
it seems to be adding a bit more resistance to this patch which I don't
think it really needs. Maybe we can consider that in a separate effort.
David

diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index 296dc82d2e..edb8b6026e 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
 	ResetAllOptions();
 	DropAllPreparedStatements();
 	Async_UnlistenAll();
-	LockReleaseAll(USER_LOCKMETHOD, true);
+	LockReleaseSession(USER_LOCKMETHOD);
 	ResetPlanCache();
 	ResetTempTableNamespace();
 	ResetSequenceCaches();
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 564bffe5ca..20b2e3497e 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -798,7 +798,7 @@ logicalrep_worker_onexit(int code, Datum arg)
 	 * parallel apply mode and will not be released when the worker
 	 * terminates, so manually release all locks before the worker exits.
 	 */
-	LockReleaseAll(DEFAULT_LOCKMETHOD, true);
+	LockReleaseSession(DEFAULT_LOCKMETHOD);
 
 	ApplyLauncherWakeup();
 }
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index d08ec6c402..9603cc8959 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -182,12 +182,6 @@ holdMask -
 subset of the PGPROC object's heldLocks mask (if the PGPROC is
 currently waiting for another lock mode on this lock).
 
-releaseMask -
-A bitmask for the lock modes due to be released during LockReleaseAll.
-This must be a subset of the holdMask.  Note that it is modified without
-taking the partition LWLock, and therefore it is unsafe for any
-backend except the one owning the PROCLOCK to examine/change it.
-
 lockLink -
 List link for shared memory queue of all the PROCLOCK objects for the
 same LOCK.
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 49d62a0dc7..9d9d27e0c9 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -22,8 +22,7 @@
 * Interface:
 *
 *	InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(),
- *	LockAcquire(), LockRelease(), LockReleaseAll(),
- *	LockCheckConflicts(), GrantLock()
+ *	LockAcquire(), LockRelease(), LockCheckConflicts(), GrantLock()
 *
 *-
 */
@@ -289,6 +288,9 @@ static LOCALLOCK *awaitedLock;
 static ResourceOwner awaitedOwner;
 
+static dlist_head session_locks[lengthof(LockMethods)];
+
+
 #ifdef LOCK_DEBUG
 /*--
@@ -375,8 +377,8 @@ static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
 static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
 static void FinishStrongLockAcquire(void);
 static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
-static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
-static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
+static void ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock);
+static void LockReassignOwner(LOCALLOCKOWNER *locallockowner,
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Wed, 3 Aug 2022 at 09:04, David Rowley wrote:
> On Wed, 3 Aug 2022 at 07:04, Jacob Champion wrote:
> > This entry has been waiting on author input for a while (our current
> > threshold is roughly two weeks), so I've marked it Returned with
> > Feedback.
>
> Thanks for taking care of this. You dealt with this correctly based on
> the fact that I'd failed to rebase before or during the entire July
> CF.
>
> I'm still interested in having the LockReleaseAll slowness fixed, so
> here's a rebased patch.

CFBot shows some compilation errors as in [1], please post an updated
version for the same:

[15:40:00.287] [1239/1809] Linking target src/backend/postgres
[15:40:00.287] FAILED: src/backend/postgres
[15:40:00.287] cc @src/backend/postgres.rsp
[15:40:00.287] /usr/bin/ld: src/backend/postgres_lib.a.p/replication_logical_launcher.c.o: in function `logicalrep_worker_onexit':
[15:40:00.287] /tmp/cirrus-ci-build/build/../src/backend/replication/logical/launcher.c:773: undefined reference to `LockReleaseAll'
[15:40:00.287] collect2: error: ld returned 1 exit status

[1] - https://cirrus-ci.com/task/4562493863886848

Regards,
Vignesh
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Thank you for looking at the patch.

On Fri, 4 Nov 2022 at 04:43, Ankit Kumar Pandey wrote:
> I don't see any performance improvement in tests.

Are you able to share what your test was? To see a performance
improvement you're likely going to have to obtain a large number of
locks in the session so that the local lock table becomes bloated, then
continue to run some fast query and observe that LockReleaseAll has
become slower as a result of the hash table becoming bloated.

Try running pgbench with -M prepared, using a SELECT on a hash
partitioned table with a good number of partitions to look up a single
row. The reason this becomes slow is that the planner will try a
generic plan on the 6th execution, which will lock every partition and
bloat the local lock table. From then on it will use a custom plan
which only locks a single leaf partition.

I just tried the following:

$ pgbench -i --partition-method=hash --partitions=1000 postgres

Master:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 21286.172326 (without initial connection time)

Patched:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 23034.063261 (without initial connection time)

If I try again with 10,000 partitions, I get:

Master:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 13044.290903 (without initial connection time)

Patched:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 22683.545686 (without initial connection time)

David
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi David,

This is a review of the "speed up releasing of locks" patch.

Contents & Purpose:
The Subject line is missing from the patch. It would have been easier to
understand its purpose had it been included. The patch includes a change
to the README, but no new tests are included.

Initial Run:
The patch applies cleanly to HEAD. The regression tests all pass
successfully against the new patch.

Nitpicking & conclusion:
I don't see any performance improvement in tests. Lots of comments were
removed which were not fully replaced. The change of the log level for
"ReleaseLockIfHeld: failed" from WARNING to PANIC is a mystery.

The change in the README doesn't look right:
`Any subsequent lockers are share lockers wait waiting for the VXID to terminate via some other method) is for deadlock`.
This sentence could be rewritten. Also, more comments could be added to
explain the new methods added.

Thanks,
Ankit
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Wed, 3 Aug 2022 at 07:04, Jacob Champion wrote: > This entry has been waiting on author input for a while (our current > threshold is roughly two weeks), so I've marked it Returned with > Feedback. Thanks for taking care of this. You dealt with this correctly based on the fact that I'd failed to rebase before or during the entire July CF. I'm still interested in having the LockReleaseAll slowness fixed, so here's a rebased patch. David diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c index c583539e0c..2c43b9e0c8 100644 --- a/src/backend/commands/discard.c +++ b/src/backend/commands/discard.c @@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel) ResetAllOptions(); DropAllPreparedStatements(); Async_UnlistenAll(); - LockReleaseAll(USER_LOCKMETHOD, true); + LockReleaseSession(USER_LOCKMETHOD); ResetPlanCache(); ResetTempTableNamespace(); ResetSequenceCaches(); diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README index d08ec6c402..563ba681e5 100644 --- a/src/backend/storage/lmgr/README +++ b/src/backend/storage/lmgr/README @@ -182,12 +182,6 @@ holdMask - subset of the PGPROC object's heldLocks mask (if the PGPROC is currently waiting for another lock mode on this lock). -releaseMask - -A bitmask for the lock modes due to be released during LockReleaseAll. -This must be a subset of the holdMask. Note that it is modified without -taking the partition LWLock, and therefore it is unsafe for any -backend except the one owning the PROCLOCK to examine/change it. - lockLink - List link for shared memory queue of all the PROCLOCK objects for the same LOCK. @@ -321,8 +315,7 @@ and will notice any weak lock we take when it does. Fast-path VXID locks do not use the FastPathStrongRelationLocks table. The first lock taken on a VXID is always the ExclusiveLock taken by its owner. -Any subsequent lockers are share lockers waiting for the VXID to terminate. 
-Indeed, the only reason VXID locks use the lock manager at all (rather than +Any subsequent lockers are share lockers waiting for the VXID to terminate via some other method) is for deadlock detection. Thus, the initial VXID lock can *always* be taken via the fast path without checking for conflicts. Any subsequent locker must check diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c index 5f5803f681..381a2527f7 100644 --- a/src/backend/storage/lmgr/lock.c +++ b/src/backend/storage/lmgr/lock.c @@ -22,8 +22,7 @@ * Interface: * * InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(), - * LockAcquire(), LockRelease(), LockReleaseAll(), - * LockCheckConflicts(), GrantLock() + * LockAcquire(), LockRelease(), LockCheckConflicts(), GrantLock() * *- */ @@ -289,6 +288,9 @@ static LOCALLOCK *awaitedLock; static ResourceOwner awaitedOwner; +static dlist_head session_locks[lengthof(LockMethods)]; + + #ifdef LOCK_DEBUG /*-- @@ -375,8 +377,9 @@ static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner); static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode); static void FinishStrongLockAcquire(void); static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner); -static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock); -static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent); +static void ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock); +static void LockReassignOwner(LOCALLOCKOWNER *locallockowner, + ResourceOwner parent); static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode, PROCLOCK *proclock, LockMethod lockMethodTable); static void CleanUpLock(LOCK *lock, PROCLOCK *proclock, @@ -477,6 +480,10 @@ InitLocks(void) 16, &info, HASH_ELEM | HASH_BLOBS); + + /* Initialize each element of the session_locks array */ + for (int i = 0; i < lengthof(LockMethods); i++) + dlist_init(&session_locks[i]); } @@ -701,7 +708,7 @@ LockHasWaiters(const LOCKTAG *locktag, LOCKMODE
lockmode, bool sessionLock) { PROCLOCK_PRINT("LockHasWaiters: WRONGTYPE", proclock); LWLockRelease(partitionLock); - elog(WARNING, "you don't own a lock of type %s", + elog(PANIC, "you don't own a lock of type %s", lockMethodTable->lockModeNames[lockmode]); RemoveLocalLock(locallock); return false; @@ -839,26 +846,9 @@ LockAcquireExtended(const LOCKTAG
Re: Speed up transaction completion faster after many relations are accessed in a transaction
This entry has been waiting on author input for a while (our current threshold is roughly two weeks), so I've marked it Returned with Feedback. Once you think the patchset is ready for review again, you (or any interested party) can resurrect the patch entry by visiting https://commitfest.postgresql.org/38/3501/ and changing the status to "Needs Review", and then changing the status again to "Move to next CF". (Don't forget the second step; hopefully we will have streamlined this in the near future!) Thanks, --Jacob
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Wed, 6 Apr 2022 at 03:40, Yura Sokolov wrote: > I'm looking on patch and don't get some moments. > > `GrantLockLocal` allocates `LOCALLOCKOWNER` and links it into > `locallock->locallockowners`. It links it regardless `owner` could be > NULL. But then `RemoveLocalLock` does `Assert(locallockowner->owner != > NULL);`. > Why it should not fail? > > `GrantLockLocal` allocates `LOCALLOCKOWNER` in `TopMemoryContext`. > But there is single `pfree(locallockowner)` in `LockReassignOwner`. > Looks like there should be more `pfree`. Shouldn't they? > > `GrantLockLocal` does `dlist_push_tail`, but isn't it better to > do `dlist_push_head`? Resource owners usually form stack, so usually > when owner searches for itself it is last added to list. > Then `dlist_foreach` will find it sooner if it were added to the head. Thanks for having a look at this. It's a bit unrealistic for me to get a look at addressing these for v15. I've pushed this one out to the next CF. David
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Good day, David. I'm looking on patch and don't get some moments. `GrantLockLocal` allocates `LOCALLOCKOWNER` and links it into `locallock->locallockowners`. It links it regardless `owner` could be NULL. But then `RemoveLocalLock` does `Assert(locallockowner->owner != NULL);`. Why it should not fail? `GrantLockLocal` allocates `LOCALLOCKOWNER` in `TopMemoryContext`. But there is single `pfree(locallockowner)` in `LockReassignOwner`. Looks like there should be more `pfree`. Shouldn't they? `GrantLockLocal` does `dlist_push_tail`, but isn't it better to do `dlist_push_head`? Resource owners usually form stack, so usually when owner searches for itself it is last added to list. Then `dlist_foreach` will find it sooner if it were added to the head. regards - Yura Sokolov Postgres Professional y.soko...@postgrespro.ru funny.fal...@gmail.com
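On the dlist_push_tail vs. dlist_push_head question, the point can be demonstrated with a standalone sketch (this is not PostgreSQL's ilist.h; the struct and function names here are made up for illustration). With stack-like resource owner usage, pushing at the head puts the most recently registered owner first in iteration order, so a head-to-tail search finds it in one step:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal intrusive circular doubly-linked list, illustration only. */
typedef struct node { struct node *prev, *next; int owner_id; } node;
typedef struct { node head; } list;

static void list_init(list *l) { l->head.prev = l->head.next = &l->head; }

static void push_head(list *l, node *n)
{
    n->prev = &l->head; n->next = l->head.next;
    l->head.next->prev = n; l->head.next = n;
}

static void push_tail(list *l, node *n)
{
    n->next = &l->head; n->prev = l->head.prev;
    l->head.prev->next = n; l->head.prev = n;
}

/* Walk from the head; return how many nodes we inspect to find owner_id. */
static int steps_to_find(list *l, int owner_id)
{
    int steps = 0;
    for (node *n = l->head.next; n != &l->head; n = n->next)
    {
        steps++;
        if (n->owner_id == owner_id)
            return steps;
    }
    return -1;
}
```

With three owners pushed in order 1, 2, 3, finding the most recent owner (3) takes three comparisons after push_tail but only one after push_head, which is the gain being suggested.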
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Sat, 1 Jan 2022 at 15:40, Zhihong Yu wrote: > + locallock->nLocks -= locallockowner->nLocks; > + Assert(locallock->nLocks >= 0); > > I think the assertion is not needed since the above code is in if block : > > + if (locallockowner->nLocks < locallock->nLocks) > > the condition, locallock->nLocks >= 0, would always hold after the > subtraction. That makes sense. I've removed the Assert in the attached patch. Thanks for looking at the patch. I've also spent a bit more time on it. There were quite a few outdated comments remaining. Also, the PROCLOCK releaseMask field appears to no longer be needed. I also did a round of benchmarking on the attached patch using a very recent master. I've attached .sql files and the script I used to benchmark. With 1024 partitions, lock1.sql shows about a 4.75% performance increase. This would become larger with more partitions and smaller with fewer partitions. With the patch, lock2.sql shows about a 10% performance increase over master. lock3.sql does not seem to have changed much and lock4.sql shows a small regression with the patch of about 2.5%. I'm not quite sure how worried we should be about lock4.sql slowing down slightly. 2.5% is fairly small given how hard I'm exercising the locking code in that test. There's also nothing much to say that the slowdown is not just due to code alignment changes. I also understand that Amit L is working on another patch that will improve the situation for lock1.sql by not taking the locks on relations that will be run-time pruned at executor startup. I think it's still worth solving this regardless of Amit's patch, as with current master we still have a situation where short fast queries which access a small number of tables can become slower once the backend has obtained a large number of locks concurrently and bloated the local lock table. As for the patch, I feel it's a pretty invasive change to how we release locks and the resowner code. I'd be quite happy for some review of it.
Here are the full results as output by the attached script.

drowley@amd3990x:~$ echo master
master
drowley@amd3990x:~$ ./lockbench.sh
lock1.sql
tps = 38078.011433 (without initial connection time)
tps = 38070.016792 (without initial connection time)
tps = 39223.118769 (without initial connection time)
tps = 37510.105287 (without initial connection time)
tps = 38164.394128 (without initial connection time)
lock2.sql
tps = 247.963797 (without initial connection time)
tps = 247.374174 (without initial connection time)
tps = 248.412744 (without initial connection time)
tps = 248.192629 (without initial connection time)
tps = 248.503728 (without initial connection time)
lock3.sql
tps = 1162.937692 (without initial connection time)
tps = 1160.968689 (without initial connection time)
tps = 1166.908643 (without initial connection time)
tps = 1160.288547 (without initial connection time)
tps = 1160.336572 (without initial connection time)
lock4.sql
tps = 282.173560 (without initial connection time)
tps = 284.470330 (without initial connection time)
tps = 286.089644 (without initial connection time)
tps = 285.548487 (without initial connection time)
tps = 284.313505 (without initial connection time)
drowley@amd3990x:~$ echo Patched
Patched
drowley@amd3990x:~$ ./lockbench.sh
lock1.sql
tps = 40338.975219 (without initial connection time)
tps = 39803.433365 (without initial connection time)
tps = 39504.824194 (without initial connection time)
tps = 39843.422438 (without initial connection time)
tps = 40624.483013 (without initial connection time)
lock2.sql
tps = 274.413309 (without initial connection time)
tps = 271.978813 (without initial connection time)
tps = 275.795091 (without initial connection time)
tps = 273.628649 (without initial connection time)
tps = 273.049977 (without initial connection time)
lock3.sql
tps = 1168.557054 (without initial connection time)
tps = 1168.139469 (without initial connection time)
tps = 1166.366440 (without initial connection time)
tps = 1165.464214 (without initial connection time)
tps = 1167.250809 (without initial connection time)
lock4.sql
tps = 274.842298 (without initial connection time)
tps = 277.911394 (without initial connection time)
tps = 278.702620 (without initial connection time)
tps = 275.715606 (without initial connection time)
tps = 278.816060 (without initial connection time)

David diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c index c583539e0c..2c43b9e0c8 100644 --- a/src/backend/commands/discard.c +++ b/src/backend/commands/discard.c @@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel) ResetAllOptions(); DropAllPreparedStatements(); Async_UnlistenAll(); - LockReleaseAll(USER_LOCKMETHOD, true); + LockReleaseSession(USER_LOCKMETHOD); ResetPlanCache(); ResetTempTableNamespace(); ResetSequenceCaches(); diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README index
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Fri, Dec 31, 2021 at 5:45 PM David Rowley wrote: > On Fri, 3 Dec 2021 at 20:36, Michael Paquier wrote: > > Two months later, this has been switched to RwF. > > I was discussing this patch with Andres. He's not very keen on my > densehash hash table idea and suggested that instead of relying on > trying to make the hash table iteration faster, why don't we just > ditch the lossiness of the resource owner code which only records the > first 16 locks, and just make it have a linked list of all locks. > > This is a little more along the lines of the original patch, however, > it does not increase the size of the LOCALLOCK struct. > > I've attached a patch which does this. This was mostly written > (quickly) by Andres, I just did a cleanup, fixed up a few mistakes and > fixed a few bugs. > > I ran the same performance tests on this patch as I did back in [1]: > > -- Test 1. Select 1 record from a 140 partitioned table. Tests > creating a large number of locks with a fast query. > > create table hp (a int, b int) partition by hash(a); > select 'create table hp'||x||' partition of hp for values with > (modulus 140, remainder ' || x || ');' from generate_series(0,139)x; > create index on hp (b); > insert into hp select x,x from generate_series(1, 14) x; > analyze hp; > > select3.sql: select * from hp where b = 1 > > select3.sql master > > drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres > tps = 2099.708748 (without initial connection time) > tps = 2100.398516 (without initial connection time) > tps = 2094.882341 (without initial connection time) > tps = 2113.218820 (without initial connection time) > tps = 2104.717597 (without initial connection time) > > select3.sql patched > > drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres > tps = 2010.070738 (without initial connection time) > tps = 1994.963606 (without initial connection time) > tps = 1994.668849 (without initial connection time) > tps = 1995.948168 (without initial 
connection time) > tps = 1985.650945 (without initial connection time) > > You can see that there's a performance regression here. I've not yet > studied why this appears. > > -- Test 2. Tests a prepared query which will perform a generic plan on > the 6th execution then fallback on a custom plan due to it pruning all > but one partition. Master suffers from the lock table becoming > bloated after locking all partitions when planning the generic plan. > > create table ht (a int primary key, b int, c int) partition by hash (a); > select 'create table ht' || x::text || ' partition of ht for values > with (modulus 8192, remainder ' || (x)::text || ');' from > generate_series(0,8191) x; > \gexec > > select.sql: > \set p 1 > select * from ht where a = :p > > select.sql master > drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres > tps = 18014.460090 (without initial connection time) > tps = 17973.358889 (without initial connection time) > tps = 17847.480647 (without initial connection time) > tps = 18038.332507 (without initial connection time) > tps = 17776.143206 (without initial connection time) > > select.sql patched > drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres > tps = 32393.457106 (without initial connection time) > tps = 32277.204349 (without initial connection time) > tps = 32160.719830 (without initial connection time) > tps = 32530.038130 (without initial connection time) > tps = 32299.019657 (without initial connection time) > > You can see that there are some quite good performance gains with this > test. > > I'm going to add this to the January commitfest. 
> > David > > [1] > https://www.postgresql.org/message-id/CAKJS1f8Lt0kS4bb5EH%3DhV%2BksqBZNnmVa8jujoYBYu5PVhWbZZg%40mail.gmail.com Hi, + locallock->nLocks -= locallockowner->nLocks; + Assert(locallock->nLocks >= 0); I think the assertion is not needed since the above code is in if block : + if (locallockowner->nLocks < locallock->nLocks) the condition, locallock->nLocks >= 0, would always hold after the subtraction. Cheers
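The redundancy Zhihong points out can be illustrated with a simplified standalone sketch (hypothetical struct and function names, not the actual lock.c code): inside the branch guarded by locallockowner->nLocks < locallock->nLocks, the remainder after the subtraction is strictly positive, so an Assert(nLocks >= 0) there can never fire.

```c
#include <assert.h>

/* Hypothetical simplified counters; not the real LOCALLOCK structures. */
typedef struct { int nLocks; } LocalLockSketch;
typedef struct { int nLocks; } LocalLockOwnerSketch;

/*
 * Release one owner's share of a local lock's reference counts.
 * Returns 1 if the lock is now fully released, 0 if counts remain.
 */
static int
release_owner_counts(LocalLockSketch *locallock, LocalLockOwnerSketch *owner)
{
    if (owner->nLocks < locallock->nLocks)
    {
        /*
         * Partial release: owner->nLocks < locallock->nLocks guarantees
         * the remainder is strictly positive, so no assertion is needed.
         */
        locallock->nLocks -= owner->nLocks;
        return 0;
    }
    /* The owner held every remaining count; release the lock entirely. */
    locallock->nLocks = 0;
    return 1;
}
```

A quick walk-through: with 5 total counts and an owner holding 2, a partial release leaves 3; a second owner holding the remaining 3 triggers the full-release path.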
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Fri, 3 Dec 2021 at 20:36, Michael Paquier wrote: > Two months later, this has been switched to RwF. I was discussing this patch with Andres. He's not very keen on my densehash hash table idea and suggested that instead of relying on trying to make the hash table iteration faster, why don't we just ditch the lossiness of the resource owner code which only records the first 16 locks, and just make it have a linked list of all locks. This is a little more along the lines of the original patch, however, it does not increase the size of the LOCALLOCK struct. I've attached a patch which does this. This was mostly written (quickly) by Andres, I just did a cleanup, fixed up a few mistakes and fixed a few bugs. I ran the same performance tests on this patch as I did back in [1]: -- Test 1. Select 1 record from a 140 partitioned table. Tests creating a large number of locks with a fast query. create table hp (a int, b int) partition by hash(a); select 'create table hp'||x||' partition of hp for values with (modulus 140, remainder ' || x || ');' from generate_series(0,139)x; create index on hp (b); insert into hp select x,x from generate_series(1, 14) x; analyze hp; select3.sql: select * from hp where b = 1 select3.sql master drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2099.708748 (without initial connection time) tps = 2100.398516 (without initial connection time) tps = 2094.882341 (without initial connection time) tps = 2113.218820 (without initial connection time) tps = 2104.717597 (without initial connection time) select3.sql patched drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2010.070738 (without initial connection time) tps = 1994.963606 (without initial connection time) tps = 1994.668849 (without initial connection time) tps = 1995.948168 (without initial connection time) tps = 1985.650945 (without initial connection time) You can see that there's a performance regression here. 
I've not yet studied why this appears. -- Test 2. Tests a prepared query which will perform a generic plan on the 6th execution then fall back to a custom plan due to it pruning all but one partition. Master suffers from the lock table becoming bloated after locking all partitions when planning the generic plan. create table ht (a int primary key, b int, c int) partition by hash (a); select 'create table ht' || x::text || ' partition of ht for values with (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,8191) x; \gexec select.sql: \set p 1 select * from ht where a = :p select.sql master drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 18014.460090 (without initial connection time) tps = 17973.358889 (without initial connection time) tps = 17847.480647 (without initial connection time) tps = 18038.332507 (without initial connection time) tps = 17776.143206 (without initial connection time) select.sql patched drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 32393.457106 (without initial connection time) tps = 32277.204349 (without initial connection time) tps = 32160.719830 (without initial connection time) tps = 32530.038130 (without initial connection time) tps = 32299.019657 (without initial connection time) You can see that there are some quite good performance gains with this test. I'm going to add this to the January commitfest. David [1] https://www.postgresql.org/message-id/CAKJS1f8Lt0kS4bb5EH%3DhV%2BksqBZNnmVa8jujoYBYu5PVhWbZZg%40mail.gmail.com speedup_releasing_all_locks.patch Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Fri, Oct 01, 2021 at 04:03:09PM +0900, Michael Paquier wrote: > This last update was two months ago, and the patch has not moved > since: > https://commitfest.postgresql.org/34/3220/ > > Do you have plans to work more on that or perhaps the CF entry should > be withdrawn or RwF'd? Two months later, this has been switched to RwF. -- Michael signature.asc Description: PGP signature
Re: Speed up transaction completion faster after many relations are accessed in a transaction
I've made some remarks in a related thread: https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1FB976EF@G01JPEXMBYT05 The new status of this patch is: Waiting on Author
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Tue, Jul 20, 2021 at 05:04:19PM +1200, David Rowley wrote: > On Mon, 12 Jul 2021 at 19:23, David Rowley wrote: > > I also adjusted the hash seq scan code so that it performs better when > > faced a non-sparsely populated table. Previously my benchmark for > > that case didn't do well [2]. > > I was running some select only pgbench tests today on an AMD 3990x > machine with a large number of processes. > > I saw that LockReleaseAll was coming up on the profile a bit at: This last update was two months ago, and the patch has not moved since: https://commitfest.postgresql.org/34/3220/ Do you have plans to work more on that or perhaps the CF entry should be withdrawn or RwF'd? -- Michael signature.asc Description: PGP signature
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 12 Jul 2021 at 19:23, David Rowley wrote: > I also adjusted the hash seq scan code so that it performs better when > faced a non-sparsely populated table. Previously my benchmark for > that case didn't do well [2]. I was running some select-only pgbench tests today on an AMD 3990x machine with a large number of processes. I saw that LockReleaseAll was coming up on the profile a bit at: Master: 0.77% postgres [.] LockReleaseAll I wondered if this patch would help, so I tried it and got: dense hash lockrelease all: 0.67% postgres [.] LockReleaseAll It's a very small improvement which translated to about a 0.62% gain in tps. It made me think it might be worth doing something about this, as LockReleaseAll can show up even when releasing small numbers of locks. pgbench -T 240 -P 10 -c 132 -j 132 -S -M prepared --random-seed=12345 postgres

Units = tps

Sec    master       dense hash LockReleaseAll
10     3758201.2    3713521.5     98.81%
20     3810125.5    3844142.9    100.89%
30     3806505.1    3848458      101.10%
40     3816094.8    3855706.6    101.04%
50     3820317.2    3851717.7    100.82%
60     3827809      3851499.4    100.62%
70     3828757.9    3849312      100.54%
80     3824492.1    3852378.8    100.73%
90     3816502.1    3854793.8    101.00%
100    3819124.1    3860418.6    101.08%
110    3816154.3    3845327.7    100.76%
120    3817070.5    3845842.5    100.75%
130    3815424.7    3847626      100.84%
140    3823631.1    3846760.6    100.60%
150    3820963.8    3840196.6    100.50%
160    3827737      3841149.3    100.35%
170    3827779.2    3840130.9    100.32%
180    3829352      3842814.5    100.35%
190    3825518.3    3841991      100.43%
200    3823477.2    3839390.7    100.42%
210    3809304.3    3836433.5    100.71%
220    3814328.5    3842073.7    100.73%
230    3811399.3    3843780.7    100.85%
avg    3816959.53   3840672.478  100.62%

David
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 21 Jun 2021 at 01:56, David Rowley wrote: > # A new hashtable implementation > Also, it would be good to hear what people think about solving the > problem this way. Because over on [1] I'm also trying to improve the performance of smgropen(), I posted the patch for the new hash table over there too. Between that thread and discussions with Thomas and Andres off-list, I get the idea that pretty much nobody likes the idea of having another hash table implementation. Thomas wants to solve it another way and Andres has concerns that working with bitmaps is too slow. Andres suggested freelists would be faster, but I'm not really a fan of the idea as, unless I have a freelist per array segment, I won't be able to quickly identify the earliest segment slot to re-fill unused slots with. That would mean memory would get more fragmented over time instead of the fragmentation being slowly fixed as new items are added after deletes. So I've not really tried implementing that to see how it performs. Both Andres and Thomas expressed a dislike of the name "generichash" too. Anyway, since I did make a few small changes to the hash table implementation before doing all that off-list talking, I thought I should at least post where I got to here so that anything else that comes up can get compared to where I got to, instead of where I was with this. I did end up renaming the hash table to "densehash" rather than generichash. Naming is hard, but I went with dense as memory density was on my mind when I wrote it. Having a compact 8-byte bucket width and packing the data into arrays in a dense way. The word dense came up a few times, so I went with that. I also adjusted the hash seq scan code so that it performs better when faced with a non-sparsely populated table. Previously my benchmark for that case didn't do well [2]. I've attached the benchmark results from running the benchmark that's included in hashbench.tar.bz2.
I ran this 10 times using the included test.sh with ./test.sh 10. I included the results I got on my AMD machine in the attached bz2 file in results.csv. You can see from the attached dense_vs_generic_vs_simple.png that dense hash is quite comparable to simplehash for inserts/deletes and lookups. It's not quite as fast as simplehash at iterations when the table is not bloated, but blows simplehash out of the water when the hashtables have become bloated due to having once contained a large number of records but no longer do. Anyway, unless there is some interest in me taking this idea further, then due to the general feedback received on [1], I'm not planning on pushing this any further. I'll leave the commitfest entry as is for now to give others a chance to see this. David [1] https://postgr.es/m/CAApHDvowgRaQupC%3DL37iZPUzx1z7-N8deD7TxQSm8LR%2Bf4L3-A%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAApHDvpuzJTQNKQ_bnAccvi-68xuh%2Bv87B4P6ycU-UiN0dqyTg%40mail.gmail.com hashbench.tar.bz2 Description: Binary data densehash_for_lockreleaseall.patch Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Thanks for having a look at this. On Mon, 21 Jun 2021 at 05:02, Zhihong Yu wrote: > + * GH_ELEMENT_TYPE defines the data type that the hashtable stores. Each > + * instance of GH_ELEMENT_TYPE which is stored in the hash table is done so > + * inside a GH_SEGMENT. > > I think the second sentence can be written as (since done means stored, it is > redundant): I've reworded this entire paragraph slightly. > Each instance of GH_ELEMENT_TYPE is stored in the hash table inside a > GH_SEGMENT. > > + * Macros for translating a bucket's index into the segment and another to > + * determine the item number within the segment. > + */ > +#define GH_INDEX_SEGMENT(i)(i) / GH_ITEMS_PER_SEGMENT > > into the segment -> into the segment number (in the code I see segidx but I > wonder if segment index may cause slight confusion). I've adjusted this comment. > + GH_BITMAP_WORD used_items[GH_BITMAP_WORDS]; /* A 1-bit for each used item > +* in the items array */ > > 'A 1-bit' -> One bit (A and 1 mean the same) I think you might have misread this. We're storing a 1-bit for each used item rather than a 0-bit. If I remove the 'A' then it's not clear what the meaning of each bit's value is. > + uint32 first_free_segment; > > Since the segment may not be totally free, maybe name the field > first_segment_with_free_slot I don't really like that. It feels too long to me. > + * This is similar to GH_NEXT_ONEBIT but flips the bits before operating on > + * each GH_BITMAP_WORD. > > It seems the only difference from GH_NEXT_ONEBIT is in this line: > > + GH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */ > > If a 4th parameter is added to signify whether the flipping should be done, > these two functions can be unified. I don't want to do that. I'd rather have them separate to ensure the compiler does not create any additional needless branching. Those functions are pretty hot. > +* next insert will store in this segment. 
If it's already pointing to an > +* earlier segment, then leave it be. > > The last sentence is confusing: first_free_segment cannot point to some > segment and earlier segment at the same time. > Maybe drop the last sentence. I've adjusted this comment to become: * Check if we need to lower the first_free_segment. We want new inserts * to fill the lower segments first, so only change the first_free_segment * if the removed entry was stored in an earlier segment. Thanks for having a look at this. I'll attach an updated patch soon. David
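For readers following the GH_NEXT_ONEBIT discussion, the bit-scanning pattern looks roughly like this standalone sketch (illustrative only, not the patch's actual code; it uses the GCC/Clang __builtin_ctzll builtin). The zero-bit variant, used to locate a free slot, is the same loop with the word complemented first, which is why keeping the two functions separate avoids an extra branch in a hot loop:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t bitmap_word;
#define BITS_PER_WORD 64

/*
 * Return the index of the first 1-bit at or after 'start' in the
 * 'nwords'-long bitmap 'words', or -1 if there is none.
 */
static int
next_one_bit(const bitmap_word *words, int nwords, int start)
{
    int wordnum = start / BITS_PER_WORD;

    if (wordnum < nwords)
    {
        /* Mask off bits below 'start' in the first word we inspect. */
        bitmap_word mask = ~(bitmap_word) 0 << (start % BITS_PER_WORD);
        bitmap_word word = words[wordnum] & mask;

        for (;;)
        {
            if (word != 0)
                return wordnum * BITS_PER_WORD +
                    __builtin_ctzll(word);      /* count trailing zeros */
            if (++wordnum >= nwords)
                break;
            word = words[wordnum];
        }
    }
    return -1;
}

/* The free-slot variant: identical loop, but with the bits flipped. */
static int
next_zero_bit(const bitmap_word *words, int nwords, int start)
{
    int wordnum = start / BITS_PER_WORD;

    if (wordnum < nwords)
    {
        bitmap_word mask = ~(bitmap_word) 0 << (start % BITS_PER_WORD);
        bitmap_word word = ~words[wordnum] & mask;  /* flip bits */

        for (;;)
        {
            if (word != 0)
                return wordnum * BITS_PER_WORD + __builtin_ctzll(word);
            if (++wordnum >= nwords)
                break;
            word = ~words[wordnum];
        }
    }
    return -1;
}
```

Flipping the bits turns the used-item bitmap into a free-list without storing anything extra, which is the property the patch relies on when reusing deleted slots.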
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Sun, Jun 20, 2021 at 6:56 AM David Rowley wrote: > On Wed, 14 Aug 2019 at 19:25, David Rowley > wrote: > > For now, I'm out of ideas. If anyone else feels like suggesting > > something of picking this up, feel free. > > This is a pretty old thread, so we might need a recap: > > # Recap > > Basically LockReleaseAll() becomes slow after a large number of locks > have all been held at once by a backend. The problem is that the > LockMethodLocalHash dynahash table must grow to store all the locks > and when later transactions only take a few locks, LockReleaseAll() is > slow due to hash_seq_search() having to skip the sparsely filled > buckets in the bloated hash table. > > The following things were tried on this thread. Each one failed: > > 1) Use a dlist in LOCALLOCK to track the next and prev LOCALLOCK. > Simply loop over the dlist in LockReleaseAll(). > 2) Try dropping and recreating the dynahash table if it becomes > bloated using some heuristics to decide what "bloated" means and if > recreating is worthwhile. > > #1 failed due to concerns with increasing the size of LOCALLOCK to > store the dlist pointers. Performance regressions were seen too. > Possibly due to size increase or additional overhead from pushing onto > the dlist. > #2 failed because it was difficult to agree on what the heuristics > would be and we had no efficient way to determine the maximum number > of locks that a given transaction held at any one time. We only know > how many were still held at LockReleaseAll(). > > There were also some suggestions to fix dynahash's hash_seq_search() > slowness, and also a suggestion to try using simplehash.h instead of > dynahash.c. Unfortunately simplehash.h would suffer the same issues as > it too would have to skip over empty buckets in a sparsely populated > hash table. > > I'd like to revive this effort as I have a new idea on how to solve the > problem. 
> > # Background > > Over in [1] I'm trying to improve the performance of smgropen() during > recovery. The slowness there comes from the dynahash table lookups to > find the correct SMgrRelation. Over there I proposed to use simplehash > instead of dynahash because it's quite a good bit faster and far > lessens the hash lookup overhead during recovery. One problem on that > thread is that relcache keeps a pointer into the SMgrRelation > (RelationData.rd_smgr) and because simplehash moves things around > during inserts and deletes, then we can't have anything point to > simplehash entries, they're unstable. I fixed that over on the other > thread by having the simplehash entry point to a palloced > SMgrRelationData... My problem is, I don't really like that idea as it > means we need to palloc() pfree() lots of little chunks of memory. > > To fix the above, I really think we need a version of simplehash that > has stable pointers. Providing that implementation is faster than > dynahash, then it will help solve the smgropen() slowness during > recovery. > > # A new hashtable implementation > > I ended up thinking of this thread again because the implementation of > the stable pointer hash that I ended up writing for [1] happens to be > lightning fast for hash table sequential scans, even if the table has > become bloated. The reason the seq scans are so fast is that the > implementation loops over the data arrays, which are tightly packed > and store the actual data rather than pointers to the data. The code > does not need to loop over the bucket array for this at all, so how > large that has become is irrelevant to hash table seq scan > performance. > > The patch stores elements in "segments" which is set to some power of > 2 value. When we run out of space to store new items in a segment, we > just allocate another segment. When we remove items from the table, > new items reuse the first unused item in the first segment with free > space. 
This helps to keep the used elements tightly packed. A > segment keeps a bitmap of used items so that means scanning all used > items is very fast. If you flip the bits in the used_item bitmap, > then you get a free list for that segment, so it's also very fast to > find a free element when inserting a new item into the table. > > I've called the hash table implementation "generichash". It uses the > same preprocessor tricks as simplehash.h does and borrows the same > linear probing code that's used in simplehash. The bucket array just > stores the hash value and a uint32 index into the segment item that > stores the data. Since segments store a power of 2 items, we can > easily address both the segment number and the item within the segment > from the single uint32 index value. The 'index' field just stores a > special value when the bucket is empty. No need to add another field > for that. This means the bucket type is just 8 bytes wide. > > One thing I will mention about the new hash table implementation is > that GH_ITEMS_PER_SEGMENT is, by default, set to 256. This means > that's
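The bucket layout described in this message amounts to a power-of-2 split of a single uint32 index, which the compiler reduces to a shift and a mask. A sketch (illustrative names loosely mirroring GH_INDEX_SEGMENT, not the patch's actual code; 256 items per segment as the mail mentions, and the 8-byte bucket holding just a hash value and an index):

```c
#include <assert.h>
#include <stdint.h>

#define ITEMS_PER_SEGMENT 256           /* must be a power of 2 */

/* Which segment an item index falls in, and its slot within it. */
#define INDEX_SEGMENT(i)  ((i) / ITEMS_PER_SEGMENT)
#define INDEX_ITEM(i)     ((i) % ITEMS_PER_SEGMENT)

/*
 * An 8-byte bucket: the hash value plus a uint32 index into the
 * segment arrays.  A special sentinel index value marks an empty
 * bucket, so no extra "in use" field is needed.
 */
typedef struct
{
    uint32_t hashvalue;
    uint32_t index;
} bucket;
```

Because the buckets only hold indexes, the bucket array can be grown or rebuilt freely without moving the items themselves, which is what gives the table its stable pointers.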
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Wed, 14 Aug 2019 at 19:25, David Rowley wrote: > For now, I'm out of ideas. If anyone else feels like suggesting > something or picking this up, feel free.

This is a pretty old thread, so we might need a recap:

# Recap

Basically, LockReleaseAll() becomes slow after a large number of locks have all been held at once by a backend. The problem is that the LockMethodLocalHash dynahash table must grow to store all the locks, and when later transactions take only a few locks, LockReleaseAll() is slow due to hash_seq_search() having to skip the sparsely filled buckets in the bloated hash table.

The following things were tried on this thread. Each one failed:

1) Use a dlist in LOCALLOCK to track the next and prev LOCALLOCK. Simply loop over the dlist in LockReleaseAll().

2) Try dropping and recreating the dynahash table if it becomes bloated, using some heuristics to decide what "bloated" means and whether recreating is worthwhile.

#1 failed due to concerns about increasing the size of LOCALLOCK to store the dlist pointers. Performance regressions were seen too, possibly due to the size increase or the additional overhead of pushing onto the dlist.

#2 failed because it was difficult to agree on what the heuristics would be, and we had no efficient way to determine the maximum number of locks that a given transaction held at any one time. We only know how many were still held at LockReleaseAll().

There were also some suggestions to fix dynahash's hash_seq_search() slowness, and also a suggestion to try using simplehash.h instead of dynahash.c. Unfortunately, simplehash.h would suffer the same issue, as it too would have to skip over empty buckets in a sparsely populated hash table.

I'd like to revive this effort as I have a new idea on how to solve the problem.

# Background

Over in [1] I'm trying to improve the performance of smgropen() during recovery. The slowness there comes from the dynahash table lookups to find the correct SMgrRelation.
Over there I proposed to use simplehash instead of dynahash because it's quite a bit faster and far lessens the hash lookup overhead during recovery. One problem on that thread is that relcache keeps a pointer into the SMgrRelation (RelationData.rd_smgr), and because simplehash moves things around during inserts and deletes, we can't have anything point to simplehash entries; they're unstable. I fixed that over on the other thread by having the simplehash entry point to a palloc'd SMgrRelationData... My problem is, I don't really like that idea, as it means we need to palloc()/pfree() lots of little chunks of memory.

To fix the above, I really think we need a version of simplehash that has stable pointers. Provided that implementation is faster than dynahash, it will help solve the smgropen() slowness during recovery.

# A new hashtable implementation

I ended up thinking of this thread again because the implementation of the stable pointer hash that I ended up writing for [1] happens to be lightning fast for hash table sequential scans, even if the table has become bloated. The reason the seq scans are so fast is that the implementation loops over the data arrays, which are tightly packed and store the actual data rather than pointers to the data. The code does not need to loop over the bucket array for this at all, so how large that has become is irrelevant to hash table seq scan performance.

The patch stores elements in "segments", each sized to some power of 2. When we run out of space to store new items in a segment, we just allocate another segment. When we remove items from the table, new items reuse the first unused item in the first segment with free space. This helps to keep the used elements tightly packed. A segment keeps a bitmap of used items, which means scanning all used items is very fast.
If you flip the bits in the used_item bitmap, then you get a free list for that segment, so it's also very fast to find a free element when inserting a new item into the table.

I've called the hash table implementation "generichash". It uses the same preprocessor tricks as simplehash.h does and borrows the same linear probing code that's used in simplehash. The bucket array just stores the hash value and a uint32 index into the segment item that stores the data. Since segments store a power of 2 number of items, we can easily address both the segment number and the item within the segment from the single uint32 index value. The 'index' field just stores a special value when the bucket is empty. No need to add another field for that. This means the bucket type is just 8 bytes wide.

One thing I will mention about the new hash table implementation is that GH_ITEMS_PER_SEGMENT is, by default, set to 256. This means that's the minimum size for the table. I could drop this down to 64, but that's still quite a bit more than the default size of the dynahash table of 16. I think 16 is a bit on the low side and that it might be worth making it 64 anyway.
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi, the patch has been in WoA since December, waiting for a rebase. I've marked it as returned with feedback. Feel free to re-submit an updated version into the next CF. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Thu, Sep 26, 2019 at 07:11:53AM +, Tsunakawa, Takayuki wrote: > I'm sorry to repeat what I mentioned in my previous mail, but my v2 > patch's approach is based on the database textbook and seems > intuitive. So I attached the rebased version. If you wish to do so, that's fine by me, but I have not dived into the details of the thread much. Please note anyway that the patch does not apply anymore and that it needs a rebase. So for now I have moved the patch to the next CF, waiting on author. -- Michael signature.asc Description: PGP signature
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com] > On 2019-Sep-03, Tsunakawa, Takayuki wrote: > > I don't think it's rejected. It would be a pity (mottainai) to refuse > > this, because it provides significant speedup despite its simple > > modification. > > I don't necessarily disagree with your argumentation, but Travis is > complaining thusly:

I tried to revise David's latest patch (v8) and address Tom's comments in his last mail. But I'm a bit at a loss.

First, to accurately count the maximum number of acquired locks in a transaction, we need to track the maximum number of entries in the hash table and make it available via a new function like hash_get_max_entries(). However, to cover the shared partitioned hash table (which is not necessary for LockMethodLocalHash), we would have to add a spinlock in hashhdr and lock/unlock it when entering and removing entries in the hash table. That would spoil the effort to decrease contention via hashhdr->freelists[].mutex. Do we want to track the maximum number of acquired locks in a global variable in lock.c, not in the hash table?

Second, I couldn't understand the comment about the fill factor well. I can understand that it's not correct to compare the number of hash buckets and the number of locks. But what can we do?

I'm sorry to repeat what I mentioned in my previous mail, but my v2 patch's approach is based on the database textbook and seems intuitive. So I attached the rebased version.

Regards Takayuki Tsunakawa

faster-locallock-scan_v3.patch Description: faster-locallock-scan_v3.patch
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 2019-Sep-03, Tsunakawa, Takayuki wrote: > From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com] > > Hmm ... is this patch rejected, or is somebody still trying to get it to > > committable state? David, you're listed as committer. > > I don't think it's rejected. It would be a pity (mottainai) to refuse > this, because it provides significant speedup despite its simple > modification. I don't necessarily disagree with your argumentation, but Travis is complaining thusly:

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 -Wall -Werror -I../../../../src/include -I/usr/include/x86_64-linux-gnu -D_GNU_SOURCE -c -o lock.o lock.c
lock.c:486:1: error: conflicting types for ‘TryShrinkLocalLockHash’
 TryShrinkLocalLockHash(long numLocksHeld)
 ^
lock.c:351:20: note: previous declaration of ‘TryShrinkLocalLockHash’ was here
 static inline void TryShrinkLocalLockHash(void);
 ^
: recipe for target 'lock.o' failed

Please fix. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com] > Hmm ... is this patch rejected, or is somebody still trying to get it to > committable state? David, you're listed as committer. I don't think it's rejected. It would be a pity (mottainai) to refuse this, because it provides significant speedup despite its simple modification. Again, I think the v2 patch is OK. Tom's comment was as follows: [Tom's comment against v2] FWIW, I tried this patch against current HEAD (959d00e9d). Using the test case described by Amit at I do measure an undeniable speedup, close to 35%. However ... I submit that that's a mighty extreme test case. (I had to increase max_locks_per_transaction to get it to run at all.) We should not be using that sort of edge case to drive performance optimization choices. If I reduce the number of partitions in Amit's example from 8192 to something more real-world, like 128, I do still measure a performance gain, but it's ~ 1.5% which is below what I'd consider a reproducible win. I'm accustomed to seeing changes up to 2% in narrow benchmarks like this one, even when "nothing changes" except unrelated code. Trying a standard pgbench test case (pgbench -M prepared -S with one client and an -s 10 database), it seems that the patch is about 0.5% slower than HEAD. Again, that's below the noise threshold, but it's not promising for the net effects of this patch on workloads that aren't specifically about large and prunable partition sets. I'm also fairly concerned about the effects of the patch on sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88 bytes, a 22% increase. That's a lot if you're considering cases with many locks. On the whole I don't think there's an adequate case for committing this patch. I'd also point out that this is hardly the only place where we've seen hash_seq_search on nearly-empty hash tables become a bottleneck. So I'm not thrilled about attacking that with one-table-at-time patches. 
I'd rather see us do something to let hash_seq_search win across the board.

* Extreme test case: Not extreme. Two of our customers, who are considering using PostgreSQL, are using thousands of partitions now. We hit this issue -- a point query gets nearly 20% slower after automatically creating a generic plan. That's the reason for this proposal.

* 0.5% slowdown with pgbench: I think it's below the noise, as Tom said.

* sizeof(LOCALLOCK): As Andres replied to Tom in the immediately following mail, LOCALLOCK was bigger in PG 11.

* Use case is narrow: No. The bloated LockMethodLocalHash affects the performance of the items below as well as transaction commit/abort:
- AtPrepare_Locks() and PostPrepare_Locks(): the hash table is scanned twice in PREPARE!
- LockReleaseSession: advisory lock
- LockReleaseCurrentOwner: ??
- LockReassignCurrentOwner: ??

Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Wed, Aug 14, 2019 at 07:25:10PM +1200, David Rowley wrote: On Thu, 25 Jul 2019 at 05:49, Tom Lane wrote: On the whole, I don't especially like this approach, because of the confusion between peak lock count and end-of-xact lock count. That seems way too likely to cause problems. Thanks for having a look at this. I've not addressed the points you've mentioned due to what you mention above. The only way I can think of so far to resolve that would be to add something to track peak lock usage. The best I can think of to do that, short of adding something to dynahash.c is to check how many locks are held each time we obtain a lock, then if that count is higher than the previous time we checked, then update the maximum locks held, (probably a global variable). That seems pretty horrible to me and adds overhead each time we obtain a lock, which is a pretty performance-critical path. Would it really be a measurable overhead? I mean, we only really need one int counter, and you don't need to do the check on every lock acquisition - you just need to recheck on the first lock release. But maybe I'm underestimating how expensive it is ... Talking about dynahash - doesn't it already track this information? Maybe not directly but surely it has to track the number of entries in the hash table, in order to compute fill factor. Can't we piggy-back on that and track the highest fill-factor for a particular period of time? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Thu, 25 Jul 2019 at 05:49, Tom Lane wrote: > On the whole, I don't especially like this approach, because of the > confusion between peak lock count and end-of-xact lock count. That > seems way too likely to cause problems. Thanks for having a look at this. I've not addressed the points you've mentioned due to what you mention above. The only way I can think of so far to resolve that would be to add something to track peak lock usage. The best I can think of to do that, short of adding something to dynahash.c, is to check how many locks are held each time we obtain a lock, then, if that count is higher than the previous time we checked, update the maximum locks held (probably a global variable). That seems pretty horrible to me and adds overhead each time we obtain a lock, which is a pretty performance-critical path. I've not tested what Andres mentioned about simplehash instead of dynahash yet. I did a quick scan of simplehash and it looked like SH_START_ITERATE would suffer the same problems as dynahash's hash_seq_search(), albeit perhaps with some more efficient memory lookups, i.e. it still has to skip over empty buckets, which might be costly in a bloated table. For now, I'm out of ideas. If anyone else feels like suggesting something or picking this up, feel free. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Thu, Jul 25, 2019 at 5:49 AM Tom Lane wrote: > David Rowley writes: > > Here's a more polished version with the debug code removed, complete > > with benchmarks. > > A few gripes: > > [gripes] Based on the above, I've set this to "Waiting on Author", in the next CF. -- Thomas Munro https://enterprisedb.com
Re: Speed up transaction completion faster after many relations are accessed in a transaction
David Rowley writes: > Here's a more polished version with the debug code removed, complete > with benchmarks. A few gripes: You're measuring the number of locks held at completion of the transaction, which fails to account for locks transiently taken and released, so that the actual peak usage will be more. I think we mostly only do that for system catalog accesses, so *maybe* the number of extra locks involved isn't very large, but that's a very shaky assumption. I don't especially like the fact that this seems to have a hard-wired (and undocumented) assumption that buckets == entries, ie that the fillfactor of the table is set at 1.0. lock.c has no business knowing that. Perhaps instead of returning the raw bucket count, you could have dynahash.c return bucket count times fillfactor, so that the number is in the same units as the entry count. This: running_avg_locks -= running_avg_locks / 10.0; running_avg_locks += numLocksHeld / 10.0; seems like a weird way to do the calculation. Personally I'd write running_avg_locks += (numLocksHeld - running_avg_locks) / 10.0; which is more the way I'm used to seeing exponential moving averages computed --- at least, it seems clearer to me why this will converge towards the average value of numLocksHeld over time. It also makes it clear that it wouldn't be sane to use two different divisors, whereas your formulation makes it look like maybe they could be set independently. Your compiler might not complain that LockReleaseAll's numLocksHeld is potentially uninitialized, but other people's compilers will. On the whole, I don't especially like this approach, because of the confusion between peak lock count and end-of-xact lock count. That seems way too likely to cause problems. regards, tom lane
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Wed, 24 Jul 2019 at 16:16, David Rowley wrote: > > On Wed, 24 Jul 2019 at 15:05, David Rowley > wrote: > > To be able to reduce the threshold down again we'd need to make a > > hash_get_num_entries(LockMethodLocalHash) call before performing the > > guts of LockReleaseAll(). We could then weight that onto some running > > average counter with a weight of, say... 10, so we react to changes > > fairly quickly, but not instantly. We could then have some sort of > > logic like "rebuild the hash table if running average 4 times less > > than max_bucket" > > > > I've attached a spreadsheet of that idea and the algorithm we could > > use to track the running average. Initially, I've mocked it up a > > series of transactions that use 1000 locks, then at row 123 dropped > > that to 10 locks. If we assume the max_bucket is 1000, then it takes > > until row 136 for the running average to drop below the max_bucket > > count, i.e 13 xacts. There we'd reset there at the init size of 16. If > > the average went up again, then we'd automatically expand the table as > > we do now. To make this work we'd need an additional call to > > hash_get_num_entries(), before we release the locks, so there is more > > overhead. > > Here's a patch with this implemented. I've left a NOTICE in there to > make it easier for people to follow along at home and see when the > lock table is reset. Here's a more polished version with the debug code removed, complete with benchmarks. -- Test 1. Select 1 record from a 140 partitioned table. Tests creating a large number of locks with a fast query. 
create table hp (a int, b int) partition by hash(a); select 'create table hp'||x||' partition of hp for values with (modulus 140, remainder ' || x || ');' from generate_series(0,139)x; create index on hp (b); insert into hp select x,x from generate_series(1, 14) x; analyze hp; select3.sql: select * from hp where b = 1 - Master: $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2098.628975 (excluding connections establishing) tps = 2101.797672 (excluding connections establishing) tps = 2085.317292 (excluding connections establishing) tps = 2094.931999 (excluding connections establishing) tps = 2092.328908 (excluding connections establishing) Patched: $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2101.691459 (excluding connections establishing) tps = 2104.533249 (excluding connections establishing) tps = 2106.499123 (excluding connections establishing) tps = 2104.033459 (excluding connections establishing) tps = 2105.463629 (excluding connections establishing) (I'm surprised there is not more overhead in the additional tracking added to calculate the running average) -- Test 2. Tests a prepared query which will perform a generic plan on the 6th execution then fallback on a custom plan due to it pruning all but one partition. Master suffers from the lock table becoming bloated after locking all partitions when planning the generic plan. 
create table ht (a int primary key, b int, c int) partition by hash (a); select 'create table ht' || x::text || ' partition of ht for values with (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,8191) x; \gexec select.sql: \set p 1 select * from ht where a = :p Master: $ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 10207.780843 (excluding connections establishing) tps = 10205.772688 (excluding connections establishing) tps = 10214.896449 (excluding connections establishing) tps = 10157.572153 (excluding connections establishing) tps = 10147.764477 (excluding connections establishing) Patched: $ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 14677.636570 (excluding connections establishing) tps = 14661.437186 (excluding connections establishing) tps = 14647.202877 (excluding connections establishing) tps = 14784.165753 (excluding connections establishing) tps = 14850.355344 (excluding connections establishing) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services shrink_bloated_locallocktable_v8.patch Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi, On 2019-07-21 21:37:28 +1200, David Rowley wrote: > select.sql: > \set p 1 > select * from ht where a = :p > > Master: > > $ pgbench -n -f select.sql -T 60 -M prepared postgres > tps = 10172.035036 (excluding connections establishing) > tps = 10192.780529 (excluding connections establishing) > tps = 10331.306003 (excluding connections establishing) > > Patched: > > $ pgbench -n -f select.sql -T 60 -M prepared postgres > tps = 15080.765549 (excluding connections establishing) > tps = 14994.404069 (excluding connections establishing) > tps = 14982.923480 (excluding connections establishing) > > That seems fine, 46% faster. > > v6 is attached. > > I plan to push this in a few days unless someone objects. It does seem far less objectionable than the other case. I hate to throw in one more wrench into a topic finally making progress, but: Have either of you considered just replacing the dynahash table with a simplehash style one? Given the obvious speed sensitivity, and the fact that for it (in contrast to the shared lock table) no partitioning is needed, that seems like a good thing to try. It seems quite possible that both the iteration and plain manipulations are going to be faster, due to far less indirections - e.g. the iteration through the array will just be an array walk with a known stride, far easier for the CPU to prefetch. Greetings, Andres Freund
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Wed, 24 Jul 2019 at 15:05, David Rowley wrote: > To be able to reduce the threshold down again we'd need to make a > hash_get_num_entries(LockMethodLocalHash) call before performing the > guts of LockReleaseAll(). We could then weight that onto some running > average counter with a weight of, say... 10, so we react to changes > fairly quickly, but not instantly. We could then have some sort of > logic like "rebuild the hash table if running average 4 times less > than max_bucket" > > I've attached a spreadsheet of that idea and the algorithm we could > use to track the running average. Initially, I've mocked it up a > series of transactions that use 1000 locks, then at row 123 dropped > that to 10 locks. If we assume the max_bucket is 1000, then it takes > until row 136 for the running average to drop below the max_bucket > count, i.e 13 xacts. There we'd reset there at the init size of 16. If > the average went up again, then we'd automatically expand the table as > we do now. To make this work we'd need an additional call to > hash_get_num_entries(), before we release the locks, so there is more > overhead. Here's a patch with this implemented. I've left a NOTICE in there to make it easier for people to follow along at home and see when the lock table is reset. There will be a bit of additional overhead to the reset detection logic over the v7 patch. Namely: additional hash_get_num_entries() call before releasing the locks, and more complex and floating-point maths instead of more simple integer maths in v7. Here's a demo with the debug NOTICE in there to show us what's going on. -- setup create table a (a int) partition by range (a); select 'create table a'||x||' partition of a for values from('||x||') to ('||x+1||');' from generate_Series(1,1000) x; \gexec $ psql postgres NOTICE: max_bucket = 15, threshold = 64.00, running_avg_locks 0.10 Reset? 
No psql (13devel) # \o /dev/null # select * from a where a < 100; NOTICE: max_bucket = 101, threshold = 64.00, running_avg_locks 10.09 Reset? Yes # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 76.324005, running_avg_locks 19.081001 Reset? Yes # select * from a where a < 100; A couple of needless resets there... Maybe we can get rid of those by setting the initial running average up to something higher than 0.0. NOTICE: max_bucket = 99, threshold = 108.691605, running_avg_locks 27.172901 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 137.822449, running_avg_locks 34.455612 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 164.040207, running_avg_locks 41.010052 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 187.636185, running_avg_locks 46.909046 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 208.872559, running_avg_locks 52.218140 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 227.985306, running_avg_locks 56.996326 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 245.186768, running_avg_locks 61.296692 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 260.668091, running_avg_locks 65.167023 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 274.601288, running_avg_locks 68.650322 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 287.141174, running_avg_locks 71.785294 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 298.427063, running_avg_locks 74.606766 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 308.584351, running_avg_locks 77.146088 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 317.725922, running_avg_locks 79.431480 Reset? 
No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 325.953339, running_avg_locks 81.488335 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 333.358002, running_avg_locks 83.339500 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 340.022217, running_avg_locks 85.005554 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 346.019989, running_avg_locks 86.504997 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 351.417999, running_avg_locks 87.854500 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 356.276184, running_avg_locks 89.069046 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 360.648560, running_avg_locks 90.162140 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 364.583710, running_avg_locks 91.145927 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 368.125336, running_avg_locks 92.031334 Reset? No # select * from a
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Thanks for having a look at this. On Wed, 24 Jul 2019 at 04:13, Tom Lane wrote: > > David Rowley writes: > > I'm pretty happy with v7 now. If anyone has any objections to it, > > please speak up very soon. Otherwise, I plan to push it about this > > time tomorrow. > > I dunno, this seems close to useless in this form. As it stands, > once hash_get_max_bucket has exceeded the threshold, you will > arbitrarily reset the table 1000 transactions later (since the > max bucket is certainly not gonna decrease otherwise). So that's > got two big problems: > > 1. In the assumed-common case where most of the transactions take > few locks, you wasted cycles for 999 transactions. > > 2. You'll reset the table independently of subsequent history, > even if the session's usage is pretty much always over the > threshold. Admittedly, if you do this only once per 1K > transactions, it's probably not a horrible overhead --- but you > can't improve point 1 without making it a bigger overhead. This is true, but I think you might be overestimating just how much effort is wasted with #1. We're only seeing this overhead in small very fast to execute xacts. In the test case in [1], I was getting about 10k TPS unpatched and about 15k patched. This means that, on average, 1 unpatched xact takes 100 microseconds and 1 average patched xact takes 66 microseconds, so the additional time spent doing the hash_seq_search() must be about 34 microseconds. So we'll waste a total of 34 *milliseconds* if we wait for 1000 xacts before we reset the table. With 10k TPS we're going to react to the change in 0.1 seconds. I think you'd struggle to measure that 1 xact is taking 34 microseconds longer without running a few thousand queries. My view is that nobody is ever going to notice that it takes 1k xacts to shrink the table, and I've already shown that the overhead of the shrink every 1k xact is tiny. I mentioned 0.34% in [1] using v6. 
This is likely a bit smaller in v7 due to swapping the order of the if condition to put the less likely case first. Since the overhead of rebuilding the table was 7% when done every xact, it stands to reason that this has become 0.007% doing it every 1k xacts, and that the additional overhead making up that 0.34% is from testing if the reset condition has been met (or noise). That's not something we can remove completely with any solution that resets the hash table. > I did complain about the cost of tracking the stats proposed by > some earlier patches, but I don't think the answer to that is to > not track any stats at all. The proposed hash_get_max_bucket() > function is quite cheap, so I think it wouldn't be out of line to > track the average value of that at transaction end over the > session's lifespan, and reset if the current value is more than > some-multiple of the average. > > The tricky part here is that if some xact kicks that up to > say 64 entries, and we don't choose to reset, then the reading > for subsequent transactions will be 64 even if they use very > few locks. So we probably need to not use a plain average, > but account for that effect somehow. Maybe we could look at > how quickly the number goes up after we reset? > > [ thinks for awhile... ] As a straw-man proposal, I suggest > the following (completely untested) idea: > > * Make the table size threshold variable, not constant. > > * If, at end-of-transaction when the table is empty, > the table bucket count exceeds the threshold, reset > immediately; but if it's been less than K transactions > since the last reset, increase the threshold (by say 10%). > > I think K can be a constant; somewhere between 10 and 100 would > probably work. At process start, we should act as though the last > reset were more than K transactions ago (i.e., don't increase the > threshold at the first reset).
I think the problem with this idea is that once the threshold has been enlarged, there is no way to recover from that and work well again for workloads that require very few locks. If we end up with some large value for the variable threshold, there's no way to decrease it again. All this method stops is the needless hash table resets when the typical case requires many locks. The only way to know if we can reduce the threshold again is to count the locks released during LockReleaseAll(). Some version of the patch did that, and you objected. > The main advantage this has over v7 is that we don't have the > 1000-transaction delay before reset, which ISTM is giving up > much of the benefit of the whole idea. Also, if the session > is consistently using lots of locks, we'll adapt to that after > awhile and not do useless table resets. True, but you neglected to mention the looming and critical drawback, which pretty much makes that idea useless. All we'd need to do is give this a workload that throws that variable threshold up high so that it can't recover. It would be pretty simple then to show that LockReleaseAll() is still slow with
Re: Speed up transaction completion faster after many relations are accessed in a transaction
David Rowley writes: > I've attached v7, which really is v6 with some comments adjusted and > the order of the hash_get_num_entries and hash_get_max_bucket function > calls swapped. I think hash_get_num_entries() will return 0 most of > the time where we're calling it, so it makes sense to put the > condition that's less likely to be true first in the if condition. > I'm pretty happy with v7 now. If anyone has any objections to it, > please speak up very soon. Otherwise, I plan to push it about this > time tomorrow. I dunno, this seems close to useless in this form. As it stands, once hash_get_max_bucket has exceeded the threshold, you will arbitrarily reset the table 1000 transactions later (since the max bucket is certainly not gonna decrease otherwise). So that's got two big problems: 1. In the assumed-common case where most of the transactions take few locks, you wasted cycles for 999 transactions. 2. You'll reset the table independently of subsequent history, even if the session's usage is pretty much always over the threshold. Admittedly, if you do this only once per 1K transactions, it's probably not a horrible overhead --- but you can't improve point 1 without making it a bigger overhead. I did complain about the cost of tracking the stats proposed by some earlier patches, but I don't think the answer to that is to not track any stats at all. The proposed hash_get_max_bucket() function is quite cheap, so I think it wouldn't be out of line to track the average value of that at transaction end over the session's lifespan, and reset if the current value is more than some-multiple of the average. The tricky part here is that if some xact kicks that up to say 64 entries, and we don't choose to reset, then the reading for subsequent transactions will be 64 even if they use very few locks. So we probably need to not use a plain average, but account for that effect somehow. Maybe we could look at how quickly the number goes up after we reset? [ thinks for awhile... 
] As a straw-man proposal, I suggest the following (completely untested) idea: * Make the table size threshold variable, not constant. * If, at end-of-transaction when the table is empty, the table bucket count exceeds the threshold, reset immediately; but if it's been less than K transactions since the last reset, increase the threshold (by say 10%). I think K can be a constant; somewhere between 10 and 100 would probably work. At process start, we should act as though the last reset were more than K transactions ago (i.e., don't increase the threshold at the first reset). The main advantage this has over v7 is that we don't have the 1000-transaction delay before reset, which ISTM is giving up much of the benefit of the whole idea. Also, if the session is consistently using lots of locks, we'll adapt to that after awhile and not do useless table resets. regards, tom lane
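Tom's straw-man heuristic can be sketched roughly as follows. This is only an illustration of the idea with invented names and constants (maybe_reset_table, RESET_COOLDOWN_XACTS, and the concrete values are not from any actual patch), and like the original proposal it is completely untested against PostgreSQL itself:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constants for illustration only. */
#define INIT_THRESHOLD 16
#define RESET_COOLDOWN_XACTS 50      /* Tom's "K", somewhere between 10 and 100 */

static int shrink_threshold = INIT_THRESHOLD;
/* Act as though the last reset were more than K transactions ago. */
static long xacts_since_reset = RESET_COOLDOWN_XACTS + 1;

/*
 * Called at end of transaction when the local lock table is empty.
 * Returns true if the table should be rebuilt at its initial size.
 */
static bool
maybe_reset_table(int max_bucket)
{
    xacts_since_reset++;
    if (max_bucket <= shrink_threshold)
        return false;
    /* Resetting too frequently: this workload really needs a bigger table. */
    if (xacts_since_reset < RESET_COOLDOWN_XACTS)
        shrink_threshold += shrink_threshold / 10;   /* raise threshold ~10% */
    xacts_since_reset = 0;
    return true;    /* reset immediately */
}
```

Note how the threshold only ever ratchets upward, which is exactly the weakness David points out in his reply.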
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Tue, 23 Jul 2019 at 15:47, Tsunakawa, Takayuki wrote: > OTOH, how about my original patch that is based on the local lock list? I > expect that it won't cause as significant a slowdown in the same test case. If > it's not satisfactory, then v6 is the best to commit. I think we need to move beyond the versions that have been reviewed and rejected. I don't think the reasons for their rejection will have changed. I've attached v7, which really is v6 with some comments adjusted and the order of the hash_get_num_entries and hash_get_max_bucket function calls swapped. I think hash_get_num_entries() will return 0 most of the time where we're calling it, so it makes sense to put the condition that's less likely to be true first in the if condition. I'm pretty happy with v7 now. If anyone has any objections to it, please speak up very soon. Otherwise, I plan to push it about this time tomorrow. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services shrink_bloated_locallocktable_v7.patch Description: Binary data
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > Another counter-argument to this is that there's already an > unexplainable slowdown after you run a query which obtains a large > number of locks in a session or use prepared statements and a > partitioned table with the default plan_cache_mode setting. Are we not > just righting a wrong here? Albeit, possibly 1000 queries later. > > I am, of course, open to other ideas which solve the problem that v5 > has, but failing that, I don't see v6 as all that bad. At least all > the logic is contained in one function. I know Tom wanted to steer > clear of heuristics to reinitialise the table, but most of the stuff > that was in the patch back when he complained was trying to track the > average number of locks over the previous N transactions, and his > concerns were voiced before I showed the 7% performance regression > with unconditionally rebuilding the table. I think I understand what you mean. Sorry, I don't have a better idea. This unexplainability is probably what we should accept to avoid the shocking 7% slowdown. OTOH, how about my original patch that is based on the local lock list? I expect that it won't cause as significant a slowdown in the same test case. If it's not satisfactory, then v6 is the best to commit. Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 22 Jul 2019 at 12:48, Tsunakawa, Takayuki wrote: > > From: David Rowley [mailto:david.row...@2ndquadrant.com] > > I went back to the drawing board on this and I've added some code that > > counts > > the number of times we've seen the table to be oversized and just shrinks > > the table back down on the 1000th time. 6.93% / 1000 is not all that much. > > I'm afraid this kind of hidden behavior would appear mysterious to users. > They may wonder "Why is the same query fast at first in the session (5 or 6 > times of execution), then gets slower for a while, and gets faster again? Is > there something to tune? Am I missing something wrong with my app (e.g. how > to use prepared statements)?" So I prefer v5. Another counter-argument to this is that there's already an unexplainable slowdown after you run a query which obtains a large number of locks in a session or use prepared statements and a partitioned table with the default plan_cache_mode setting. Are we not just righting a wrong here? Albeit, possibly 1000 queries later. I am, of course, open to other ideas which solve the problem that v5 has, but failing that, I don't see v6 as all that bad. At least all the logic is contained in one function. I know Tom wanted to steer clear of heuristics to reinitialise the table, but most of the stuff that was in the patch back when he complained was trying to track the average number of locks over the previous N transactions, and his concerns were voiced before I showed the 7% performance regression with unconditionally rebuilding the table. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 22 Jul 2019 at 16:36, Tsunakawa, Takayuki wrote: > > From: David Rowley [mailto:david.row...@2ndquadrant.com] > > For the use case we've been measuring with partitioned tables and the > > generic plan generation causing a sudden spike in the number of > > obtained locks, then having plan_cache_mode = force_custom_plan will > > cause the lock table not to become bloated. I'm not sure there's > > anything interesting to measure there. > > I meant the difference between the following two cases, where the query only > touches one partition (e.g. SELECT ... WHERE pkey = value): > > * plan_cache_mode=force_custom_plan: LocalLockHash won't bloat. The query > execution time is steady. > > * plan_cache_mode=auto: LocalLockHash bloats on the sixth execution due to > the creation of the generic plan. The generic plan is not adopted because > its cost is high. Later executions of the query will suffer from the bloat > until the 1006th execution when LocalLockHash is shrunk. I measured this again in https://www.postgresql.org/message-id/CAKJS1f_ycJ-6QTKC70pZRYdwsAwUo+t0_CV0eXC=j-b5a2m...@mail.gmail.com where I posted the v6 patch. It's the final results in the email. I didn't measure plan_cache_mode = force_custom_plan. There'd be no lock table bloat in that case and the additional overhead would just be from the hash_get_num_entries() && hash_get_max_bucket() calls, which the first results show next to no overhead for. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > For the use case we've been measuring with partitioned tables and the > generic plan generation causing a sudden spike in the number of > obtained locks, then having plan_cache_mode = force_custom_plan will > cause the lock table not to become bloated. I'm not sure there's > anything interesting to measure there. I meant the difference between the following two cases, where the query only touches one partition (e.g. SELECT ... WHERE pkey = value): * plan_cache_mode=force_custom_plan: LocalLockHash won't bloat. The query execution time is steady. * plan_cache_mode=auto: LocalLockHash bloats on the sixth execution due to the creation of the generic plan. The generic plan is not adopted because its cost is high. Later executions of the query will suffer from the bloat until the 1006th execution when LocalLockHash is shrunk. Depending on the number of transactions and what each transaction does, I thought the difference will be noticeable or not. Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 22 Jul 2019 at 14:21, Tsunakawa, Takayuki wrote: > > From: David Rowley [mailto:david.row...@2ndquadrant.com] > > I personally don't think that's true. The only way you'll notice the > > LockReleaseAll() overhead is to execute very fast queries with a > > bloated lock table. It's pretty hard to notice that a single 0.1ms > > query is slow. You'll need to execute thousands of them before you'll > > be able to measure it, and once you've done that, the lock shrink code > > will have run and the query will be performing optimally again. > > Maybe so. Will the difference be noticeable between plan_cache_mode=auto > (default) and plan_cache_mode=custom? For the use case we've been measuring with partitioned tables and the generic plan generation causing a sudden spike in the number of obtained locks, then having plan_cache_mode = force_custom_plan will cause the lock table not to become bloated. I'm not sure there's anything interesting to measure there. The only additional code that gets executed is the hash_get_num_entries() and possibly hash_get_max_bucket. Maybe it's worth swapping the order of those calls since most of the time the entry will be 0 and the max bucket count threshold won't be exceeded. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > I personally don't think that's true. The only way you'll notice the > LockReleaseAll() overhead is to execute very fast queries with a > bloated lock table. It's pretty hard to notice that a single 0.1ms > query is slow. You'll need to execute thousands of them before you'll > be able to measure it, and once you've done that, the lock shrink code > will have run and the query will be performing optimally again. Maybe so. Will the difference be noticeable between plan_cache_mode=auto (default) and plan_cache_mode=custom? > I voiced my concerns with v5 and I wasn't really willing to push it > with a known performance regression of 7% in a fairly valid case. v6 > does not suffer from that. You're right. We may have to accept the unpredictability this hidden behavior presents to users as a compromise for higher throughput. > > Where else does the extra overhead come from? > > hash_get_num_entries(LockMethodLocalHash) == 0 && > + hash_get_max_bucket(LockMethodLocalHash) > > + LOCKMETHODLOCALHASH_SHRINK_THRESHOLD) > > that's executed every time, not every 1000 times. I see. Thanks. Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 22 Jul 2019 at 12:48, Tsunakawa, Takayuki wrote: > > From: David Rowley [mailto:david.row...@2ndquadrant.com] > > I went back to the drawing board on this and I've added some code that > > counts > > the number of times we've seen the table to be oversized and just shrinks > > the table back down on the 1000th time. 6.93% / 1000 is not all that much. > > I'm afraid this kind of hidden behavior would appear mysterious to users. > They may wonder "Why is the same query fast at first in the session (5 or 6 > times of execution), then gets slower for a while, and gets faster again? Is > there something to tune? Am I missing something wrong with my app (e.g. how > to use prepared statements)?" So I prefer v5. I personally don't think that's true. The only way you'll notice the LockReleaseAll() overhead is to execute very fast queries with a bloated lock table. It's pretty hard to notice that a single 0.1ms query is slow. You'll need to execute thousands of them before you'll be able to measure it, and once you've done that, the lock shrink code will have run and the query will be performing optimally again. I voiced my concerns with v5 and I wasn't really willing to push it with a known performance regression of 7% in a fairly valid case. v6 does not suffer from that. > > Of course, not all the extra overhead might be from rebuilding the table, > > so here's a test with the updated patch. > > Where else does the extra overhead come from? hash_get_num_entries(LockMethodLocalHash) == 0 && + hash_get_max_bucket(LockMethodLocalHash) > + LOCKMETHODLOCALHASH_SHRINK_THRESHOLD) that's executed every time, not every 1000 times. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > I went back to the drawing board on this and I've added some code that counts > the number of times we've seen the table to be oversized and just shrinks > the table back down on the 1000th time. 6.93% / 1000 is not all that much. I'm afraid this kind of hidden behavior would appear mysterious to users. They may wonder "Why is the same query fast at first in the session (5 or 6 times of execution), then gets slower for a while, and gets faster again? Is there something to tune? Am I missing something wrong with my app (e.g. how to use prepared statements)?" So I prefer v5. > Of course, not all the extra overhead might be from rebuilding the table, > so here's a test with the updated patch. Where else does the extra overhead come from? Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Thu, 18 Jul 2019 at 14:53, David Rowley wrote: > Is anyone particularly concerned about the worst-case slowdown here > being about 1.54%? The best case, and arguably a more realistic case > above showed a 34% speedup for the best case. I took a bit more time to test the performance on this. I thought I might have been a bit unfair on the patch by giving it completely empty tables to look at. It just seems too unrealistic to have a large number of completely empty partitions. I decided to come up with a more realistic case where there are 140 partitions but we're performing an index scan to find a record that can exist in only 1 of the 140 partitions. create table hp (a int, b int) partition by hash(a); select 'create table hp'||x||' partition of hp for values with (modulus 140, remainder ' || x || ');' from generate_series(0,139)x; create index on hp (b); insert into hp select x,x from generate_series(1, 14) x; analyze hp; select3.sql: select * from hp where b = 1 Master: $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2124.591367 (excluding connections establishing) tps = 2158.754837 (excluding connections establishing) tps = 2146.348465 (excluding connections establishing) tps = 2148.995800 (excluding connections establishing) tps = 2154.223600 (excluding connections establishing) Patched: $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2002.480729 (excluding connections establishing) tps = 1997.272944 (excluding connections establishing) tps = 1992.527079 (excluding connections establishing) tps = 1995.789367 (excluding connections establishing) tps = 2001.195760 (excluding connections establishing) so it turned out it's even slower, and not by a small amount either! Patched is 6.93% slower on average with this case :-( I went back to the drawing board on this and I've added some code that counts the number of times we've seen the table to be oversized and just shrinks the table back down on the 1000th time. 
6.93% / 1000 is not all that much. Of course, not all the extra overhead might be from rebuilding the table, so here's a test with the updated patch. $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2142.414323 (excluding connections establishing) tps = 2139.666899 (excluding connections establishing) tps = 2138.744789 (excluding connections establishing) tps = 2138.300299 (excluding connections establishing) tps = 2137.346278 (excluding connections establishing) Just a 0.34% drop. Pretty hard to pick that out the noise. Testing the original case that shows the speedup: create table ht (a int primary key, b int, c int) partition by hash (a); select 'create table ht' || x::text || ' partition of ht for values with (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,8191) x; \gexec select.sql: \set p 1 select * from ht where a = :p Master: $ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 10172.035036 (excluding connections establishing) tps = 10192.780529 (excluding connections establishing) tps = 10331.306003 (excluding connections establishing) Patched: $ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 15080.765549 (excluding connections establishing) tps = 14994.404069 (excluding connections establishing) tps = 14982.923480 (excluding connections establishing) That seems fine, 46% faster. v6 is attached. I plan to push this in a few days unless someone objects. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services shrink_bloated_locallocktable_v6.patch Description: Binary data
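The v6 idea described above, counting how many transactions have seen the table oversized and only shrinking on the 1000th, can be sketched like this. The names, constants, and structure here are illustrative only, not the actual patch code:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the patch's constants. */
#define SHRINK_THRESHOLD 64
#define OVERSIZED_XACTS_BEFORE_SHRINK 1000

static int times_seen_oversized = 0;

/*
 * Called from end-of-transaction lock release.  Returns true when the
 * (now empty) local lock hash table should be rebuilt at its initial
 * size, which happens only on the 1000th oversized occurrence so the
 * rebuild cost is amortized over many transactions.
 */
static bool
should_shrink(long num_entries, int max_bucket)
{
    if (num_entries != 0 || max_bucket <= SHRINK_THRESHOLD)
        return false;
    if (++times_seen_oversized < OVERSIZED_XACTS_BEFORE_SHRINK)
        return false;
    times_seen_oversized = 0;
    return true;
}
```

The cheap `num_entries != 0` check is evaluated first on every call, which is why David later swapped the order of the two conditions for v7.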
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Thu, 27 Jun 2019 at 12:59, Tsunakawa, Takayuki wrote: > > From: David Rowley [mailto:david.row...@2ndquadrant.com] > Thank you, looks good. I find it ready for committer (I noticed the status > is already set so.) Thanks for looking. I've just been looking at this again and I thought I'd better check the performance of the worst case for the patch, where the hash table is rebuilt each query. To do this I first created a single column 70 partition partitioned table ("p") and left it empty. I then checked the performance of: SELECT * FROM p; Having 70 partitions means that the lock table's max bucket goes over the LOCKMETHODLOCALHASH_SHRINK_THRESHOLD which is set to 64 and results in the table being rebuilt each time the query is run. The performance was as follows: 70 partitions: LOCKMETHODLOCALHASH_SHRINK_THRESHOLD = 64 master + shrink_bloated_locallocktable_v5.patch: ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres tps = 8427.053378 (excluding connections establishing) tps = 8583.251821 (excluding connections establishing) tps = 8569.587268 (excluding connections establishing) tps = 8552.988483 (excluding connections establishing) tps = 8527.735108 (excluding connections establishing) master (93907478): ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres tps = 8712.919411 (excluding connections establishing) tps = 8760.190372 (excluding connections establishing) tps = 8755.069470 (excluding connections establishing) tps = 8747.389735 (excluding connections establishing) tps = 8758.275202 (excluding connections establishing) patched is 2.45% slower If I increase the partition count to 140 and put the LOCKMETHODLOCALHASH_SHRINK_THRESHOLD up to 128, then the performance is as follows: master + shrink_bloated_locallocktable_v5.patch: ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres tps = 2548.917856 (excluding connections establishing) tps = 2561.283564 (excluding connections 
establishing) tps = 2549.669870 (excluding connections establishing) tps = 2421.971864 (excluding connections establishing) tps = 2428.983660 (excluding connections establishing) Master (93907478): ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres tps = 2605.407529 (excluding connections establishing) tps = 2600.691426 (excluding connections establishing) tps = 2594.123983 (excluding connections establishing) tps = 2455.745644 (excluding connections establishing) tps = 2450.061483 (excluding connections establishing) patched is 1.54% slower I'd rather not put the LOCKMETHODLOCALHASH_SHRINK_THRESHOLD up any higher than 128 since it can detract from the improvement we're trying to make with this patch. Now, this case of querying a partitioned table that happens to be completely empty seems a bit unrealistic. Something more realistic might be index scanning all partitions to find a value that only exists in a single partition. Assuming the partitions actually have some records, then that's going to be a more expensive query, so the overhead of rebuilding the table will be less noticeable. A previous version of the patch has already had some heuristics to try to only rebuild the hash table when it's likely beneficial. I'd rather not go exploring in that area again. Is anyone particularly concerned about the worst-case slowdown here being about 1.54%? The best case, and arguably a more realistic case above showed a 34% speedup for the best case. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] v5 is attached. Thank you, looks good. I find it ready for committer (I noticed the status is already set so.) Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 17 Jun 2019 at 15:05, Tsunakawa, Takayuki wrote: > (1) > +#define LOCKMETHODLOCALHASH_SHRINK_SIZE 64 > > How about LOCKMETHODLOCALHASH_SHRINK_THRESHOLD, because this determines the > threshold value to trigger shrinkage? Code in PostgreSQL seems to use the > term threshold. That's probably better. I've renamed it to that. > (2) > +/* Complain if the above are not set to something sane */ > +#if LOCKMETHODLOCALHASH_SHRINK_SIZE < LOCKMETHODLOCALHASH_INIT_SIZE > +#error "invalid LOCKMETHODLOCALHASH_SHRINK_SIZE" > +#endif > > I don't think these are necessary, because these are fixed and not > configurable. FYI, src/include/utils/memutils.h doesn't have #error to test > these macros. Yeah. I was thinking it was overkill when I wrote it, but somehow couldn't bring myself to remove it. Done now. > (3) > + if (hash_get_num_entries(LockMethodLocalHash) == 0 && > + hash_get_max_bucket(LockMethodLocalHash) > > + LOCKMETHODLOCALHASH_SHRINK_SIZE) > + CreateLocalLockHash(); > > I get an impression that Create just creates something where there's nothing. How about Init or Recreate? Renamed to InitLocalLockHash(). v5 is attached. Thank you for the review. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services shrink_bloated_locallocktable_v5.patch Description: Binary data
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > I've revised the patch to add a new constant named > LOCKMETHODLOCALHASH_SHRINK_SIZE. I've set this to 64 for now. Once the hash Thank you, and good performance. The patch passed make check. I'm OK with the current patch, but I have a few comments. Please take them as you see fit (I wouldn't mind if you don't.) (1) +#define LOCKMETHODLOCALHASH_SHRINK_SIZE 64 How about LOCKMETHODLOCALHASH_SHRINK_THRESHOLD, because this determines the threshold value to trigger shrinkage? Code in PostgreSQL seems to use the term threshold. (2) +/* Complain if the above are not set to something sane */ +#if LOCKMETHODLOCALHASH_SHRINK_SIZE < LOCKMETHODLOCALHASH_INIT_SIZE +#error "invalid LOCKMETHODLOCALHASH_SHRINK_SIZE" +#endif I don't think these are necessary, because these are fixed and not configurable. FYI, src/include/utils/memutils.h doesn't have #error to test these macros. #define ALLOCSET_DEFAULT_MINSIZE 0 #define ALLOCSET_DEFAULT_INITSIZE (8 * 1024) #define ALLOCSET_DEFAULT_MAXSIZE (8 * 1024 * 1024) (3) + if (hash_get_num_entries(LockMethodLocalHash) == 0 && + hash_get_max_bucket(LockMethodLocalHash) > + LOCKMETHODLOCALHASH_SHRINK_SIZE) + CreateLocalLockHash(); I get an impression that Create just creates something where there's nothing. How about Init or Recreate? Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 8 Apr 2019 at 04:09, Tom Lane wrote: > Also, I would not define "significantly bloated" as "the table has > grown at all". I think the threshold ought to be at least ~100 > buckets, if we're starting at 16. I've revised the patch to add a new constant named LOCKMETHODLOCALHASH_SHRINK_SIZE. I've set this to 64 for now. Once the hash table grows over that size we shrink it back down to LOCKMETHODLOCALHASH_INIT_SIZE, which I've kept at 16. I'm not opposed to setting it to 128. For this particular benchmark, it won't make any difference as it's only going to affect something that does not quite use 128 locks and has to work with a slightly bloated local lock table. I think hitting 64 locks in a transaction is a good indication that it's not a simple transaction so users are probably unlikely to notice the small slowdown from the hash table reinitialisation. Since quite a bit has changed around partition planning lately, I've taken a fresh set of benchmarks on today's master. I'm using something very close to Amit's benchmark from upthread. I just changed the query so we hit the same partition each time instead of a random one. 
create table ht (a int primary key, b int, c int) partition by hash (a); select 'create table ht' || x::text || ' partition of ht for values with (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,8191) x; \gexec select.sql: \set p 1 select * from ht where a = :p master @ a193cbec119 + shrink_bloated_locallocktable_v4.patch: plan_cache_mode = 'auto'; ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres tps = 14101.226982 (excluding connections establishing) tps = 14034.250962 (excluding connections establishing) tps = 14107.937755 (excluding connections establishing) plan_cache_mode = 'force_custom_plan'; ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres tps = 14240.366770 (excluding connections establishing) tps = 14272.244886 (excluding connections establishing) tps = 14130.684315 (excluding connections establishing) master @ a193cbec119: plan_cache_mode = 'auto'; ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres tps = 10467.027666 (excluding connections establishing) tps = 10333.700917 (excluding connections establishing) tps = 10633.084426 (excluding connections establishing) plan_cache_mode = 'force_custom_plan'; ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres tps = 13938.083272 (excluding connections establishing) tps = 14143.241802 (excluding connections establishing) tps = 14097.406758 (excluding connections establishing) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services shrink_bloated_locallocktable_v4.patch Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 2019-04-08 05:46, Tsunakawa, Takayuki wrote: > I'm registering you as another author and me as a reviewer, and marking this > ready for committer. Moved to next commit fest. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > It would be good to get your view on the > shrink_bloated_locallocktable_v3.patch I worked on last night. I was > unable to measure any overhead to solving the problem that way. Thanks, it looks super simple and good. I understand the idea behind your patch to be: * Transactions that touch many partitions and/or tables are a special event and not the norm, and the hash table bloat is an unlucky accident. So it's reasonable to revert the bloated hash table back to the original size. * Repeated transactions that acquire many locks have to enlarge the hash table every time. However, the overhead of hash table expansion should be hidden behind the various other processing (acquiring/releasing locks, reading/writing the relations, accessing the catalogs of those relations). TBH, I think the linked list approach feels more intuitive because the resulting code looks like what it wants to do (efficiently iterate over acquired locks) and is based on the classic book. But your approach seems to put people's concerns to rest. So I'm OK with your patch. I'm registering you as another author and me as a reviewer, and marking this ready for committer. Regards Takayuki Tsunakawa
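For comparison, the "linked list of LOCALLOCK" idea mentioned above boils down to chaining every acquired local lock together so that releasing all locks walks only the locks actually held, not every bucket of a possibly bloated hash table. A minimal standalone sketch (the types and names here are invented for illustration, not PostgreSQL's):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for a LOCALLOCK entry, with intrusive list links. */
typedef struct LocalLock
{
    int lockid;
    struct LocalLock *prev;
    struct LocalLock *next;
} LocalLock;

static LocalLock *held_locks = NULL;    /* head of the chain of held locks */

/* Push a newly acquired lock onto the chain (called at lock acquisition). */
static void
remember_lock(LocalLock *l)
{
    l->prev = NULL;
    l->next = held_locks;
    if (held_locks)
        held_locks->prev = l;
    held_locks = l;
}

/*
 * Release everything: walk the chain, not the hash table.  Real code would
 * release the lock here; we just unlink and count.  The cost is proportional
 * to the number of locks held, regardless of how large the table has grown.
 */
static int
release_all(void)
{
    int n = 0;
    while (held_locks)
    {
        held_locks = held_locks->next;
        n++;
    }
    return n;
}
```

This is the property the earlier rejected versions aimed for; the cost measured in this thread was the bookkeeping added to every lock acquisition.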
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 8 Apr 2019 at 14:54, Tsunakawa, Takayuki wrote: > > From: 'Andres Freund' [mailto:and...@anarazel.de] > > Did you see that people measured slowdowns? > > Yeah, 0.5% decrease with pgbench -M prepared -S (select-only), which feels > like a somewhat extreme test case. And that might be within noise as was > mentioned. > > If we want to remove even the noise, we may have to think of removing the > LocalLockHash completely. But it doesn't seem feasible... It would be good to get your view on the shrink_bloated_locallocktable_v3.patch I worked on last night. I was unable to measure any overhead to solving the problem that way. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: 'Andres Freund' [mailto:and...@anarazel.de] > On 2019-04-08 02:28:12 +, Tsunakawa, Takayuki wrote: > > I think the linked list of LOCALLOCK approach is natural, simple, and > > good. > > Did you see that people measured slowdowns? Yeah, 0.5% decrease with pgbench -M prepared -S (select-only), which feels like a somewhat extreme test case. And that might be within noise as was mentioned. If we want to remove even the noise, we may have to think of removing the LocalLockHash completely. But it doesn't seem feasible... Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi,

On 2019-04-08 02:28:12 +0000, Tsunakawa, Takayuki wrote:
> I think the linked list of LOCALLOCK approach is natural, simple, and
> good.

Did you see that people measured slowdowns?

Greetings, Andres Freund
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> On the whole I don't think there's an adequate case for committing
> this patch.

From: Andres Freund [mailto:and...@anarazel.de]
> On 2019-04-05 23:03:11 -0400, Tom Lane wrote:
> > If I reduce the number of partitions in Amit's example from 8192
> > to something more real-world, like 128, I do still measure a
> > performance gain, but it's ~ 1.5% which is below what I'd consider
> > a reproducible win. I'm accustomed to seeing changes up to 2%
> > in narrow benchmarks like this one, even when "nothing changes"
> > except unrelated code.
>
> I'm not sure it's actually that narrow these days. With all the
> partitioning improvements happening, the numbers of locks commonly held
> are going to rise. And while 8192 partitions is maybe on the more
> extreme side, it's a workload with only a single table, and plenty of
> workloads touch more than a single partitioned table.

I would feel happy if I could say such a many-partitions use case is narrow or impractical and ignore it, but it's not narrow. Two of our customers are actually requesting such usage: one uses 5,500 partitions and is trying to migrate from a commercial database on Linux, and the other requires 200,000 partitions to migrate from a legacy database on a mainframe. At first, I thought that many partitions indicated a bad application design, but their reasoning sounded valid (or at least I can't insist that it's bad). PostgreSQL is now expected to handle such huge workloads.

From: Andres Freund [mailto:and...@anarazel.de]
> I'm not sure I'm quite that concerned. For one, a good bit of that space
> was up for grabs until the recent reordering of LOCALLOCK and nobody
> complained. But more importantly, I think commonly the amount of locks
> around is fairly constrained, isn't it? We can't really have that many
> concurrently held locks, due to the shared memory space, and the size of
> a LOCALLOCK isn't that big compared to say relcache entries.
> We also probably fairly easily could win some space back - e.g. make
> nLocks 32 bits.

+1

From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> I'd also point out that this is hardly the only place where we've
> seen hash_seq_search on nearly-empty hash tables become a bottleneck.
> So I'm not thrilled about attacking that with one-table-at-time patches.
> I'd rather see us do something to let hash_seq_search win across
> the board.
>
> I spent some time wondering whether we could adjust the data structure
> so that all the live entries in a hashtable are linked into one chain,
> but I don't quite see how to do it without adding another list link to
> struct HASHELEMENT, which seems pretty expensive.

I think the linked list of LOCALLOCK approach is natural, simple, and good. In Jim Gray's classic book "Transaction Processing: Concepts and Techniques", we can find the following sentence in "8.4.5 Lock Manager Internal Logic." The sample implementation code in the book uses a similar linked list to remember and release a transaction's acquired locks.

"All the locks of a transaction are kept in a list so they can be quickly found and released at commit or rollback."

And handling this issue with the LOCALLOCK linked list is more natural than resizing the hash table. We just want to quickly find all grabbed locks, so we use a linked list. A hash table is a mechanism for finding a particular item quickly, so it was simply the wrong structure for iterating over all grabbed locks. Also, the hash table got big because some operation in the session needed it, and subsequent operations in the same session may need it again, so shrinking the hash table wouldn't necessarily save us.

Regards
Takayuki Tsunakawa
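Gray's per-transaction lock list can be sketched in self-contained C. This is an illustration of the data-structure idea only — the names here (LocalLock, acquire_lock, held_locks, etc.) are invented for the sketch and are not PostgreSQL's LOCALLOCK code:

```c
#include <stdlib.h>

/* Hypothetical stand-in for a local lock entry; not PostgreSQL's struct. */
typedef struct LocalLock
{
    int               lock_id;   /* identifies the locked object */
    struct LocalLock *prev;      /* links into the transaction's lock list */
    struct LocalLock *next;
} LocalLock;

/* Head of the doubly-linked list of locks held by the current transaction. */
static LocalLock *held_locks = NULL;

/* Remember a newly acquired lock by pushing it onto the list. */
static LocalLock *
acquire_lock(int lock_id)
{
    LocalLock *l = malloc(sizeof(LocalLock));

    l->lock_id = lock_id;
    l->prev = NULL;
    l->next = held_locks;
    if (held_locks)
        held_locks->prev = l;
    held_locks = l;
    return l;
}

/* Retail release: unlink one lock in O(1), no table scan needed. */
static void
release_lock(LocalLock *l)
{
    if (l->prev)
        l->prev->next = l->next;
    else
        held_locks = l->next;
    if (l->next)
        l->next->prev = l->prev;
    free(l);
}

/* At commit/abort: walk only the locks actually held; returns the count. */
static int
release_all_locks(void)
{
    int n = 0;

    while (held_locks)
    {
        release_lock(held_locks);
        n++;
    }
    return n;
}
```

Retail release is O(1) unlinking, and commit/abort walks only the locks actually held, never an oversized bucket array.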
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 8 Apr 2019 at 04:09, Tom Lane wrote:
> Um ... I don't see where you're destroying the old hash?

In CreateLocalLockHash.

> Also, I entirely dislike wiring in assumptions about hash_seq_search's
> private state structure here. I think it's worth having an explicit
> entry point in dynahash.c to get the current number of buckets.

Okay. Added hash_get_max_bucket().

> Also, I would not define "significantly bloated" as "the table has
> grown at all". I think the threshold ought to be at least ~100
> buckets, if we're starting at 16.

I wouldn't either. I don't think the comment says that. It says there can be slowdowns when it's significantly bloated, and then goes on to say that we just resize when it's bigger than standard.

> Probably we ought to try to gather some evidence to inform the
> choice of cutoff here. Maybe instrument the regression tests to
> see how big the table typically gets?

In partition_prune.sql I see use of a bucket as high as 285 on my machine with:

drop table lp, coll_pruning, rlp, mc3p, mc2p, boolpart, rp, coll_pruning_multi, like_op_noprune, lparted_by_int2, rparted_by_int2;

I've not added any sort of cut-off, though, as I benchmarked it and, surprisingly, I don't see any slowdown in the worst case. So I'm thinking there might not be any point.
alter system set plan_cache_mode = 'force_generic_plan';

create table hp (a int primary key) partition by hash (a);
select 'create table hp' || x::text || ' partition of hp for values with (modulus 32, remainder ' || (x)::text || ');' from generate_series(0,31) x;
\gexec

select.sql:
\set p 1
select * from hp where a = :p

Master:
$ pgbench -n -M prepared -f select.sql -T 60 postgres
tps = 11834.764309 (excluding connections establishing)
tps = 12279.212223 (excluding connections establishing)
tps = 12007.263547 (excluding connections establishing)

Patched:
$ pgbench -n -M prepared -f select.sql -T 60 postgres
tps = 13380.684817 (excluding connections establishing)
tps = 12790.999279 (excluding connections establishing)
tps = 12568.892558 (excluding connections establishing)

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

shrink_bloated_locallocktable_v3.patch Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
David Rowley writes: > Okay. Here's another version with all the average locks code removed > that only recreates the table when it's completely empty. Um ... I don't see where you're destroying the old hash? Also, I entirely dislike wiring in assumptions about hash_seq_search's private state structure here. I think it's worth having an explicit entry point in dynahash.c to get the current number of buckets. Also, I would not define "significantly bloated" as "the table has grown at all". I think the threshold ought to be at least ~100 buckets, if we're starting at 16. Probably we ought to try to gather some evidence to inform the choice of cutoff here. Maybe instrument the regression tests to see how big the table typically gets? regards, tom lane
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 8 Apr 2019 at 03:47, Andres Freund wrote: > Could you benchmark your adversarial case? Which case? I imagine the worst case for v2 is a query that just constantly asks for over 16 locks. Perhaps a prepared plan, so not to add planner overhead. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi, On 2019-04-08 03:40:52 +1200, David Rowley wrote: > On Mon, 8 Apr 2019 at 03:20, Tom Lane wrote: > > > > David Rowley writes: > > > The reason I thought it was a good idea to track some history there > > > was to stop the lock table constantly being shrunk back to the default > > > size every time a simple single table query was executed. > > > > I think that's probably gilding the lily, considering that this whole > > issue is pretty new. There's no evidence that expanding the local > > lock table is a significant drag on queries that need a lot of locks. > > Okay. Here's another version with all the average locks code removed > that only recreates the table when it's completely empty. Could you benchmark your adversarial case? - Andres
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 8 Apr 2019 at 03:20, Tom Lane wrote: > > David Rowley writes: > > The reason I thought it was a good idea to track some history there > > was to stop the lock table constantly being shrunk back to the default > > size every time a simple single table query was executed. > > I think that's probably gilding the lily, considering that this whole > issue is pretty new. There's no evidence that expanding the local > lock table is a significant drag on queries that need a lot of locks. Okay. Here's another version with all the average locks code removed that only recreates the table when it's completely empty. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services shrink_bloated_locallocktable_v2.patch Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
David Rowley writes: > On Mon, 8 Apr 2019 at 02:59, Tom Lane wrote: >> We *should* be using hash_get_num_entries(), but only to verify >> that the table is empty before resetting it. The additional bit >> that is needed is to see whether the number of buckets is large >> enough to justify calling the table bloated. > The reason I thought it was a good idea to track some history there > was to stop the lock table constantly being shrunk back to the default > size every time a simple single table query was executed. I think that's probably gilding the lily, considering that this whole issue is pretty new. There's no evidence that expanding the local lock table is a significant drag on queries that need a lot of locks. regards, tom lane
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 8 Apr 2019 at 02:59, Tom Lane wrote:
> David Rowley writes:
> > hash_get_num_entries() looks cheap enough to me. Can you explain why
> > you think that's too expensive?
>
> What I objected to cost-wise was counting the number of lock
> acquisitions/releases, which seems entirely beside the point.
>
> We *should* be using hash_get_num_entries(), but only to verify
> that the table is empty before resetting it. The additional bit
> that is needed is to see whether the number of buckets is large
> enough to justify calling the table bloated.

The reason I thought it was a good idea to track some history there was to stop the lock table constantly being shrunk back to the default size every time a simple single-table query was executed. For example, a workload repeatedly doing:

SELECT * FROM table_with_lots_of_partitions;
SELECT * FROM non_partitioned_table;

I was worried that obtaining locks on the partitioned table would become a little slower because it would have to expand the hash table each time the query is executed.

> > As cheap as possible sounds good, but I'm confused at why you think
> > the table will always be empty at the end of transaction.
>
> It's conceivable that it won't be, which is why we need a test.
> I'm simply arguing that if it is not, we can just postpone de-bloating
> till it is. Session-level locks are so rarely used that there's no
> need to sweat about that case.

That seems fair. It would certainly simplify the patch.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
David Rowley writes: > On Mon, 8 Apr 2019 at 02:20, Tom Lane wrote: >> I like the concept ... but the particular implementation, not so much. >> It seems way overcomplicated. In the first place, why should we >> add code to copy entries? Just don't do it except when the table >> is empty. In the second, I think we could probably have a far >> cheaper test for how big the table is --- maybe we'd need to >> expose some function in dynahash.c, but the right way here is just >> to see how many buckets there are. I don't like adding statistics >> counting for this, because it's got basically nothing to do with >> what the actual problem is. (If you acquire and release one lock, >> and do that over and over, you don't have a bloat problem no >> matter how many times you do it.) > hash_get_num_entries() looks cheap enough to me. Can you explain why > you think that's too expensive? What I objected to cost-wise was counting the number of lock acquisitions/releases, which seems entirely beside the point. We *should* be using hash_get_num_entries(), but only to verify that the table is empty before resetting it. The additional bit that is needed is to see whether the number of buckets is large enough to justify calling the table bloated. > As cheap as possible sounds good, but I'm confused at why you think > the table will always be empty at the end of transaction. It's conceivable that it won't be, which is why we need a test. I'm simply arguing that if it is not, we can just postpone de-bloating till it is. Session-level locks are so rarely used that there's no need to sweat about that case. regards, tom lane
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 8 Apr 2019 at 02:36, David Rowley wrote:
> > LockMethodLocalHash is special in that it predictably goes to empty
> > at the end of every transaction, so that de-bloating at that point
> > is a workable strategy. I think we'd probably need something more
> > robust if we were trying to fix this generally for all hash tables.
> > But if we're going to go with the one-off hack approach, we should
> > certainly try to keep that hack as simple as possible.
>
> As cheap as possible sounds good, but I'm confused at why you think
> the table will always be empty at the end of transaction. It's my
> understanding and I see from debugging that session level locks remain
> in there. If I don't copy those into the new table they'll be lost.

Or we could just skip the table recreation if any session-level locks remain. That would require calling hash_get_num_entries() on the table again and recreating the table only if there are 0 locks.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 8 Apr 2019 at 02:20, Tom Lane wrote: > I like the concept ... but the particular implementation, not so much. > It seems way overcomplicated. In the first place, why should we > add code to copy entries? Just don't do it except when the table > is empty. In the second, I think we could probably have a far > cheaper test for how big the table is --- maybe we'd need to > expose some function in dynahash.c, but the right way here is just > to see how many buckets there are. I don't like adding statistics > counting for this, because it's got basically nothing to do with > what the actual problem is. (If you acquire and release one lock, > and do that over and over, you don't have a bloat problem no > matter how many times you do it.) hash_get_num_entries() looks cheap enough to me. Can you explain why you think that's too expensive? > LockMethodLocalHash is special in that it predictably goes to empty > at the end of every transaction, so that de-bloating at that point > is a workable strategy. I think we'd probably need something more > robust if we were trying to fix this generally for all hash tables. > But if we're going to go with the one-off hack approach, we should > certainly try to keep that hack as simple as possible. As cheap as possible sounds good, but I'm confused at why you think the table will always be empty at the end of transaction. It's my understanding and I see from debugging that session level locks remain in there. If I don't copy those into the new table they'll be lost. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
David Rowley writes: > On Sat, 6 Apr 2019 at 16:03, Tom Lane wrote: >> My own thought about how to improve this situation was just to destroy >> and recreate LockMethodLocalHash at transaction end (or start) >> if its size exceeded $some-value. Leaving it permanently bloated seems >> like possibly a bad idea, even if we get rid of all the hash_seq_searches >> on it. > Which I thought was an okay idea. I think the one advantage that > would have over making hash_seq_search() faster for large and mostly > empty tables is that over-sized hash tables are just not very cache > efficient, and if we don't need it to be that large then we should > probably consider making it smaller again. > I've had a go at implementing this and using Amit's benchmark the > performance looks pretty good. I can't detect any slowdown for the > general case. I like the concept ... but the particular implementation, not so much. It seems way overcomplicated. In the first place, why should we add code to copy entries? Just don't do it except when the table is empty. In the second, I think we could probably have a far cheaper test for how big the table is --- maybe we'd need to expose some function in dynahash.c, but the right way here is just to see how many buckets there are. I don't like adding statistics counting for this, because it's got basically nothing to do with what the actual problem is. (If you acquire and release one lock, and do that over and over, you don't have a bloat problem no matter how many times you do it.) LockMethodLocalHash is special in that it predictably goes to empty at the end of every transaction, so that de-bloating at that point is a workable strategy. I think we'd probably need something more robust if we were trying to fix this generally for all hash tables. But if we're going to go with the one-off hack approach, we should certainly try to keep that hack as simple as possible. regards, tom lane
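The strategy Tom describes — de-bloat only when the table has predictably gone empty at transaction end — can be modeled with a self-contained toy that tracks just the sizing state. The fixed doubling rule and all names here are simplifications for illustration, not dynahash's actual code:

```c
#include <stdlib.h>

#define INIT_BUCKETS 16

/* Hypothetical local-lock table; tracks only the sizing state we care about. */
typedef struct LockTable
{
    long nbuckets;   /* current bucket count; grows by doubling */
    long nentries;   /* live entries */
} LockTable;

static LockTable *
create_table(void)
{
    LockTable *t = malloc(sizeof(LockTable));

    t->nbuckets = INIT_BUCKETS;
    t->nentries = 0;
    return t;
}

/* Double the bucket count when the fill factor is exceeded. */
static void
insert_entry(LockTable *t)
{
    if (++t->nentries > t->nbuckets)
        t->nbuckets *= 2;
}

static void
remove_entry(LockTable *t)
{
    t->nentries--;          /* the bucket array is never shrunk in place */
}

/*
 * At transaction end: if the table is empty but has grown past its initial
 * size, throw it away and start over at the default size.  If any entries
 * remain (e.g. session-level locks), just postpone de-bloating.
 */
static LockTable *
debloat_if_empty(LockTable *t)
{
    if (t->nentries == 0 && t->nbuckets > INIT_BUCKETS)
    {
        free(t);
        return create_table();
    }
    return t;
}
```

A cutoff such as Tom's suggested ~100 buckets would simply replace the `> INIT_BUCKETS` test in `debloat_if_empty`.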
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Sat, 6 Apr 2019 at 16:03, Tom Lane wrote: > I'd also point out that this is hardly the only place where we've > seen hash_seq_search on nearly-empty hash tables become a bottleneck. > So I'm not thrilled about attacking that with one-table-at-time patches. > I'd rather see us do something to let hash_seq_search win across > the board. Rewinding back to mid-Feb: You wrote: > My own thought about how to improve this situation was just to destroy > and recreate LockMethodLocalHash at transaction end (or start) > if its size exceeded $some-value. Leaving it permanently bloated seems > like possibly a bad idea, even if we get rid of all the hash_seq_searches > on it. Which I thought was an okay idea. I think the one advantage that would have over making hash_seq_search() faster for large and mostly empty tables is that over-sized hash tables are just not very cache efficient, and if we don't need it to be that large then we should probably consider making it smaller again. I've had a go at implementing this and using Amit's benchmark the performance looks pretty good. I can't detect any slowdown for the general case. 
master:

plan_cache_mode = auto:
$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 9373.698212 (excluding connections establishing)
tps = 9356.993148 (excluding connections establishing)
tps = 9367.579806 (excluding connections establishing)

plan_cache_mode = force_custom_plan:
$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 12863.758185 (excluding connections establishing)
tps = 12787.766054 (excluding connections establishing)
tps = 12817.878940 (excluding connections establishing)

shrink_bloated_locallocktable.patch:

plan_cache_mode = auto:
$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 12756.021211 (excluding connections establishing)
tps = 12800.939518 (excluding connections establishing)
tps = 12804.501977 (excluding connections establishing)

plan_cache_mode = force_custom_plan:
$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 12763.448836 (excluding connections establishing)
tps = 12901.673271 (excluding connections establishing)
tps = 12856.512745 (excluding connections establishing)

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

shrink_bloated_locallocktable.patch Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 2019-04-06 05:03, Tom Lane wrote: > Trying a standard pgbench test case (pgbench -M prepared -S with > one client and an -s 10 database), it seems that the patch is about > 0.5% slower than HEAD. Again, that's below the noise threshold, > but it's not promising for the net effects of this patch on workloads > that aren't specifically about large and prunable partition sets. In my testing, I've also noticed that it seems to be slightly on the slower side for these simple tests. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Andres Freund writes:
> I wonder if one approach to solve this wouldn't be to just make the
> hashtable drastically smaller. Right now we'll often have lots of
> empty entries that are 72 bytes + dynahash overhead. That's plenty of
> memory that needs to be skipped over. I wonder if we instead should
> have an array of held locallocks, and a hashtable with {hashcode,
> offset_in_array} + custom comparator for lookups.

Well, that's not going to work all that well for retail lock releases; you'll end up with holes in the array, maybe a lot of them.

However, it led me to think of another way we might approach the general hashtable problem: right now, we are not making any use of the fact that the hashtable's entries are laid out in big slabs (arrays). What if we tried to ensure that the live entries are allocated fairly compactly in those arrays, and then implemented hash_seq_search as a scan over the arrays, ignoring the hash bucket structure per se? We'd need a way to reliably tell a live entry from a free entry, but again, there's plenty of space for a flag bit or two.

This might perform poorly if you allocated a bunch of entries, freed most-but-not-all, and then wanted to seqscan the remainder; you'd end up with the same problem I complained of above, that you're iterating over an array that's mostly uninteresting. In principle we could keep count of the live vs. free entries and dynamically decide to scan via the hash bucket structure instead of searching the storage array when the array is too sparse; but that might be overly complicated.

I haven't tried to work this out in detail; it's just a late-night brainstorm. But, again, I'd much rather solve this in dynahash.c than by layering some kind of hack on top of it.

regards, tom lane
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi,

On 2019-04-05 23:03:11 -0400, Tom Lane wrote:
> Peter Eisentraut writes:
> > I can't detect any performance improvement with the patch applied to
> > current master, using the test case from Yoshikazu Imai (2019-03-19).
>
> FWIW, I tried this patch against current HEAD (959d00e9d).
> Using the test case described by Amit at
>
> I do measure an undeniable speedup, close to 35%.
>
> However ... I submit that that's a mighty extreme test case.
> (I had to increase max_locks_per_transaction to get it to run
> at all.) We should not be using that sort of edge case to drive
> performance optimization choices.
>
> If I reduce the number of partitions in Amit's example from 8192
> to something more real-world, like 128, I do still measure a
> performance gain, but it's ~ 1.5% which is below what I'd consider
> a reproducible win. I'm accustomed to seeing changes up to 2%
> in narrow benchmarks like this one, even when "nothing changes"
> except unrelated code.

I'm not sure it's actually that narrow these days. With all the partitioning improvements happening, the numbers of locks commonly held are going to rise. And while 8192 partitions is maybe on the more extreme side, it's a workload with only a single table, and plenty of workloads touch more than a single partitioned table.

> Trying a standard pgbench test case (pgbench -M prepared -S with
> one client and an -s 10 database), it seems that the patch is about
> 0.5% slower than HEAD. Again, that's below the noise threshold,
> but it's not promising for the net effects of this patch on workloads
> that aren't specifically about large and prunable partition sets.

Yea, that's concerning.

> I'm also fairly concerned about the effects of the patch on
> sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88
> bytes, a 22% increase. That's a lot if you're considering cases
> with many locks.

I'm not sure I'm quite that concerned.
For one, a good bit of that space was up for grabs until the recent reordering of LOCALLOCK and nobody complained. But more importantly, I think commonly the number of locks around is fairly constrained, isn't it? We can't really have that many concurrently held locks, due to the shared memory space, and the size of a LOCALLOCK isn't that big compared to, say, relcache entries. We also probably fairly easily could win some space back - e.g. make nLocks 32 bits.

I wonder if one approach to solve this wouldn't be to just make the hashtable drastically smaller. Right now we'll often have lots of empty entries that are 72 bytes + dynahash overhead. That's plenty of memory that needs to be skipped over. I wonder if we instead should have an array of held locallocks, and a hashtable with {hashcode, offset_in_array} + custom comparator for lookups. That'd mean we could either scan the array of locallocks at release (which'd need to skip over entries that have already been released), or we could scan the much smaller hashtable sequentially.

I don't think the above idea is quite there, and I'm tired, but I thought it might still be worth bringing up.

> I spent some time wondering whether we could adjust the data structure
> so that all the live entries in a hashtable are linked into one chain,
> but I don't quite see how to do it without adding another list link to
> struct HASHELEMENT, which seems pretty expensive.

Yea :(

Greetings, Andres Freund
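Andres's array-of-locallocks idea — a dense array walked at release time, plus a small {hashcode, offset} index for lookups — might look roughly like this as standalone C. All names and sizes are hypothetical; a real LOCALLOCK is much larger, and retail release (which, as Tom points out below, would leave holes in the array) is omitted:

```c
#include <string.h>

#define MAX_LOCKS  64            /* sketch assumes fewer locks than this */
#define HASH_SLOTS 128           /* power of two, larger than MAX_LOCKS */

/* Dense array of held locks: release-all just walks locks[0..nlocks-1]. */
static int locks[MAX_LOCKS];     /* lock ids; a real entry would be bigger */
static int nlocks = 0;

/* Small index: hash slot -> offset into locks[], or -1 if empty. */
static int slot[HASH_SLOTS];

static unsigned
hash_id(int id)
{
    return (unsigned) id * 2654435761u;     /* Fibonacci hashing */
}

/* Called at transaction start / after release-all. */
static void
reset_index(void)
{
    memset(slot, -1, sizeof(slot));         /* 0xFF bytes == -1 as int */
    nlocks = 0;
}

static void
acquire(int id)
{
    unsigned h = hash_id(id) & (HASH_SLOTS - 1);

    while (slot[h] != -1)                   /* linear probing */
        h = (h + 1) & (HASH_SLOTS - 1);
    locks[nlocks] = id;
    slot[h] = nlocks++;
}

/* Returns the offset of the lock in locks[], or -1 if not held. */
static int
lookup(int id)
{
    unsigned h = hash_id(id) & (HASH_SLOTS - 1);

    while (slot[h] != -1)
    {
        if (locks[slot[h]] == id)
            return slot[h];
        h = (h + 1) & (HASH_SLOTS - 1);
    }
    return -1;
}
```

The point of the layout is that the per-entry payload lives in a compact array (cheap to scan at commit), while the probe table holds only small {hash, offset} pairs.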
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Peter Eisentraut writes:
> I can't detect any performance improvement with the patch applied to
> current master, using the test case from Yoshikazu Imai (2019-03-19).

FWIW, I tried this patch against current HEAD (959d00e9d). Using the test case described by Amit at

I do measure an undeniable speedup, close to 35%.

However ... I submit that that's a mighty extreme test case. (I had to increase max_locks_per_transaction to get it to run at all.) We should not be using that sort of edge case to drive performance optimization choices.

If I reduce the number of partitions in Amit's example from 8192 to something more real-world, like 128, I do still measure a performance gain, but it's ~ 1.5%, which is below what I'd consider a reproducible win. I'm accustomed to seeing changes up to 2% in narrow benchmarks like this one, even when "nothing changes" except unrelated code.

Trying a standard pgbench test case (pgbench -M prepared -S with one client and an -s 10 database), it seems that the patch is about 0.5% slower than HEAD. Again, that's below the noise threshold, but it's not promising for the net effects of this patch on workloads that aren't specifically about large and prunable partition sets.

I'm also fairly concerned about the effects of the patch on sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88 bytes, a 22% increase. That's a lot if you're considering cases with many locks.

On the whole I don't think there's an adequate case for committing this patch.

I'd also point out that this is hardly the only place where we've seen hash_seq_search on nearly-empty hash tables become a bottleneck. So I'm not thrilled about attacking that with one-table-at-time patches. I'd rather see us do something to let hash_seq_search win across the board.

I spent some time wondering whether we could adjust the data structure so that all the live entries in a hashtable are linked into one chain, but I don't quite see how to do it without adding another list link to struct HASHELEMENT, which seems pretty expensive. I'll sketch the idea I had, just in case it triggers a better idea in someone else.

Assuming we are willing to add another pointer to HASHELEMENT, use the two pointers to create a doubly-linked circular list that includes all live entries in the hashtable, with a list header in the hashtable's control area. (Maybe we'd use a dlist for this, but it's not essential.) Require this list to be organized so that all entries that belong to the same hash bucket are consecutive in the list, and make each non-null hash bucket header point to the first entry in the list for its bucket. To allow normal searches to detect when they've run through their bucket, add a flag to HASHELEMENT that is set only in entries that are the first, or perhaps last, of their bucket (so you'd detect end-of-bucket by checking the flag instead of testing for a null pointer). Adding a bool field is free due to alignment considerations, at least on 64-bit machines.

Given this, I think normal hash operations are more-or-less the same cost as before, while hash_seq_search just has to follow the circular list.

I tried to figure out how to do the same thing with a singly-linked instead of doubly-linked list, but it doesn't quite work: if you need to remove the first element of a bucket, you have no cheap way to find its predecessor in the overall list (which belongs to some other bucket, but you don't know which one). Maybe we could just mark such entries dead (there's plenty of room for another flag bit) and plan to clean them up later? But it's not clear how to ensure that they'd get cleaned up in any sort of timely fashion.

Another issue is that probably none of this works for the partitioned hash tables we use for some of the larger shared-memory hashes. But I'm not sure we care about hash_seq_search for those, so maybe we just say those are a different data structure.

regards, tom lane
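The payoff of threading live entries into one chain, as sketched above, can be demonstrated with a stripped-down table. This toy keeps only the circular all-entries list alongside conventional per-bucket chains; it omits the consecutive-per-bucket layout and the first-of-bucket flag, but it shows why a seq scan then costs O(live entries) rather than O(buckets). All names are invented, not dynahash's:

```c
#include <stdlib.h>

#define NBUCKETS 1024            /* deliberately oversized, as in the thread */

typedef struct Elem
{
    int          key;
    struct Elem *chain;          /* per-bucket chain, as exists today */
    struct Elem *all_prev;       /* proposed: circular list of live entries */
    struct Elem *all_next;
} Elem;

static Elem *buckets[NBUCKETS];
static Elem  all_head;           /* list header in the table's control area */

static void
table_init(void)
{
    all_head.all_prev = all_head.all_next = &all_head;
}

static void
table_insert(int key)
{
    Elem *e = calloc(1, sizeof(Elem));
    int   b = (unsigned) key % NBUCKETS;

    e->key = key;
    e->chain = buckets[b];       /* push onto the bucket chain */
    buckets[b] = e;
    /* also link into the circular all-entries list */
    e->all_next = all_head.all_next;
    e->all_prev = &all_head;
    all_head.all_next->all_prev = e;
    all_head.all_next = e;
}

/* Normal point lookup still walks only its bucket chain. */
static int
bucket_lookup(int key)
{
    Elem *e;

    for (e = buckets[(unsigned) key % NBUCKETS]; e; e = e->chain)
        if (e->key == key)
            return 1;
    return 0;
}

/* Seq scan touches only live entries, not all NBUCKETS bucket headers. */
static int
seq_count(void)
{
    Elem *e;
    int   n = 0;

    for (e = all_head.all_next; e != &all_head; e = e->all_next)
        n++;
    return n;
}
```

With a handful of live entries in 1024 buckets, `seq_count` visits just those entries, which is exactly the nearly-empty-table case the thread is worried about.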
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 2019-03-19 10:21, Tsunakawa, Takayuki wrote: > From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com] >> Fixed. > > Rebased on HEAD. Do you need the dlist_foreach_modify() calls? You are not actually modifying the list structure. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
On Fri, Apr 5, 2019 at 0:05 AM, Tsunakawa, Takayuki wrote: > From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com] > > I can't detect any performance improvement with the patch applied to > > current master, using the test case from Yoshikazu Imai (2019-03-19). > > That's strange... Peter, Imai-san, can you compare your test procedures? Just for make sure, I described my test procedures in detail. I install and setup HEAD and patched as follows. [HEAD(413ccaa)] (git pull) ./configure --prefix=/usr/local/pgsql-dev --enable-depend make clean make make install su postgres export PATH=/usr/local/pgsql-dev/bin:$PATH initdb -D /var/lib/pgsql/data-dev vi /var/lib/pgsql/data-dev/postgresql.conf port = 44201 plan_cache_mode = 'auto' or 'force_custom_plan' max_parallel_workers = 0 max_parallel_workers_per_gather = 0 max_locks_per_transaction = 4096 pg_ctl -D /var/lib/pgsql/data-dev start [HEAD(413ccaa) + patch] (git pull) patch -u -p1 < 0002.patch ./configure --prefix=/usr/local/pgsql-locallock --enable-depend make clean make make install su postgres export PATH=/usr/local/pgsql-locallock/bin:$PATH initdb -D /var/lib/pgsql/data-locallock vi /var/lib/pgsql/data-locallock/postgresql.conf port = 44301 plan_cache_mode = 'auto' or 'force_custom_plan' max_parallel_workers = 0 max_parallel_workers_per_gather = 0 max_locks_per_transaction = 4096 pg_ctl -D /var/lib/pgsql/data-locallock start And I tested as follows. 
(creating partitioned table for port 44201)
(creating partitioned table for port 44301)
(creating select4096.sql)

for i in `seq 1 5`; do
  pgbench -n -f select4096.sql -T 60 -M prepared -p 44201 | grep including;
  pgbench -n -f select4096.sql -T 60 -M prepared -p 44301 | grep including;
done

tps = 8146.039546 (including connections establishing)
tps = 9021.340872 (including connections establishing)
tps = 8011.186017 (including connections establishing)
tps = 8926.191054 (including connections establishing)
tps = 8006.769690 (including connections establishing)
tps = 9028.716806 (including connections establishing)
tps = 8057.709961 (including connections establishing)
tps = 9017.713714 (including connections establishing)
tps = 7956.332863 (including connections establishing)
tps = 9126.650533 (including connections establishing)

Thanks
--
Yoshikazu Imai
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Amit-san, Imai-san, From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp] > I was able to detect it as follows. > plan_cache_mode = auto > HEAD: 1915 tps > Patched: 2394 tps > > plan_cache_mode = custom (non-problematic: generic plan is never created) > HEAD: 2402 tps > Patched: 2393 tps Thanks a lot for the very quick confirmation. I'm relieved to still see the good results. Regards Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
On Fri, Apr 5, 2019 at 1:31 AM, Amit Langote wrote: > On 2019/04/05 5:42, Peter Eisentraut wrote: > > On 2019-04-04 06:58, Amit Langote wrote: > >> Also, since the "speed up partition planning" patch went in > >> (428b260f8), it might be possible to see the performance boost even > >> with the partitioning example you cited upthread. > > > > I can't detect any performance improvement with the patch applied to > > current master, using the test case from Yoshikazu Imai (2019-03-19). > > I was able to detect it as follows. > > * partitioned table setup: > > $ cat ht.sql > drop table ht cascade; > create table ht (a int primary key, b int, c int) partition by hash (a); > select 'create table ht' || x::text || ' partition of ht for values with > (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0, > 8191) x; > \gexec > > * pgbench script: > > $ cat select.sql > \set param random(1, 8192) > select * from ht where a = :param > > * pgbench (5 minute run with -M prepared) > > pgbench -n -M prepared -T 300 -f select.sql > > * tps: > > plan_cache_mode = auto > > HEAD: 1915 tps > Patched: 2394 tps > > plan_cache_mode = custom (non-problematic: generic plan is never created) > > HEAD: 2402 tps > Patched: 2393 tps Amit-san, thanks for testing this. I also re-ran my tests (3/19) with HEAD(413ccaa) and HEAD(413ccaa) + patched, and I can still detect the performance difference with plan_cache_mode = auto. Thanks -- Yoshikazu Imai
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 2019/04/05 5:42, Peter Eisentraut wrote: > On 2019-04-04 06:58, Amit Langote wrote: >> Also, since the "speed up partition planning" patch went in (428b260f8), >> it might be possible to see the performance boost even with the >> partitioning example you cited upthread. > > I can't detect any performance improvement with the patch applied to > current master, using the test case from Yoshikazu Imai (2019-03-19). I was able to detect it as follows.

* partitioned table setup:

$ cat ht.sql
drop table ht cascade;
create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values with (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0, 8191) x;
\gexec

* pgbench script:

$ cat select.sql
\set param random(1, 8192)
select * from ht where a = :param

* pgbench (5 minute run with -M prepared):

pgbench -n -M prepared -T 300 -f select.sql

* tps:

plan_cache_mode = auto
  HEAD: 1915 tps
  Patched: 2394 tps

plan_cache_mode = custom (non-problematic: generic plan is never created)
  HEAD: 2402 tps
  Patched: 2393 tps

Thanks, Amit
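[Editorial note: the auto-mode numbers above imply roughly a 25% throughput gain while custom mode is flat within noise. A quick, illustrative check of that arithmetic (numbers copied from the message):]

```python
# tps figures quoted in the message above.
head_auto, patched_auto = 1915, 2394
head_custom, patched_custom = 2402, 2393

auto_gain = (patched_auto - head_auto) / head_auto
custom_gain = (patched_custom - head_custom) / head_custom

print(f"auto:   {auto_gain:+.1%}")    # about +25%: LockReleaseAll() no longer scans a bloated table
print(f"custom: {custom_gain:+.1%}")  # within noise: custom plans never bloat the lock table
```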
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Peter, Imai-san, From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com] > I can't detect any performance improvement with the patch applied to > current master, using the test case from Yoshikazu Imai (2019-03-19). That's strange... Peter, Imai-san, can you compare your test procedures? Peter, can you check and see the performance improvement with David's method on March 26 instead? Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 2019-04-04 06:58, Amit Langote wrote: > Also, since the "speed up partition planning" patch went in (428b260f8), > it might be possible to see the performance boost even with the > partitioning example you cited upthread. I can't detect any performance improvement with the patch applied to current master, using the test case from Yoshikazu Imai (2019-03-19). -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi, On 2019/04/04 13:37, Tsunakawa, Takayuki wrote: > Hi Peter, > > From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com] >> I did a bit of performance testing, both a plain pgbench and the >> suggested test case with 4096 partitions. I can't detect any >> performance improvements. In fact, within the noise, it tends to be >> just a bit on the slower side. >> >> So I'd like to kick it back to the patch submitter now and ask for more >> justification and performance analysis. >> >> Perhaps "speeding up planning with partitions" needs to be accepted first? > > David kindly showed how to demonstrate the performance improvement on March > 26, so I changed the status to needs review. I'd appreciate it if you could > continue the final check. Also, since the "speed up partition planning" patch went in (428b260f8), it might be possible to see the performance boost even with the partitioning example you cited upthread. Thanks, Amit
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Peter, From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com] > I did a bit of performance testing, both a plain pgbench and the > suggested test case with 4096 partitions. I can't detect any > performance improvements. In fact, within the noise, it tends to be > just a bit on the slower side. > > So I'd like to kick it back to the patch submitter now and ask for more > justification and performance analysis. > > Perhaps "speeding up planning with partitions" needs to be accepted first? David kindly showed how to demonstrate the performance improvement on March 26, so I changed the status to needs review. I'd appreciate it if you could continue the final check. Regards Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > Here's a benchmark doing that using pgbench's script weight feature. Wow, I didn't know that pgbench has evolved to have such a convenient feature. Thanks for telling me how to utilize it in testing. PostgreSQL is cool! Regards Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp] > My understanding of what David wrote is that the slowness of bloated hash > table is hard to notice, because planning itself is pretty slow. With the > "speeding up planning with partitions" patch, planning becomes quite fast, > so the bloated hash table overhead and so your patch's benefit is easier > to notice. This patch is clearly helpful, but it's just hard to notice > it > when the other big bottleneck is standing in the way. Ah, I see. I failed to recognize the simple fact that without your patch, EXECUTE on a table with many partitions is slow due to the custom planning time proportional to the number of partitions. Thanks for waking up my sleeping head! Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Tue, 26 Mar 2019 at 21:55, David Rowley wrote: > > On Tue, 26 Mar 2019 at 21:23, Tsunakawa, Takayuki > wrote: > > Thank you David for explaining. Although I may not understand the effect > > of "speeding up planning with partitions" patch, this patch takes effect > > even without it. That is, perform the following in the same session: > > > > 1. SELECT count(*) FROM table; on a table with many partitions. That > > bloats the LocalLockHash. > > 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1; > > 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate > > transaction. Without the patch, each transaction's LockReleaseAll() has to > > scan the bloated large hash table. > > Oh. I think I see what you're saying. Really the table in #2 would > have to be some completely different table that's not partitioned. I > think in that case it should make a difference. Here's a benchmark doing that using pgbench's script weight feature. I've set this up so the query that hits the partitioned table runs once for every 10k times the other script runs. I picked that number so the lock table was expanded fairly early on in the benchmark.
setup:

create table t1 (a int primary key);
create table hp (a int) partition by hash (a);
select 'create table hp'||x|| ' partition of hp for values with (modulus 4096, remainder ' || x || ');' from generate_series(0,4095) x;
\gexec

hp.sql:
select count(*) from hp;

t1.sql:
\set p 1
select a from t1 where a = :p;

Master = c8c885b7a5

Master:
$ pgbench -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@1 postgres
SQL script 2: t1.sql
 - 1057306 transactions (100.0% of total, tps = 17621.756473)
 - 1081905 transactions (100.0% of total, tps = 18021.449914)
 - 1122420 transactions (100.0% of total, tps = 18690.804699)

Master + 0002-speed-up-LOCALLOCK-scan.patch:
$ pgbench -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@1 postgres
SQL script 2: t1.sql
 - 1277014 transactions (100.0% of total, tps = 21283.551615)
 - 1184052 transactions (100.0% of total, tps = 19734.185872)
 - 1188523 transactions (100.0% of total, tps = 19785.835662)

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
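[Editorial note: pgbench's "@weight" suffix picks a script for each transaction with probability weight / total_weight. A small Python model of that selection rule, using illustrative weights matching the roughly 1-per-10k ratio David describes (this is a sketch of the idea, not pgbench's C code):]

```python
import random

def pick_script(scripts, rng):
    """Choose a script the way pgbench's @weight suffix does:
    each transaction picks script i with probability weight_i / total."""
    total = sum(w for _, w in scripts)
    r = rng.randrange(total)
    for name, weight in scripts:
        if r < weight:
            return name
        r -= weight

rng = random.Random(42)
# hp.sql (hits all 4096 partitions, bloats the local lock table) is run
# about once per 10,001 transactions; t1.sql is the cheap point query.
scripts = [("hp.sql", 1), ("t1.sql", 10_000)]
counts = {"hp.sql": 0, "t1.sql": 0}
for _ in range(100_000):
    counts[pick_script(scripts, rng)] += 1
print(counts)  # hp.sql is rare (~10 of 100k), but runs early enough to bloat the table
```

The design point is that the rare script bloats the lock table once, after which every cheap t1.sql transaction pays the LockReleaseAll() scan cost, so the tps numbers for t1.sql expose the overhead directly.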
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Tue, 26 Mar 2019 at 21:23, Tsunakawa, Takayuki wrote: > Thank you David for explaining. Although I may not understand the effect of > "speeding up planning with partitions" patch, this patch takes effect even > without it. That is, perform the following in the same session: > > 1. SELECT count(*) FROM table; on a table with many partitions. That bloats > the LocalLockHash. > 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1; > 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate > transaction. Without the patch, each transaction's LockReleaseAll() has to > scan the bloated large hash table. Oh. I think I see what you're saying. Really the table in #2 would have to be some completely different table that's not partitioned. I think in that case it should make a difference. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Tsunakawa-san, On 2019/03/26 17:21, Tsunakawa, Takayuki wrote: > From: David Rowley [mailto:david.row...@2ndquadrant.com] >> On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut >> wrote: >>> Perhaps "speeding up planning with partitions" needs to be accepted first? >> >> Yeah, I think it likely will require that patch to be able to measure >> the gains from this patch. >> >> If planning a SELECT to a partitioned table with a large number of >> partitions using PREPAREd statements, when we attempt the generic plan >> on the 6th execution, it does cause the local lock table to expand to >> fit all the locks for each partition. This does cause the >> LockReleaseAll() to become slow due to the hash_seq_search having to >> skip over many empty buckets. Since generating a custom plan for a >> partitioned table with many partitions is still slow in master, then I >> very much imagine you'll struggle to see the gains brought by this >> patch. > > Thank you David for explaining. Although I may not understand the effect of > "speeding up planning with partitions" patch, this patch takes effect even > without it. That is, perform the following in the same session: > > 1. SELECT count(*) FROM table; on a table with many partitions. That bloats > the LocalLockHash. > 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1; > 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate > transaction. Without the patch, each transaction's LockReleaseAll() has to > scan the bloated large hash table. My understanding of what David wrote is that the slowness of bloated hash table is hard to notice, because planning itself is pretty slow. With the "speeding up planning with partitions" patch, planning becomes quite fast, so the bloated hash table overhead and so your patch's benefit is easier to notice. This patch is clearly helpful, but it's just hard to notice it when the other big bottleneck is standing in the way. Thanks, Amit
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut > wrote: > > Perhaps "speeding up planning with partitions" needs to be accepted first? > > Yeah, I think it likely will require that patch to be able to measure > the gains from this patch. > > If planning a SELECT to a partitioned table with a large number of > partitions using PREPAREd statements, when we attempt the generic plan > on the 6th execution, it does cause the local lock table to expand to > fit all the locks for each partition. This does cause the > LockReleaseAll() to become slow due to the hash_seq_search having to > skip over many empty buckets. Since generating a custom plan for a > partitioned table with many partitions is still slow in master, then I > very much imagine you'll struggle to see the gains brought by this > patch. Thank you David for explaining. Although I may not understand the effect of "speeding up planning with partitions" patch, this patch takes effect even without it. That is, perform the following in the same session: 1. SELECT count(*) FROM table; on a table with many partitions. That bloats the LocalLockHash. 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1; 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction. Without the patch, each transaction's LockReleaseAll() has to scan the bloated large hash table. Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut wrote: > I did a bit of performance testing, both a plain pgbench and the > suggested test case with 4096 partitions. I can't detect any > performance improvements. In fact, within the noise, it tends to be > just a bit on the slower side. > > So I'd like to kick it back to the patch submitter now and ask for more > justification and performance analysis. > > Perhaps "speeding up planning with partitions" needs to be accepted first? Yeah, I think it likely will require that patch to be able to measure the gains from this patch. If planning a SELECT to a partitioned table with a large number of partitions using PREPAREd statements, when we attempt the generic plan on the 6th execution, it does cause the local lock table to expand to fit all the locks for each partition. This does cause the LockReleaseAll() to become slow due to the hash_seq_search having to skip over many empty buckets. Since generating a custom plan for a partitioned table with many partitions is still slow in master, then I very much imagine you'll struggle to see the gains brought by this patch. I did a quick benchmark too and couldn't measure anything:

create table hp (a int) partition by hash (a);
select 'create table hp'||x|| ' partition of hp for values with (modulus 4096, remainder ' || x || ');' from generate_series(0,4095) x;

bench.sql:
\set p_a 13315
select * from hp where a = :p_a;

Master:
$ pgbench -M prepared -n -T 60 -f bench.sql postgres
tps = 31.844468 (excluding connections establishing)
tps = 32.950154 (excluding connections establishing)
tps = 31.534761 (excluding connections establishing)

Patched:
$ pgbench -M prepared -n -T 60 -f bench.sql postgres
tps = 30.099133 (excluding connections establishing)
tps = 32.157328 (excluding connections establishing)
tps = 32.329884 (excluding connections establishing)

The situation won't be any better with plan_cache_mode = force_generic_plan either.
In this case, we'll only plan once but we'll also have to obtain and release a lock for each partition for each execution of the prepared statement. LockReleaseAll() is going to be slow in that case because it actually has to release a lot of locks. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
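[Editorial note: the slowdown David describes — hash_seq_search skipping many empty buckets after the local lock table has grown — can be illustrated with a toy model. This is plain Python, not PostgreSQL's dynahash; the key shared property is that the bucket array grows on demand but never shrinks, so a full sequential scan stays expensive even when only one lock is held:]

```python
class LocalLockTable:
    """Toy stand-in for the backend-local lock hash table."""

    def __init__(self, nbuckets=16):
        self.buckets = [[] for _ in range(nbuckets)]
        self.nentries = 0

    def insert(self, key):
        # Double the bucket array when full; like dynahash, never shrink it.
        if self.nentries >= len(self.buckets):
            old_keys = [k for b in self.buckets for k in b]
            self.buckets = [[] for _ in range(2 * len(self.buckets))]
            self.nentries = 0
            for k in old_keys:
                self._place(k)
        self._place(key)

    def _place(self, key):
        self.buckets[hash(key) % len(self.buckets)].append(key)
        self.nentries += 1

    def release_all(self):
        # Analogue of LockReleaseAll() via hash_seq_search():
        # walk every bucket, occupied or not.
        visited = 0
        for b in self.buckets:
            visited += 1
            b.clear()
        self.nentries = 0
        return visited

t = LocalLockTable()
for rel in range(4096):        # one query touching 4096 partitions
    t.insert(("relation", rel))
t.release_all()

t.insert(("relation", 1))      # a later single-table transaction
print(t.release_all())         # scans 4096 buckets to release one lock
```

The patch discussed in this thread sidesteps this by tracking held LOCALLOCKs in a dlist, so releasing them no longer needs to walk the whole bucket array.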
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 2019-03-19 16:38, Peter Eisentraut wrote: > On 2019-03-19 10:21, Tsunakawa, Takayuki wrote: >> From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com] >>> Fixed. >> >> Rebased on HEAD. > > I have committed the first patch that reorganizes the struct. I'll have > to spend some time evaluating the performance impact of the second > patch, but it seems OK in principle. Performance tests from others welcome. I did a bit of performance testing, both a plain pgbench and the suggested test case with 4096 partitions. I can't detect any performance improvements. In fact, within the noise, it tends to be just a bit on the slower side. So I'd like to kick it back to the patch submitter now and ask for more justification and performance analysis. Perhaps "speeding up planning with partitions" needs to be accepted first? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 2019-03-19 10:21, Tsunakawa, Takayuki wrote: > From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com] >> Fixed. > > Rebased on HEAD. I have committed the first patch that reorganizes the struct. I'll have to spend some time evaluating the performance impact of the second patch, but it seems OK in principle. Performance tests from others welcome. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com] > Fixed. Rebased on HEAD. Regards Takayuki Tsunakawa [Attachments: 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch, 0002-speed-up-LOCALLOCK-scan.patch]
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Tsunakawa-san, Peter, On Tue, Mar 19, 2019 at 7:53 AM, Tsunakawa, Takayuki wrote: > From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com] > > You posted a link to some performance numbers, but I didn't see the > > test setup explained there. I'd like to get some more information on > > the impact of this. Is there an effect with 100 tables, or do you need 10? > > Imai-san, can you tell us the test setup? Maybe I used this test setup[1]. I tested again with those settings for prepared transactions. I used Tsunakawa-san's patch for locallock[2] (which couldn't be applied to current master, so I fixed it) and Amit's v32 patch for speeding up the planner[3].

[settings]
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096

[partitioning table definitions (with 4096 partitions)]
create table rt (a int, b int, c int) partition by range (a);
\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' || (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 4096) x;
\gexec
\o

[select4096.sql]
\set a random(1, 4096)
select a from rt where a = :a;

[pgbench (with 4096 partitions)]
pgbench -n -f select4096.sql -T 60 -M prepared

[results]
        master  locallock  v32    v32+locallock
auto    21.9    22.9       6,834  7,355
custom  19.7    20.0       7,415  7,252

[1] https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A51256276%40g01jpexmbkw24
[2] https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1FBDFA00%40G01JPEXMBYT05
[3] https://www.postgresql.org/message-id/9feacaf6-ddb3-96dd-5b98-df5e927b1439%40lab.ntt.co.jp
--
Yoshikazu Imai
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Peter, Imai-san, From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com] > Your changes in LOCALLOCK still refer to PGPROC, from your first version > of the patch. > > I think the reordering of struct members could be done as a separate > preliminary patch. > > Some more documentation in the comment before dlist_head LocalLocks to > explain this whole mechanism would be nice. Fixed. > You posted a link to some performance numbers, but I didn't see the test > setup explained there. I'd like to get some more information on the > impact of this. Is there an effect with 100 tables, or do you need 10? Imai-san, can you tell us the test setup? Regards Takayuki Tsunakawa [Attachments: 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch, 0002-speed-up-LOCALLOCK-scan.patch]