Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Thu, 9 Nov 2023 at 21:48, Heikki Linnakangas wrote:
>
> On 18/09/2023 07:08, David Rowley wrote:
> > On Fri, 15 Sept 2023 at 22:37, Heikki Linnakangas wrote:
> >>> I've added a call to LockAssertNoneHeld(false) in there.
> >>
> >> I don't see it in the patch?
> >
> > hmm. I must've git format-patch before committing that part.
> >
> > I'll try that again... see attached.
>
> This needed a rebase after my ResourceOwner refactoring. Attached.
>
> A few quick comments:
>
> - It would be nice to add a test for the issue that you fixed in patch
> v7, i.e. if you prepare a transaction while holding session-level locks.
>
> - GrantLockLocal() now calls MemoryContextAlloc(), which can fail if you
> are out of memory. Is that handled gracefully or is the lock leaked?

CFBot shows that one of the tests has aborted at [1] with:

[20:54:28.535] Core was generated by `postgres: subscriber: logical replication apply worker for subscription 16397 '.
[20:54:28.535] Program terminated with signal SIGABRT, Aborted.
[20:54:28.535] #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
[20:54:28.535] Download failed: Invalid argument. Continuing without source file ./signal/../sysdeps/unix/sysv/linux/raise.c.
[20:54:28.627]
[20:54:28.627] Thread 1 (Thread 0x7f0ea02d1a40 (LWP 50984)):
[20:54:28.627] #0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
...
...
[20:54:28.627] #2  0x5618e989d62f in ExceptionalCondition (conditionName=conditionName@entry=0x5618e9b40f70 "dlist_is_empty(&(MyProc->myProcLocks[i]))", fileName=fileName@entry=0x5618e9b40ec0 "../src/backend/storage/lmgr/proc.c", lineNumber=lineNumber@entry=856) at ../src/backend/utils/error/assert.c:66
[20:54:28.627] No locals.
[20:54:28.627] #3  0x5618e95e6847 in ProcKill (code=<optimized out>, arg=<optimized out>) at ../src/backend/storage/lmgr/proc.c:856
[20:54:28.627]         i = <optimized out>
[20:54:28.627]         proc = <optimized out>
[20:54:28.627]         procgloballist = <optimized out>
[20:54:28.627]         __func__ = "ProcKill"
[20:54:28.627] #4  0x5618e959ebcc in shmem_exit (code=code@entry=1) at ../src/backend/storage/ipc/ipc.c:276
[20:54:28.627]         __func__ = "shmem_exit"
[20:54:28.627] #5  0x5618e959ecd0 in proc_exit_prepare (code=code@entry=1) at ../src/backend/storage/ipc/ipc.c:198
[20:54:28.627]         __func__ = "proc_exit_prepare"
[20:54:28.627] #6  0x5618e959ee8e in proc_exit (code=code@entry=1) at ../src/backend/storage/ipc/ipc.c:111
[20:54:28.627]         __func__ = "proc_exit"
[20:54:28.627] #7  0x5618e94aa54d in BackgroundWorkerMain () at ../src/backend/postmaster/bgworker.c:805
[20:54:28.627]         local_sigjmp_buf = {{__jmpbuf = {94665009627112, -3865857745677845768, 0, 0, 140732736634980, 1, 3865354362587970296, 7379258256398875384}, __mask_was_saved = 1, __saved_mask = {__val = {18446744066192964099, 94665025527920, 94665025527920, 94665025527920, 0, 94665025528120, 8192, 1, 94664997686410, 94665009627040, 94664997622076, 94665025527920, 1, 0, 0, 140732736634980
[20:54:28.627]         worker = 0x5618eb37c570
[20:54:28.627]         entrypt = <optimized out>
[20:54:28.627]         __func__ = "BackgroundWorkerMain"
[20:54:28.627] #8  0x5618e94b495c in do_start_bgworker (rw=rw@entry=0x5618eb3b73c8) at ../src/backend/postmaster/postmaster.c:5697
[20:54:28.627]         worker_pid = <optimized out>
[20:54:28.627]         __func__ = "do_start_bgworker"
[20:54:28.627] #9  0x5618e94b4c32 in maybe_start_bgworkers () at ../src/backend/postmaster/postmaster.c:5921
[20:54:28.627]         rw = 0x5618eb3b73c8
[20:54:28.627]         num_launched = 0
[20:54:28.627]         now = 0
[20:54:28.627]         iter = {cur = 0x5618eb3b79a8, next = 0x5618eb382a20, prev = 0x5618ea44a980}
[20:54:28.627] #10 0x5618e94b574a in process_pm_pmsignal () at ../src/backend/postmaster/postmaster.c:5073
[20:54:28.627]         __func__ = "process_pm_pmsignal"
[20:54:28.627] #11 0x5618e94b5f4a in ServerLoop () at ../src/backend/postmaster/postmaster.c:1760

[1] - https://cirrus-ci.com/task/5118173163290624?logs=cores#L51

Regards,
Vignesh
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 18/09/2023 07:08, David Rowley wrote:
> On Fri, 15 Sept 2023 at 22:37, Heikki Linnakangas wrote:
>>> I've added a call to LockAssertNoneHeld(false) in there.
>>
>> I don't see it in the patch?
>
> hmm. I must've git format-patch before committing that part.
>
> I'll try that again... see attached.

This needed a rebase after my ResourceOwner refactoring. Attached.

A few quick comments:

- It would be nice to add a test for the issue that you fixed in patch
v7, i.e. if you prepare a transaction while holding session-level locks.

- GrantLockLocal() now calls MemoryContextAlloc(), which can fail if you
are out of memory. Is that handled gracefully or is the lock leaked?

--
Heikki Linnakangas
Neon (https://neon.tech)

diff --git a/src/backend/access/transam/xact.c b/src/backend/access/transam/xact.c
index 74ce5f9491c..8c8585be7ab 100644
--- a/src/backend/access/transam/xact.c
+++ b/src/backend/access/transam/xact.c
@@ -2415,6 +2415,7 @@ PrepareTransaction(void)
 	TransactionId xid = GetCurrentTransactionId();
 	GlobalTransaction gxact;
 	TimestampTz prepared_at;
+	HTAB	   *sessionandxactlocks;
 
 	Assert(!IsInParallelMode());
 
@@ -2560,7 +2561,17 @@ PrepareTransaction(void)
 	StartPrepare(gxact);
 
 	AtPrepare_Notify();
-	AtPrepare_Locks();
+
+	/*
+	 * Prepare the locks and save the returned hash table that describes if
+	 * the lock is held at the session and/or transaction level.  We need to
+	 * know if we're dealing with session locks inside PostPrepare_Locks(),
+	 * but we're unable to build the hash table there due to that function
+	 * only discovering if we're dealing with a session lock while we're in a
+	 * critical section, in which case we can't allocate memory for the hash
+	 * table.
+	 */
+	sessionandxactlocks = AtPrepare_Locks();
 	AtPrepare_PredicateLocks();
 	AtPrepare_PgStat();
 	AtPrepare_MultiXact();
@@ -2587,7 +2598,10 @@ PrepareTransaction(void)
 	 * ProcArrayClearTransaction().  Otherwise, a GetLockConflicts() would
 	 * conclude "xact already committed or aborted" for our locks.
 	 */
-	PostPrepare_Locks(xid);
+	PostPrepare_Locks(xid, sessionandxactlocks);
+
+	/* We no longer need this hash table */
+	hash_destroy(sessionandxactlocks);
 
 	/*
 	 * Let others know about no transaction in progress by me. This has to be
diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index 296dc82d2ee..5baf83ac6ce 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -71,7 +71,12 @@ DiscardAll(bool isTopLevel)
 	ResetAllOptions();
 	DropAllPreparedStatements();
 	Async_UnlistenAll();
-	LockReleaseAll(USER_LOCKMETHOD, true);
+	LockReleaseSession(USER_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+	LockAssertNoneHeld(false);
+#endif
+
 	ResetPlanCache();
 	ResetTempTableNamespace();
 	ResetSequenceCaches();
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 501910b4454..835f112f751 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -841,7 +841,11 @@ logicalrep_worker_onexit(int code, Datum arg)
 	 * The locks will be acquired once the worker is initialized.
 	 */
 	if (!InitializingApplyWorker)
-		LockReleaseAll(DEFAULT_LOCKMETHOD, true);
+		LockReleaseSession(DEFAULT_LOCKMETHOD);
+
+#ifdef USE_ASSERT_CHECKING
+	LockAssertNoneHeld(false);
+#endif
 
 	ApplyLauncherWakeup();
 }
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index 45de0fd2bd6..e7e0f29347a 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -182,12 +182,6 @@ holdMask -
 subset of the PGPROC object's heldLocks mask (if the PGPROC is
 currently waiting for another lock mode on this lock).
 
-releaseMask -
-A bitmask for the lock modes due to be released during LockReleaseAll.
-This must be a subset of the holdMask.  Note that it is modified without
-taking the partition LWLock, and therefore it is unsafe for any
-backend except the one owning the PROCLOCK to examine/change it.
-
 lockLink -
 List link for shared memory queue of all the PROCLOCK objects for the
 same LOCK.
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index b8c57b3e16e..ce54b589b0d 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -22,8 +22,7 @@
  * Interface:
  *
  *	InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(),
- *	LockAcquire(), LockRelease(), LockReleaseAll(),
- *	LockCheckConflicts(), GrantLock()
+ *	LockAcquire(), LockRelease(), LockCheckConflicts(), GrantLock()
 *
 *-
 */
@@ -162,6 +161,16 @@ typedef struct TwoPhaseLockRecord
 	LOCKMODE	lockmode;
 } TwoPhaseLockRecord;
 
+/*
+ * Used by for a hash table entry type in AtPrepare_Locks() to communicate the
+ * session/xact lock status of each held lock for use in PostPrepare_Locks().
+ */
+typedef struct
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Fri, 15 Sept 2023 at 22:37, Heikki Linnakangas wrote:
> > I've added a call to LockAssertNoneHeld(false) in there.
>
> I don't see it in the patch?

hmm. I must've git format-patch before committing that part.

I'll try that again... see attached.

David

v8-0001-wip-resowner-lock-release-all.patch
Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 11/09/2023 15:00, David Rowley wrote:
> On Wed, 5 Jul 2023 at 21:44, Heikki Linnakangas wrote:
>> index 296dc82d2ee..edb8b6026e5 100644
>> --- a/src/backend/commands/discard.c
>> +++ b/src/backend/commands/discard.c
>> @@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
>>  	Async_UnlistenAll();
>> -	LockReleaseAll(USER_LOCKMETHOD, true);
>> +	LockReleaseSession(USER_LOCKMETHOD);
>>  	ResetPlanCache();
>>
>> This assumes that there are no transaction-level advisory locks. I think
>> that's OK. It took me a while to convince myself of that, though. I
>> think we need a high level comment somewhere that explains what
>> assumptions we make on which locks can be held in session mode and which
>> in transaction mode.
>
> Isn't it ok because DISCARD ALL cannot run inside a transaction block,
> so there should be no locks taken apart from possibly session-level locks?

Hmm, sounds valid. I think I convinced myself that it's OK through some
other reasoning, but I don't remember it now.

> I've added a call to LockAssertNoneHeld(false) in there.

I don't see it in the patch?

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Thank you for having a look at this. Apologies for not getting back to
you sooner.

On Wed, 5 Jul 2023 at 21:44, Heikki Linnakangas wrote:
> On 10/02/2023 04:51, David Rowley wrote:
> > I've attached another set of patches. I do need to spend longer
> > looking at this. I'm mainly attaching these as CI seems to be
> > highlighting a problem that I'm unable to recreate locally and I
> > wanted to see if the attached fixes it.
>
> I like this patch's approach.
>
> > index 296dc82d2ee..edb8b6026e5 100644
> > --- a/src/backend/commands/discard.c
> > +++ b/src/backend/commands/discard.c
> > @@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
> >  	Async_UnlistenAll();
> > -	LockReleaseAll(USER_LOCKMETHOD, true);
> > +	LockReleaseSession(USER_LOCKMETHOD);
> >  	ResetPlanCache();
>
> This assumes that there are no transaction-level advisory locks. I think
> that's OK. It took me a while to convince myself of that, though. I
> think we need a high level comment somewhere that explains what
> assumptions we make on which locks can be held in session mode and which
> in transaction mode.

Isn't it ok because DISCARD ALL cannot run inside a transaction block,
so there should be no locks taken apart from possibly session-level locks?

I've added a call to LockAssertNoneHeld(false) in there.

> > @@ -3224,14 +3206,6 @@ PostPrepare_Locks(TransactionId xid)
> >  		Assert(lock->nGranted <= lock->nRequested);
> >  		Assert((proclock->holdMask & ~lock->grantMask) == 0);
> >
> > -		/* Ignore it if nothing to release (must be a session lock) */
> > -		if (proclock->releaseMask == 0)
> > -			continue;
> > -
> > -		/* Else we should be releasing all locks */
> > -		if (proclock->releaseMask != proclock->holdMask)
> > -			elog(PANIC, "we seem to have dropped a bit somewhere");
> > -
> > 		/*
> > 		 * We cannot simply modify proclock->tag.myProc to reassign
> > 		 * ownership of the lock, because that's part of the hash key and
>
> This looks wrong. If you prepare a transaction that is holding any
> session locks, we will now transfer them to the prepared transaction.
> And its locallock entry will be out of sync. To fix, I think we could
> keep around the hash table that CheckForSessionAndXactLocks() builds,
> and use that here.

Good catch. I've modified the patch to keep the hash table built in
CheckForSessionAndXactLocks around for longer so that we can check for
session locks.

I've attached an updated patch, mainly to get CI checking this. I
suspect something is wrong as subscription/015_stream is timing out.
I've not gotten to the bottom of that yet.

David

v7-0001-wip-resowner-lock-release-all.patch
Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 10/02/2023 04:51, David Rowley wrote:
> I've attached another set of patches. I do need to spend longer
> looking at this. I'm mainly attaching these as CI seems to be
> highlighting a problem that I'm unable to recreate locally and I
> wanted to see if the attached fixes it.

I like this patch's approach.

> index 296dc82d2ee..edb8b6026e5 100644
> --- a/src/backend/commands/discard.c
> +++ b/src/backend/commands/discard.c
> @@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
>  	Async_UnlistenAll();
> -	LockReleaseAll(USER_LOCKMETHOD, true);
> +	LockReleaseSession(USER_LOCKMETHOD);
>  	ResetPlanCache();

This assumes that there are no transaction-level advisory locks. I think
that's OK. It took me a while to convince myself of that, though. I
think we need a high level comment somewhere that explains what
assumptions we make on which locks can be held in session mode and which
in transaction mode.

> @@ -3224,14 +3206,6 @@ PostPrepare_Locks(TransactionId xid)
>  		Assert(lock->nGranted <= lock->nRequested);
>  		Assert((proclock->holdMask & ~lock->grantMask) == 0);
>
> -		/* Ignore it if nothing to release (must be a session lock) */
> -		if (proclock->releaseMask == 0)
> -			continue;
> -
> -		/* Else we should be releasing all locks */
> -		if (proclock->releaseMask != proclock->holdMask)
> -			elog(PANIC, "we seem to have dropped a bit somewhere");
> -
> 		/*
> 		 * We cannot simply modify proclock->tag.myProc to reassign
> 		 * ownership of the lock, because that's part of the hash key and

This looks wrong. If you prepare a transaction that is holding any
session locks, we will now transfer them to the prepared transaction.
And its locallock entry will be out of sync. To fix, I think we could
keep around the hash table that CheckForSessionAndXactLocks() builds,
and use that here.

--
Heikki Linnakangas
Neon (https://neon.tech)
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Thanks for having a look at this.

On Wed, 1 Feb 2023 at 03:07, Amit Langote wrote:
> Maybe you're planning to do it once this patch is post the PoC phase
> (isn't it?), but it would be helpful to have commentary on all the new
> dlist fields.

I've added comments on the new fields. Maybe we can say the patch is "wip".

> This seems to be replacing what is a cache with an upper limit on the
> number of cached locks with something that has no limit on how many
> per-owner locks are remembered. I wonder whether we'd be doing
> additional work in some cases with the new no-limit implementation
> that wasn't being done before (where the owner's locks array is
> overflowed) or maybe not much because the new implementation of
> ResourceOwner{Remember|Forget}Lock() is simple push/delete of a dlist
> node from the owner's dlist?

It's a good question. I don't really have a good test to find out; the
problem is we'd need to benchmark taking fewer than 16 locks. On trying
that, I find that there's just too much variability in the performance
between runs to determine if there's any slowdown.

$ cat 10_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,10) x;

$ pgbench -f 10_locks.sql@1000 -M prepared -T 10 -n postgres | grep -E "(tps)"
tps = 47809.306088 (without initial connection time)
tps = 66859.789072 (without initial connection time)
tps = 37885.924616 (without initial connection time)

On trying with more locks, I see there are good wins from the patched version.

$ cat 100_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,100) x;

$ cat 1k_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,1000) x;

$ cat 10k_locks.sql
select count(pg_advisory_lock(x)) from generate_series(1,10000) x;

Test 1: Take 100 locks but periodically take 10k locks to bloat the
local lock table.

master:
$ pgbench -f 100_locks.sql@1000 -f 10k_locks.sql@1 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
transaction type: multiple scripts
tps = 2726.197037 (without initial connection time)
SQL script 1: 100_locks.sql
 - 27219 transactions (99.9% of total, tps = 2722.496227)
SQL script 2: 10k_locks.sql
 - 37 transactions (0.1% of total, tps = 3.700810)

patched:
$ pgbench -f 100_locks.sql@1000 -f 10k_locks.sql@1 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
transaction type: multiple scripts
tps = 34047.297822 (without initial connection time)
SQL script 1: 100_locks.sql
 - 340039 transactions (99.9% of total, tps = 34012.688879)
SQL script 2: 10k_locks.sql
 - 346 transactions (0.1% of total, tps = 34.608943)

patched without slab context:
$ pgbench -f 100_locks.sql@1000 -f 10k_locks.sql@1 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
transaction type: multiple scripts
tps = 34851.770846 (without initial connection time)
SQL script 1: 100_locks.sql
 - 348097 transactions (99.9% of total, tps = 34818.662324)
SQL script 2: 10k_locks.sql
 - 331 transactions (0.1% of total, tps = 33.108522)

Test 2: Always take just 100 locks and don't bloat the local lock table.

master:
$ pgbench -f 100_locks.sql@1000 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
tps = 32682.491548 (without initial connection time)

patched:
$ pgbench -f 100_locks.sql@1000 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
tps = 35637.241815 (without initial connection time)

patched without slab context:
$ pgbench -f 100_locks.sql@1000 -M prepared -T 10 -n postgres | grep -E "(tps|script)"
tps = 36192.185181 (without initial connection time)

The attached 0003 patch is an experiment to see if using a slab memory
context has any advantages for storing the LOCALLOCKOWNER structs. There
seems to be a small performance hit from doing this.

> The following comment is now obsolete:
>
> /*
>  * LockReassignCurrentOwner
>  *		Reassign all locks belonging to CurrentResourceOwner to belong
>  *		to its parent resource owner.
>  *
>  * If the caller knows what those locks are, it can pass them as an array.
>  * That speeds up the call significantly, when a lot of locks are held
>  * (e.g pg_dump with a large schema). Otherwise, pass NULL for locallocks,
>  * and we'll traverse through our hash table to find them.
>  */

I've removed the obsolete part.

I've attached another set of patches. I do need to spend longer looking
at this. I'm mainly attaching these as CI seems to be highlighting a
problem that I'm unable to recreate locally and I wanted to see if the
attached fixes it.

David

From 4a546ad6d33e544dd872b23a94925a262088cd9a Mon Sep 17 00:00:00 2001
From: David Rowley
Date: Tue, 24 Jan 2023 16:00:58 +1300
Subject: [PATCH v6 1/3] wip-resowner-lock-release-all.

---
 src/backend/commands/discard.c             |   2 +-
 src/backend/replication/logical/launcher.c |   2 +-
 src/backend/storage/lmgr/README            |   6 -
 src/backend/storage/lmgr/lock.c            | 650 ++---
 src/backend/storage/lmgr/proc.c            |  17 +-
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi,

On 2023-01-24 16:57:37 +1300, David Rowley wrote:
> I've attached a rebased patch.

Looks like there's some issue causing tests to fail probabilistically:
https://cirrus-ci.com/github/postgresql-cfbot/postgresql/commitfest%2F42%2F3501

Several failures are when testing a 32bit build.

> While reading over this again, I wondered if instead of allocating the
> memory for the LOCALLOCKOWNER in TopMemoryContext, maybe we should
> create a Slab context as a child of TopMemoryContext and perform the
> allocations there.

Yes, that does make sense.

> I would like to get this LockReleaseAll problem finally fixed in PG16,
> but I'd feel much better about this patch if it had some review from
> someone who has more in-depth knowledge of the locking code.

I feel my review wouldn't be independent, but I'll give it a shot if
nobody else does.

Greetings,

Andres Freund
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi David,

On Tue, Jan 24, 2023 at 12:58 PM David Rowley wrote:
> On Fri, 20 Jan 2023 at 00:26, vignesh C wrote:
> > CFBot shows some compilation errors as in [1], please post an updated
> > version for the same:
>
> I've attached a rebased patch.

Thanks for the new patch.

Maybe you're planning to do it once this patch is post the PoC phase
(isn't it?), but it would be helpful to have commentary on all the new
dlist fields. Especially, I think the following warrants a bit more
explanation than the others:

-	/* We can remember up to MAX_RESOWNER_LOCKS references to local locks. */
-	int			nlocks;			/* number of owned locks */
-	LOCALLOCK  *locks[MAX_RESOWNER_LOCKS];	/* list of owned locks */
+	dlist_head	locks;			/* dlist of owned locks */

This seems to be replacing what is a cache with an upper limit on the
number of cached locks with something that has no limit on how many
per-owner locks are remembered. I wonder whether we'd be doing
additional work in some cases with the new no-limit implementation that
wasn't being done before (where the owner's locks array is overflowed),
or maybe not much, because the new implementation of
ResourceOwner{Remember|Forget}Lock() is a simple push/delete of a dlist
node from the owner's dlist?

The following comment is now obsolete:

/*
 * LockReassignCurrentOwner
 *		Reassign all locks belonging to CurrentResourceOwner to belong
 *		to its parent resource owner.
 *
 * If the caller knows what those locks are, it can pass them as an array.
 * That speeds up the call significantly, when a lot of locks are held
 * (e.g pg_dump with a large schema). Otherwise, pass NULL for locallocks,
 * and we'll traverse through our hash table to find them.
 */

--
Thanks, Amit Langote
EDB: http://www.enterprisedb.com
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Fri, 20 Jan 2023 at 00:26, vignesh C wrote:
> CFBot shows some compilation errors as in [1], please post an updated
> version for the same:

I've attached a rebased patch.

While reading over this again, I wondered if instead of allocating the
memory for the LOCALLOCKOWNER in TopMemoryContext, maybe we should
create a Slab context as a child of TopMemoryContext and perform the
allocations there. I feel like slab might be a better option here as
it'll use slightly less memory due to it not rounding up allocations to
the next power of 2. sizeof(LOCALLOCKOWNER) == 56, so it's not a great
deal of memory, but more than nothing.

The primary reason that I think this might be a good idea is mostly
around better handling of chunk-on-block fragmentation in slab.c than
aset.c. If we have transactions which create a large number of locks
then we may end up growing the TopMemoryContext and never releasing the
AllocBlocks, just leaving a high number of 64-byte chunks on the
freelist that'll maybe never be used again. I'm thinking slab.c might
handle that better as it'll only keep around 10 completely empty
SlabBlocks before it'll start free'ing them. The slab allocator is quite
a bit faster now as a result of d21ded75f.

I would like to get this LockReleaseAll problem finally fixed in PG16,
but I'd feel much better about this patch if it had some review from
someone who has more in-depth knowledge of the locking code.

I've also gone and adjusted all the places that upgraded the
elog(WARNING)s of local lock table corruption to PANIC and put them back
to use WARNING again. While I think it might be a good idea to do that,
it seems to be adding a bit more resistance to this patch which I don't
think it really needs. Maybe we can consider that in a separate effort.
David

diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c
index 296dc82d2e..edb8b6026e 100644
--- a/src/backend/commands/discard.c
+++ b/src/backend/commands/discard.c
@@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel)
 	ResetAllOptions();
 	DropAllPreparedStatements();
 	Async_UnlistenAll();
-	LockReleaseAll(USER_LOCKMETHOD, true);
+	LockReleaseSession(USER_LOCKMETHOD);
 	ResetPlanCache();
 	ResetTempTableNamespace();
 	ResetSequenceCaches();
diff --git a/src/backend/replication/logical/launcher.c b/src/backend/replication/logical/launcher.c
index 564bffe5ca..20b2e3497e 100644
--- a/src/backend/replication/logical/launcher.c
+++ b/src/backend/replication/logical/launcher.c
@@ -798,7 +798,7 @@ logicalrep_worker_onexit(int code, Datum arg)
 	 * parallel apply mode and will not be released when the worker
 	 * terminates, so manually release all locks before the worker exits.
 	 */
-	LockReleaseAll(DEFAULT_LOCKMETHOD, true);
+	LockReleaseSession(DEFAULT_LOCKMETHOD);
 
 	ApplyLauncherWakeup();
 }
diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README
index d08ec6c402..9603cc8959 100644
--- a/src/backend/storage/lmgr/README
+++ b/src/backend/storage/lmgr/README
@@ -182,12 +182,6 @@ holdMask -
 subset of the PGPROC object's heldLocks mask (if the PGPROC is
 currently waiting for another lock mode on this lock).
 
-releaseMask -
-A bitmask for the lock modes due to be released during LockReleaseAll.
-This must be a subset of the holdMask.  Note that it is modified without
-taking the partition LWLock, and therefore it is unsafe for any
-backend except the one owning the PROCLOCK to examine/change it.
-
 lockLink -
 List link for shared memory queue of all the PROCLOCK objects for the
 same LOCK.
diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c
index 49d62a0dc7..9d9d27e0c9 100644
--- a/src/backend/storage/lmgr/lock.c
+++ b/src/backend/storage/lmgr/lock.c
@@ -22,8 +22,7 @@
 * Interface:
 *
 *	InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(),
- *	LockAcquire(), LockRelease(), LockReleaseAll(),
- *	LockCheckConflicts(), GrantLock()
+ *	LockAcquire(), LockRelease(), LockCheckConflicts(), GrantLock()
 *
 *-
 */
@@ -289,6 +288,9 @@ static LOCALLOCK *awaitedLock;
 static ResourceOwner awaitedOwner;
 
+static dlist_head session_locks[lengthof(LockMethods)];
+
+
 #ifdef LOCK_DEBUG
 /*--
@@ -375,8 +377,8 @@ static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner);
 static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode);
 static void FinishStrongLockAcquire(void);
 static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner);
-static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock);
-static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent);
+static void ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock);
+static void LockReassignOwner(LOCALLOCKOWNER *locallockowner,
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Wed, 3 Aug 2022 at 09:04, David Rowley wrote:
> On Wed, 3 Aug 2022 at 07:04, Jacob Champion wrote:
> > This entry has been waiting on author input for a while (our current
> > threshold is roughly two weeks), so I've marked it Returned with
> > Feedback.
>
> Thanks for taking care of this. You dealt with this correctly based on
> the fact that I'd failed to rebase before or during the entire July
> CF.
>
> I'm still interested in having the LockReleaseAll slowness fixed, so
> here's a rebased patch.

CFBot shows some compilation errors as in [1], please post an updated
version for the same:

[15:40:00.287] [1239/1809] Linking target src/backend/postgres
[15:40:00.287] FAILED: src/backend/postgres
[15:40:00.287] cc @src/backend/postgres.rsp
[15:40:00.287] /usr/bin/ld: src/backend/postgres_lib.a.p/replication_logical_launcher.c.o: in function `logicalrep_worker_onexit':
[15:40:00.287] /tmp/cirrus-ci-build/build/../src/backend/replication/logical/launcher.c:773: undefined reference to `LockReleaseAll'
[15:40:00.287] collect2: error: ld returned 1 exit status

[1] - https://cirrus-ci.com/task/4562493863886848

Regards,
Vignesh
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Thank you for looking at the patch.

On Fri, 4 Nov 2022 at 04:43, Ankit Kumar Pandey wrote:
> I don't see any performance improvement in tests.

Are you able to share what your test was? To see a performance
improvement you're likely going to have to obtain a large number of
locks in the session so that the local lock table becomes bloated, then
continue to run some fast query and observe that LockReleaseAll has
become slower as a result of the hash table becoming bloated.

Try running pgbench with -M prepared, using a SELECT on a hash
partitioned table with a good number of partitions to look up a single
row. The reason this becomes slow is that the planner will try a
generic plan on the 6th execution, which will lock every partition and
bloat the local lock table. From then on it will use a custom plan
which only locks a single leaf partition.

I just tried the following:

$ pgbench -i --partition-method=hash --partitions=1000 postgres

Master:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 21286.172326 (without initial connection time)

Patched:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 23034.063261 (without initial connection time)

If I try again with 10,000 partitions, I get:

Master:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 13044.290903 (without initial connection time)

Patched:
$ pgbench -T 60 -S -M prepared postgres | grep tps
tps = 22683.545686 (without initial connection time)

David
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi David,

This is a review of the "speed up releasing of locks" patch.

Contents & Purpose:
The Subject line is missing from the patch. It would have been easier to
understand its purpose had it been included. The patch includes a change
to the README, but no new tests are included.

Initial Run:
The patch applies cleanly to HEAD. The regression tests all pass
successfully against the new patch.

Nitpicking & conclusion:
I don't see any performance improvement in tests. Lots of comments were
removed which were not fully replaced. The change of the log level for
"ReleaseLockIfHeld: failed" from WARNING to PANIC is a mystery.

The change in the README doesn't look right:
`Any subsequent lockers are share lockers wait waiting for the VXID to terminate via some other method) is for deadlock`.
This sentence could be rewritten. Also, more comments could be added to
explain the new methods added.

Thanks,
Ankit
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Wed, 3 Aug 2022 at 07:04, Jacob Champion wrote: > This entry has been waiting on author input for a while (our current > threshold is roughly two weeks), so I've marked it Returned with > Feedback. Thanks for taking care of this. You dealt with this correctly based on the fact that I'd failed to rebase before or during the entire July CF. I'm still interested in having the LockReleaseAll slowness fixed, so here's a rebased patch. David diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c index c583539e0c..2c43b9e0c8 100644 --- a/src/backend/commands/discard.c +++ b/src/backend/commands/discard.c @@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel) ResetAllOptions(); DropAllPreparedStatements(); Async_UnlistenAll(); - LockReleaseAll(USER_LOCKMETHOD, true); + LockReleaseSession(USER_LOCKMETHOD); ResetPlanCache(); ResetTempTableNamespace(); ResetSequenceCaches(); diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README index d08ec6c402..563ba681e5 100644 --- a/src/backend/storage/lmgr/README +++ b/src/backend/storage/lmgr/README @@ -182,12 +182,6 @@ holdMask - subset of the PGPROC object's heldLocks mask (if the PGPROC is currently waiting for another lock mode on this lock). -releaseMask - -A bitmask for the lock modes due to be released during LockReleaseAll. -This must be a subset of the holdMask. Note that it is modified without -taking the partition LWLock, and therefore it is unsafe for any -backend except the one owning the PROCLOCK to examine/change it. - lockLink - List link for shared memory queue of all the PROCLOCK objects for the same LOCK. @@ -321,8 +315,7 @@ and will notice any weak lock we take when it does. Fast-path VXID locks do not use the FastPathStrongRelationLocks table. The first lock taken on a VXID is always the ExclusiveLock taken by its owner. -Any subsequent lockers are share lockers waiting for the VXID to terminate. 
-Indeed, the only reason VXID locks use the lock manager at all (rather than +Any subsequent lockers are share lockers waiting for the VXID to terminate via some other method) is for deadlock detection. Thus, the initial VXID lock can *always* be taken via the fast path without checking for conflicts. Any subsequent locker must check diff --git a/src/backend/storage/lmgr/lock.c b/src/backend/storage/lmgr/lock.c index 5f5803f681..381a2527f7 100644 --- a/src/backend/storage/lmgr/lock.c +++ b/src/backend/storage/lmgr/lock.c @@ -22,8 +22,7 @@ * Interface: * * InitLocks(), GetLocksMethodTable(), GetLockTagsMethodTable(), - * LockAcquire(), LockRelease(), LockReleaseAll(), - * LockCheckConflicts(), GrantLock() + * LockAcquire(), LockRelease(), LockCheckConflicts(), GrantLock() * *- */ @@ -289,6 +288,9 @@ static LOCALLOCK *awaitedLock; static ResourceOwner awaitedOwner; +static dlist_head session_locks[lengthof(LockMethods)]; + + #ifdef LOCK_DEBUG /*-- @@ -375,8 +377,9 @@ static void GrantLockLocal(LOCALLOCK *locallock, ResourceOwner owner); static void BeginStrongLockAcquire(LOCALLOCK *locallock, uint32 fasthashcode); static void FinishStrongLockAcquire(void); static void WaitOnLock(LOCALLOCK *locallock, ResourceOwner owner); -static void ReleaseLockIfHeld(LOCALLOCK *locallock, bool sessionLock); -static void LockReassignOwner(LOCALLOCK *locallock, ResourceOwner parent); +static void ReleaseLockIfHeld(LOCALLOCKOWNER *locallockowner, bool sessionLock); +static void LockReassignOwner(LOCALLOCKOWNER *locallockowner, + ResourceOwner parent); static bool UnGrantLock(LOCK *lock, LOCKMODE lockmode, PROCLOCK *proclock, LockMethod lockMethodTable); static void CleanUpLock(LOCK *lock, PROCLOCK *proclock, @@ -477,6 +480,10 @@ InitLocks(void) 16, &info, HASH_ELEM | HASH_BLOBS); + + /* Initialize each element of the session_locks array */ + for (int i = 0; i < lengthof(LockMethods); i++) + dlist_init(&session_locks[i]); } @@ -701,7 +708,7 @@ LockHasWaiters(const LOCKTAG *locktag, LOCKMODE
lockmode, bool sessionLock) { PROCLOCK_PRINT("LockHasWaiters: WRONGTYPE", proclock); LWLockRelease(partitionLock); - elog(WARNING, "you don't own a lock of type %s", + elog(PANIC, "you don't own a lock of type %s", lockMethodTable->lockModeNames[lockmode]); RemoveLocalLock(locallock); return false; @@ -839,26 +846,9 @@ LockAcquireExtended(const LOCKTAG
Re: Speed up transaction completion faster after many relations are accessed in a transaction
This entry has been waiting on author input for a while (our current threshold is roughly two weeks), so I've marked it Returned with Feedback. Once you think the patchset is ready for review again, you (or any interested party) can resurrect the patch entry by visiting https://commitfest.postgresql.org/38/3501/ and changing the status to "Needs Review", and then changing the status again to "Move to next CF". (Don't forget the second step; hopefully we will have streamlined this in the near future!) Thanks, --Jacob
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Wed, 6 Apr 2022 at 03:40, Yura Sokolov wrote: > I'm looking on patch and don't get some moments. > > `GrantLockLocal` allocates `LOCALLOCKOWNER` and links it into > `locallock->locallockowners`. It links it regardless `owner` could be > NULL. But then `RemoveLocalLock` does `Assert(locallockowner->owner != > NULL);`. > Why it should not fail? > > `GrantLockLocal` allocates `LOCALLOCKOWNER` in `TopMemoryContext`. > But there is single `pfree(locallockowner)` in `LockReassignOwner`. > Looks like there should be more `pfree`. Shouldn't they? > > `GrantLockLocal` does `dlist_push_tail`, but isn't it better to > do `dlist_push_head`? Resource owners usually form stack, so usually > when owner searches for itself it is last added to list. > Then `dlist_foreach` will find it sooner if it were added to the head. Thanks for having a look at this. It's a bit unrealistic for me to get a look at addressing these for v15. I've pushed this one out to the next CF. David
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Good day, David. I'm looking on patch and don't get some moments. `GrantLockLocal` allocates `LOCALLOCKOWNER` and links it into `locallock->locallockowners`. It links it regardless `owner` could be NULL. But then `RemoveLocalLock` does `Assert(locallockowner->owner != NULL);`. Why it should not fail? `GrantLockLocal` allocates `LOCALLOCKOWNER` in `TopMemoryContext`. But there is single `pfree(locallockowner)` in `LockReassignOwner`. Looks like there should be more `pfree`. Shouldn't they? `GrantLockLocal` does `dlist_push_tail`, but isn't it better to do `dlist_push_head`? Resource owners usually form stack, so usually when owner searches for itself it is last added to list. Then `dlist_foreach` will find it sooner if it were added to the head. regards - Yura Sokolov Postgres Professional y.soko...@postgrespro.ru funny.fal...@gmail.com
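On the dlist_push_tail vs. dlist_push_head question, the point can be demonstrated with a standalone sketch (this is not PostgreSQL's ilist.h; the struct and function names here are made up for illustration). With stack-like resource owner usage, pushing at the head puts the most recently registered owner first in iteration order, so a head-to-tail search finds it in one step:

```c
#include <assert.h>
#include <stddef.h>

/* Minimal intrusive circular doubly-linked list, illustration only. */
typedef struct node { struct node *prev, *next; int owner_id; } node;
typedef struct { node head; } list;

static void list_init(list *l) { l->head.prev = l->head.next = &l->head; }

static void push_head(list *l, node *n)
{
    n->prev = &l->head; n->next = l->head.next;
    l->head.next->prev = n; l->head.next = n;
}

static void push_tail(list *l, node *n)
{
    n->next = &l->head; n->prev = l->head.prev;
    l->head.prev->next = n; l->head.prev = n;
}

/* Walk from the head; return how many nodes we inspect to find owner_id. */
static int steps_to_find(list *l, int owner_id)
{
    int steps = 0;
    for (node *n = l->head.next; n != &l->head; n = n->next)
    {
        steps++;
        if (n->owner_id == owner_id)
            return steps;
    }
    return -1;
}
```

With three owners pushed in order 1, 2, 3, finding the most recent owner (3) takes three comparisons after push_tail but only one after push_head, which is the gain being suggested.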
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Sat, 1 Jan 2022 at 15:40, Zhihong Yu wrote: > + locallock->nLocks -= locallockowner->nLocks; > + Assert(locallock->nLocks >= 0); > > I think the assertion is not needed since the above code is in if block : > > + if (locallockowner->nLocks < locallock->nLocks) > > the condition, locallock->nLocks >= 0, would always hold after the > subtraction. That makes sense. I've removed the Assert in the attached patch. Thanks for looking at the patch. I've also spent a bit more time on it. There were quite a few outdated comments remaining. Also, the PROCLOCK releaseMask field appears to no longer be needed. I also did a round of benchmarking on the attached patch using a very recent master. I've attached .sql files and the script I used to benchmark. With 1024 partitions, lock1.sql shows about a 4.75% performance increase. This would become larger with more partitions and smaller with fewer partitions. With the patch, lock2.sql shows about a 10% performance increase over master. lock3.sql does not seem to have changed much and lock4.sql shows a small regression with the patch of about 2.5%. I'm not quite sure how worried we should be about lock4.sql slowing down slightly. 2.5% is fairly small given how hard I'm exercising the locking code in that test. There's also nothing much to say that the slowdown is not just due to code alignment changes. I also understand that Amit L is working on another patch that will improve the situation for lock1.sql by not taking the locks on relations that will be run-time pruned at executor startup. I think it's still worth solving this regardless of Amit's patch, as with current master we still have a situation where short fast queries which access a small number of tables can become slower once the backend has obtained a large number of locks concurrently and bloated the local lock table. As for the patch, I feel it's a pretty invasive change to how we release locks and the resowner code. I'd be quite happy for some review of it.
Here are the full results as output by the attached script.

drowley@amd3990x:~$ echo master
master
drowley@amd3990x:~$ ./lockbench.sh
lock1.sql
tps = 38078.011433 (without initial connection time)
tps = 38070.016792 (without initial connection time)
tps = 39223.118769 (without initial connection time)
tps = 37510.105287 (without initial connection time)
tps = 38164.394128 (without initial connection time)
lock2.sql
tps = 247.963797 (without initial connection time)
tps = 247.374174 (without initial connection time)
tps = 248.412744 (without initial connection time)
tps = 248.192629 (without initial connection time)
tps = 248.503728 (without initial connection time)
lock3.sql
tps = 1162.937692 (without initial connection time)
tps = 1160.968689 (without initial connection time)
tps = 1166.908643 (without initial connection time)
tps = 1160.288547 (without initial connection time)
tps = 1160.336572 (without initial connection time)
lock4.sql
tps = 282.173560 (without initial connection time)
tps = 284.470330 (without initial connection time)
tps = 286.089644 (without initial connection time)
tps = 285.548487 (without initial connection time)
tps = 284.313505 (without initial connection time)
drowley@amd3990x:~$ echo Patched
Patched
drowley@amd3990x:~$ ./lockbench.sh
lock1.sql
tps = 40338.975219 (without initial connection time)
tps = 39803.433365 (without initial connection time)
tps = 39504.824194 (without initial connection time)
tps = 39843.422438 (without initial connection time)
tps = 40624.483013 (without initial connection time)
lock2.sql
tps = 274.413309 (without initial connection time)
tps = 271.978813 (without initial connection time)
tps = 275.795091 (without initial connection time)
tps = 273.628649 (without initial connection time)
tps = 273.049977 (without initial connection time)
lock3.sql
tps = 1168.557054 (without initial connection time)
tps = 1168.139469 (without initial connection time)
tps = 1166.366440 (without initial connection time)
tps = 1165.464214 (without initial connection time)
tps = 1167.250809 (without initial connection time)
lock4.sql
tps = 274.842298 (without initial connection time)
tps = 277.911394 (without initial connection time)
tps = 278.702620 (without initial connection time)
tps = 275.715606 (without initial connection time)
tps = 278.816060 (without initial connection time)

David diff --git a/src/backend/commands/discard.c b/src/backend/commands/discard.c index c583539e0c..2c43b9e0c8 100644 --- a/src/backend/commands/discard.c +++ b/src/backend/commands/discard.c @@ -71,7 +71,7 @@ DiscardAll(bool isTopLevel) ResetAllOptions(); DropAllPreparedStatements(); Async_UnlistenAll(); - LockReleaseAll(USER_LOCKMETHOD, true); + LockReleaseSession(USER_LOCKMETHOD); ResetPlanCache(); ResetTempTableNamespace(); ResetSequenceCaches(); diff --git a/src/backend/storage/lmgr/README b/src/backend/storage/lmgr/README index
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Fri, Dec 31, 2021 at 5:45 PM David Rowley wrote: > On Fri, 3 Dec 2021 at 20:36, Michael Paquier wrote: > > Two months later, this has been switched to RwF. > > I was discussing this patch with Andres. He's not very keen on my > densehash hash table idea and suggested that instead of relying on > trying to make the hash table iteration faster, why don't we just > ditch the lossiness of the resource owner code which only records the > first 16 locks, and just make it have a linked list of all locks. > > This is a little more along the lines of the original patch, however, > it does not increase the size of the LOCALLOCK struct. > > I've attached a patch which does this. This was mostly written > (quickly) by Andres, I just did a cleanup, fixed up a few mistakes and > fixed a few bugs. > > I ran the same performance tests on this patch as I did back in [1]: > > -- Test 1. Select 1 record from a 140 partitioned table. Tests > creating a large number of locks with a fast query. > > create table hp (a int, b int) partition by hash(a); > select 'create table hp'||x||' partition of hp for values with > (modulus 140, remainder ' || x || ');' from generate_series(0,139)x; > create index on hp (b); > insert into hp select x,x from generate_series(1, 14) x; > analyze hp; > > select3.sql: select * from hp where b = 1 > > select3.sql master > > drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres > tps = 2099.708748 (without initial connection time) > tps = 2100.398516 (without initial connection time) > tps = 2094.882341 (without initial connection time) > tps = 2113.218820 (without initial connection time) > tps = 2104.717597 (without initial connection time) > > select3.sql patched > > drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres > tps = 2010.070738 (without initial connection time) > tps = 1994.963606 (without initial connection time) > tps = 1994.668849 (without initial connection time) > tps = 1995.948168 (without initial 
connection time) > tps = 1985.650945 (without initial connection time) > > You can see that there's a performance regression here. I've not yet > studied why this appears. > > -- Test 2. Tests a prepared query which will perform a generic plan on > the 6th execution then fallback on a custom plan due to it pruning all > but one partition. Master suffers from the lock table becoming > bloated after locking all partitions when planning the generic plan. > > create table ht (a int primary key, b int, c int) partition by hash (a); > select 'create table ht' || x::text || ' partition of ht for values > with (modulus 8192, remainder ' || (x)::text || ');' from > generate_series(0,8191) x; > \gexec > > select.sql: > \set p 1 > select * from ht where a = :p > > select.sql master > drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres > tps = 18014.460090 (without initial connection time) > tps = 17973.358889 (without initial connection time) > tps = 17847.480647 (without initial connection time) > tps = 18038.332507 (without initial connection time) > tps = 17776.143206 (without initial connection time) > > select.sql patched > drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres > tps = 32393.457106 (without initial connection time) > tps = 32277.204349 (without initial connection time) > tps = 32160.719830 (without initial connection time) > tps = 32530.038130 (without initial connection time) > tps = 32299.019657 (without initial connection time) > > You can see that there are some quite good performance gains with this > test. > > I'm going to add this to the January commitfest. 
> > David > > [1] > https://www.postgresql.org/message-id/CAKJS1f8Lt0kS4bb5EH%3DhV%2BksqBZNnmVa8jujoYBYu5PVhWbZZg%40mail.gmail.com Hi, + locallock->nLocks -= locallockowner->nLocks; + Assert(locallock->nLocks >= 0); I think the assertion is not needed since the above code is in if block : + if (locallockowner->nLocks < locallock->nLocks) the condition, locallock->nLocks >= 0, would always hold after the subtraction. Cheers
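The redundancy Zhihong points out can be illustrated with a simplified standalone sketch (hypothetical struct and function names, not the actual lock.c code): inside the branch guarded by locallockowner->nLocks < locallock->nLocks, the remainder after the subtraction is strictly positive, so an Assert(nLocks >= 0) there can never fire.

```c
#include <assert.h>

/* Hypothetical simplified counters; not the real LOCALLOCK structures. */
typedef struct { int nLocks; } LocalLockSketch;
typedef struct { int nLocks; } LocalLockOwnerSketch;

/*
 * Release one owner's share of a local lock's reference counts.
 * Returns 1 if the lock is now fully released, 0 if counts remain.
 */
static int
release_owner_counts(LocalLockSketch *locallock, LocalLockOwnerSketch *owner)
{
    if (owner->nLocks < locallock->nLocks)
    {
        /*
         * Partial release: owner->nLocks < locallock->nLocks guarantees
         * the remainder is strictly positive, so no assertion is needed.
         */
        locallock->nLocks -= owner->nLocks;
        return 0;
    }
    /* The owner held every remaining count; release the lock entirely. */
    locallock->nLocks = 0;
    return 1;
}
```

A quick walk-through: with 5 total counts and an owner holding 2, a partial release leaves 3; a second owner holding the remaining 3 triggers the full-release path.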
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Fri, 3 Dec 2021 at 20:36, Michael Paquier wrote: > Two months later, this has been switched to RwF. I was discussing this patch with Andres. He's not very keen on my densehash hash table idea and suggested that instead of relying on trying to make the hash table iteration faster, why don't we just ditch the lossiness of the resource owner code which only records the first 16 locks, and just make it have a linked list of all locks. This is a little more along the lines of the original patch, however, it does not increase the size of the LOCALLOCK struct. I've attached a patch which does this. This was mostly written (quickly) by Andres, I just did a cleanup, fixed up a few mistakes and fixed a few bugs. I ran the same performance tests on this patch as I did back in [1]: -- Test 1. Select 1 record from a 140 partitioned table. Tests creating a large number of locks with a fast query. create table hp (a int, b int) partition by hash(a); select 'create table hp'||x||' partition of hp for values with (modulus 140, remainder ' || x || ');' from generate_series(0,139)x; create index on hp (b); insert into hp select x,x from generate_series(1, 14) x; analyze hp; select3.sql: select * from hp where b = 1 select3.sql master drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2099.708748 (without initial connection time) tps = 2100.398516 (without initial connection time) tps = 2094.882341 (without initial connection time) tps = 2113.218820 (without initial connection time) tps = 2104.717597 (without initial connection time) select3.sql patched drowley@amd3990x:~$ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2010.070738 (without initial connection time) tps = 1994.963606 (without initial connection time) tps = 1994.668849 (without initial connection time) tps = 1995.948168 (without initial connection time) tps = 1985.650945 (without initial connection time) You can see that there's a performance regression here. 
I've not yet studied why this appears. -- Test 2. Tests a prepared query which will perform a generic plan on the 6th execution then fall back to a custom plan due to it pruning all but one partition. Master suffers from the lock table becoming bloated after locking all partitions when planning the generic plan. create table ht (a int primary key, b int, c int) partition by hash (a); select 'create table ht' || x::text || ' partition of ht for values with (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,8191) x; \gexec select.sql: \set p 1 select * from ht where a = :p select.sql master drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 18014.460090 (without initial connection time) tps = 17973.358889 (without initial connection time) tps = 17847.480647 (without initial connection time) tps = 18038.332507 (without initial connection time) tps = 17776.143206 (without initial connection time) select.sql patched drowley@amd3990x:~$ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 32393.457106 (without initial connection time) tps = 32277.204349 (without initial connection time) tps = 32160.719830 (without initial connection time) tps = 32530.038130 (without initial connection time) tps = 32299.019657 (without initial connection time) You can see that there are some quite good performance gains with this test. I'm going to add this to the January commitfest. David [1] https://www.postgresql.org/message-id/CAKJS1f8Lt0kS4bb5EH%3DhV%2BksqBZNnmVa8jujoYBYu5PVhWbZZg%40mail.gmail.com speedup_releasing_all_locks.patch Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Fri, Oct 01, 2021 at 04:03:09PM +0900, Michael Paquier wrote: > This last update was two months ago, and the patch has not moved > since: > https://commitfest.postgresql.org/34/3220/ > > Do you have plans to work more on that or perhaps the CF entry should > be withdrawn or RwF'd? Two months later, this has been switched to RwF. -- Michael signature.asc Description: PGP signature
Re: Speed up transaction completion faster after many relations are accessed in a transaction
I've made some remarks in a related thread: https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1FB976EF@G01JPEXMBYT05 The new status of this patch is: Waiting on Author
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Tue, Jul 20, 2021 at 05:04:19PM +1200, David Rowley wrote: > On Mon, 12 Jul 2021 at 19:23, David Rowley wrote: > > I also adjusted the hash seq scan code so that it performs better when > > faced a non-sparsely populated table. Previously my benchmark for > > that case didn't do well [2]. > > I was running some select only pgbench tests today on an AMD 3990x > machine with a large number of processes. > > I saw that LockReleaseAll was coming up on the profile a bit at: This last update was two months ago, and the patch has not moved since: https://commitfest.postgresql.org/34/3220/ Do you have plans to work more on that or perhaps the CF entry should be withdrawn or RwF'd? -- Michael signature.asc Description: PGP signature
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 12 Jul 2021 at 19:23, David Rowley wrote: > I also adjusted the hash seq scan code so that it performs better when > faced a non-sparsely populated table. Previously my benchmark for > that case didn't do well [2]. I was running some select-only pgbench tests today on an AMD 3990x machine with a large number of processes. I saw that LockReleaseAll was coming up on the profile a bit at: Master: 0.77% postgres [.] LockReleaseAll I wondered if this patch would help, so I tried it and got: dense hash lockrelease all: 0.67% postgres [.] LockReleaseAll It's a very small improvement which translated to about a 0.62% gain in tps. It made me think it might be worth doing something about this, as LockReleaseAll can show up even when releasing small numbers of locks. pgbench -T 240 -P 10 -c 132 -j 132 -S -M prepared --random-seed=12345 postgres

Units = tps

Sec    master       dense hash LockReleaseAll
10     3758201.2    3713521.5     98.81%
20     3810125.5    3844142.9    100.89%
30     3806505.1    3848458      101.10%
40     3816094.8    3855706.6    101.04%
50     3820317.2    3851717.7    100.82%
60     3827809      3851499.4    100.62%
70     3828757.9    3849312      100.54%
80     3824492.1    3852378.8    100.73%
90     3816502.1    3854793.8    101.00%
100    3819124.1    3860418.6    101.08%
110    3816154.3    3845327.7    100.76%
120    3817070.5    3845842.5    100.75%
130    3815424.7    3847626      100.84%
140    3823631.1    3846760.6    100.60%
150    3820963.8    3840196.6    100.50%
160    3827737      3841149.3    100.35%
170    3827779.2    3840130.9    100.32%
180    3829352      3842814.5    100.35%
190    3825518.3    3841991      100.43%
200    3823477.2    3839390.7    100.42%
210    3809304.3    3836433.5    100.71%
220    3814328.5    3842073.7    100.73%
230    3811399.3    3843780.7    100.85%
avg    3816959.53   3840672.478  100.62%

David
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 21 Jun 2021 at 01:56, David Rowley wrote: > # A new hashtable implementation > Also, it would be good to hear what people think about solving the > problem this way. Because over on [1] I'm also trying to improve the performance of smgropen(), I posted the patch for the new hash table over there too. Between that thread and discussions with Thomas and Andres off-list, I get the idea that pretty much nobody likes the idea of having another hash table implementation. Thomas wants to solve it another way and Andres has concerns that working with bitmaps is too slow. Andres suggested freelists would be faster, but I'm not really a fan of the idea as, unless I have a freelist per array segment, I won't be able to quickly identify the earliest segment slot to re-fill unused slots with. That would mean memory would get more fragmented over time instead of the fragmentation being slowly fixed as new items are added after deletes. So I've not really tried implementing that to see how it performs. Both Andres and Thomas expressed a dislike of the name "generichash" too. Anyway, since I did make a few small changes to the hash table implementation before doing all that off-list talking, I thought I should at least post where I got to here so that anything else that comes up can get compared to where I got to, instead of where I was with this. I did end up renaming the hash table to "densehash" rather than generichash. Naming is hard, but I went with dense as memory density was on my mind when I wrote it. Having a compact 8-byte bucket width and packing the data into arrays in a dense way. The word dense came up a few times, so I went with that. I also adjusted the hash seq scan code so that it performs better when faced with a non-sparsely populated table. Previously my benchmark for that case didn't do well [2]. I've attached the benchmark results from running the benchmark that's included in hashbench.tar.bz2.
I ran this 10 times using the included test.sh with ./test.sh 10. I included the results I got on my AMD machine in the attached bz2 file in results.csv. You can see from the attached dense_vs_generic_vs_simple.png that dense hash is quite comparable to simplehash for inserts/deletes and lookups. It's not quite as fast as simplehash at iterations when the table is not bloated, but blows simplehash out of the water when the hashtables have become bloated due to having once contained a large number of records but no longer do. Anyway, unless there is some interest in me taking this idea further, then due to the general feedback received on [1], I'm not planning on pushing this any further. I'll leave the commitfest entry as is for now to give others a chance to see this. David [1] https://postgr.es/m/CAApHDvowgRaQupC%3DL37iZPUzx1z7-N8deD7TxQSm8LR%2Bf4L3-A%40mail.gmail.com [2] https://www.postgresql.org/message-id/CAApHDvpuzJTQNKQ_bnAccvi-68xuh%2Bv87B4P6ycU-UiN0dqyTg%40mail.gmail.com hashbench.tar.bz2 Description: Binary data densehash_for_lockreleaseall.patch Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Thanks for having a look at this. On Mon, 21 Jun 2021 at 05:02, Zhihong Yu wrote: > + * GH_ELEMENT_TYPE defines the data type that the hashtable stores. Each > + * instance of GH_ELEMENT_TYPE which is stored in the hash table is done so > + * inside a GH_SEGMENT. > > I think the second sentence can be written as (since done means stored, it is > redundant): I've reworded this entire paragraph slightly. > Each instance of GH_ELEMENT_TYPE is stored in the hash table inside a > GH_SEGMENT. > > + * Macros for translating a bucket's index into the segment and another to > + * determine the item number within the segment. > + */ > +#define GH_INDEX_SEGMENT(i)(i) / GH_ITEMS_PER_SEGMENT > > into the segment -> into the segment number (in the code I see segidx but I > wonder if segment index may cause slight confusion). I've adjusted this comment. > + GH_BITMAP_WORD used_items[GH_BITMAP_WORDS]; /* A 1-bit for each used item > +* in the items array */ > > 'A 1-bit' -> One bit (A and 1 mean the same) I think you might have misread this. We're storing a 1-bit for each used item rather than a 0-bit. If I remove the 'A' then it's not clear what the meaning of each bit's value is. > + uint32 first_free_segment; > > Since the segment may not be totally free, maybe name the field > first_segment_with_free_slot I don't really like that. It feels too long to me. > + * This is similar to GH_NEXT_ONEBIT but flips the bits before operating on > + * each GH_BITMAP_WORD. > > It seems the only difference from GH_NEXT_ONEBIT is in this line: > > + GH_BITMAP_WORD word = ~(words[wordnum] & mask); /* flip bits */ > > If a 4th parameter is added to signify whether the flipping should be done, > these two functions can be unified. I don't want to do that. I'd rather have them separate to ensure the compiler does not create any additional needless branching. Those functions are pretty hot. > +* next insert will store in this segment. 
If it's already pointing to an > +* earlier segment, then leave it be. > > The last sentence is confusing: first_free_segment cannot point to some > segment and earlier segment at the same time. > Maybe drop the last sentence. I've adjusted this comment to become: * Check if we need to lower the first_free_segment. We want new inserts * to fill the lower segments first, so only change the first_free_segment * if the removed entry was stored in an earlier segment. Thanks for having a look at this. I'll attach an updated patch soon. David
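For readers following the GH_NEXT_ONEBIT discussion, the bit-scanning pattern looks roughly like this standalone sketch (illustrative only, not the patch's actual code; it uses the GCC/Clang __builtin_ctzll builtin). The zero-bit variant, used to locate a free slot, is the same loop with the word complemented first, which is why keeping the two functions separate avoids an extra branch in a hot loop:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t bitmap_word;
#define BITS_PER_WORD 64

/*
 * Return the index of the first 1-bit at or after 'start' in the
 * 'nwords'-long bitmap 'words', or -1 if there is none.
 */
static int
next_one_bit(const bitmap_word *words, int nwords, int start)
{
    int wordnum = start / BITS_PER_WORD;

    if (wordnum < nwords)
    {
        /* Mask off bits below 'start' in the first word we inspect. */
        bitmap_word mask = ~(bitmap_word) 0 << (start % BITS_PER_WORD);
        bitmap_word word = words[wordnum] & mask;

        for (;;)
        {
            if (word != 0)
                return wordnum * BITS_PER_WORD +
                    __builtin_ctzll(word);      /* count trailing zeros */
            if (++wordnum >= nwords)
                break;
            word = words[wordnum];
        }
    }
    return -1;
}

/* The free-slot variant: identical loop, but with the bits flipped. */
static int
next_zero_bit(const bitmap_word *words, int nwords, int start)
{
    int wordnum = start / BITS_PER_WORD;

    if (wordnum < nwords)
    {
        bitmap_word mask = ~(bitmap_word) 0 << (start % BITS_PER_WORD);
        bitmap_word word = ~words[wordnum] & mask;  /* flip bits */

        for (;;)
        {
            if (word != 0)
                return wordnum * BITS_PER_WORD + __builtin_ctzll(word);
            if (++wordnum >= nwords)
                break;
            word = ~words[wordnum];
        }
    }
    return -1;
}
```

Flipping the bits turns the used-item bitmap into a free-list without storing anything extra, which is the property the patch relies on when reusing deleted slots.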
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Sun, Jun 20, 2021 at 6:56 AM David Rowley wrote: > On Wed, 14 Aug 2019 at 19:25, David Rowley > wrote: > > For now, I'm out of ideas. If anyone else feels like suggesting > > something of picking this up, feel free. > > This is a pretty old thread, so we might need a recap: > > # Recap > > Basically LockReleaseAll() becomes slow after a large number of locks > have all been held at once by a backend. The problem is that the > LockMethodLocalHash dynahash table must grow to store all the locks > and when later transactions only take a few locks, LockReleaseAll() is > slow due to hash_seq_search() having to skip the sparsely filled > buckets in the bloated hash table. > > The following things were tried on this thread. Each one failed: > > 1) Use a dlist in LOCALLOCK to track the next and prev LOCALLOCK. > Simply loop over the dlist in LockReleaseAll(). > 2) Try dropping and recreating the dynahash table if it becomes > bloated using some heuristics to decide what "bloated" means and if > recreating is worthwhile. > > #1 failed due to concerns with increasing the size of LOCALLOCK to > store the dlist pointers. Performance regressions were seen too. > Possibly due to size increase or additional overhead from pushing onto > the dlist. > #2 failed because it was difficult to agree on what the heuristics > would be and we had no efficient way to determine the maximum number > of locks that a given transaction held at any one time. We only know > how many were still held at LockReleaseAll(). > > There were also some suggestions to fix dynahash's hash_seq_search() > slowness, and also a suggestion to try using simplehash.h instead of > dynahash.c. Unfortunately simplehash.h would suffer the same issues as > it too would have to skip over empty buckets in a sparsely populated > hash table. > > I'd like to revive this effort as I have a new idea on how to solve the > problem. 
> > # Background > > Over in [1] I'm trying to improve the performance of smgropen() during > recovery. The slowness there comes from the dynahash table lookups to > find the correct SMgrRelation. Over there I proposed to use simplehash > instead of dynahash because it's quite a good bit faster and far > lessens the hash lookup overhead during recovery. One problem on that > thread is that relcache keeps a pointer into the SMgrRelation > (RelationData.rd_smgr) and because simplehash moves things around > during inserts and deletes, then we can't have anything point to > simplehash entries, they're unstable. I fixed that over on the other > thread by having the simplehash entry point to a palloced > SMgrRelationData... My problem is, I don't really like that idea as it > means we need to palloc() pfree() lots of little chunks of memory. > > To fix the above, I really think we need a version of simplehash that > has stable pointers. Providing that implementation is faster than > dynahash, then it will help solve the smgropen() slowness during > recovery. > > # A new hashtable implementation > > I ended up thinking of this thread again because the implementation of > the stable pointer hash that I ended up writing for [1] happens to be > lightning fast for hash table sequential scans, even if the table has > become bloated. The reason the seq scans are so fast is that the > implementation loops over the data arrays, which are tightly packed > and store the actual data rather than pointers to the data. The code > does not need to loop over the bucket array for this at all, so how > large that has become is irrelevant to hash table seq scan > performance. > > The patch stores elements in "segments" which is set to some power of > 2 value. When we run out of space to store new items in a segment, we > just allocate another segment. When we remove items from the table, > new items reuse the first unused item in the first segment with free > space. 
This helps to keep the used elements tightly packed. A > segment keeps a bitmap of used items so that means scanning all used > items is very fast. If you flip the bits in the used_item bitmap, > then you get a free list for that segment, so it's also very fast to > find a free element when inserting a new item into the table. > > I've called the hash table implementation "generichash". It uses the > same preprocessor tricks as simplehash.h does and borrows the same > linear probing code that's used in simplehash. The bucket array just > stores the hash value and a uint32 index into the segment item that > stores the data. Since segments store a power of 2 items, we can > easily address both the segment number and the item within the segment > from the single uint32 index value. The 'index' field just stores a > special value when the bucket is empty. No need to add another field > for that. This means the bucket type is just 8 bytes wide. > > One thing I will mention about the new hash table implementation is > that GH_ITEMS_PER_SEGMENT is, by default, set to 256. This means > that's
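The bucket layout described in this message amounts to a power-of-2 split of a single uint32 index, which the compiler reduces to a shift and a mask. A sketch (illustrative names loosely mirroring GH_INDEX_SEGMENT, not the patch's actual code; 256 items per segment as the mail mentions, and the 8-byte bucket holding just a hash value and an index):

```c
#include <assert.h>
#include <stdint.h>

#define ITEMS_PER_SEGMENT 256           /* must be a power of 2 */

/* Which segment an item index falls in, and its slot within it. */
#define INDEX_SEGMENT(i)  ((i) / ITEMS_PER_SEGMENT)
#define INDEX_ITEM(i)     ((i) % ITEMS_PER_SEGMENT)

/*
 * An 8-byte bucket: the hash value plus a uint32 index into the
 * segment arrays.  A special sentinel index value marks an empty
 * bucket, so no extra "in use" field is needed.
 */
typedef struct
{
    uint32_t hashvalue;
    uint32_t index;
} bucket;
```

Because the buckets only hold indexes, the bucket array can be grown or rebuilt freely without moving the items themselves, which is what gives the table its stable pointers.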
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Wed, 14 Aug 2019 at 19:25, David Rowley wrote: > For now, I'm out of ideas. If anyone else feels like suggesting > something or picking this up, feel free.

This is a pretty old thread, so we might need a recap:

# Recap

Basically, LockReleaseAll() becomes slow after a large number of locks have all been held at once by a backend. The problem is that the LockMethodLocalHash dynahash table must grow to store all the locks, and when later transactions take only a few locks, LockReleaseAll() is slow due to hash_seq_search() having to skip the sparsely filled buckets in the bloated hash table.

The following things were tried on this thread. Each one failed:

1) Use a dlist in LOCALLOCK to track the next and prev LOCALLOCK. Simply loop over the dlist in LockReleaseAll().

2) Try dropping and recreating the dynahash table if it becomes bloated, using some heuristics to decide what "bloated" means and whether recreating is worthwhile.

#1 failed due to concerns about increasing the size of LOCALLOCK to store the dlist pointers. Performance regressions were seen too, possibly due to the size increase or the additional overhead of pushing onto the dlist.

#2 failed because it was difficult to agree on what the heuristics would be, and we had no efficient way to determine the maximum number of locks that a given transaction held at any one time. We only know how many were still held at LockReleaseAll().

There were also some suggestions to fix dynahash's hash_seq_search() slowness, and also a suggestion to try using simplehash.h instead of dynahash.c. Unfortunately, simplehash.h would suffer the same issue, as it too would have to skip over empty buckets in a sparsely populated hash table.

I'd like to revive this effort as I have a new idea on how to solve the problem.

# Background

Over in [1] I'm trying to improve the performance of smgropen() during recovery. The slowness there comes from the dynahash table lookups to find the correct SMgrRelation.
Over there I proposed to use simplehash instead of dynahash because it's quite a bit faster and far lessens the hash lookup overhead during recovery. One problem on that thread is that relcache keeps a pointer into the SMgrRelation (RelationData.rd_smgr), and because simplehash moves things around during inserts and deletes, we can't have anything point to simplehash entries; they're unstable. I fixed that over on the other thread by having the simplehash entry point to a palloc'd SMgrRelationData... My problem is, I don't really like that idea, as it means we need to palloc()/pfree() lots of little chunks of memory.

To fix the above, I really think we need a version of simplehash that has stable pointers. Provided that implementation is faster than dynahash, it will help solve the smgropen() slowness during recovery.

# A new hashtable implementation

I ended up thinking of this thread again because the implementation of the stable pointer hash that I ended up writing for [1] happens to be lightning fast for hash table sequential scans, even if the table has become bloated. The reason the seq scans are so fast is that the implementation loops over the data arrays, which are tightly packed and store the actual data rather than pointers to the data. The code does not need to loop over the bucket array for this at all, so how large that has become is irrelevant to hash table seq scan performance.

The patch stores elements in "segments", each sized to some power of 2. When we run out of space to store new items in a segment, we just allocate another segment. When we remove items from the table, new items reuse the first unused item in the first segment with free space. This helps to keep the used elements tightly packed. A segment keeps a bitmap of used items, which means scanning all used items is very fast.
If you flip the bits in the used_item bitmap, then you get a free list for that segment, so it's also very fast to find a free element when inserting a new item into the table.

I've called the hash table implementation "generichash". It uses the same preprocessor tricks as simplehash.h does and borrows the same linear probing code that's used in simplehash. The bucket array just stores the hash value and a uint32 index into the segment item that stores the data. Since segments store a power of 2 number of items, we can easily address both the segment number and the item within the segment from the single uint32 index value. The 'index' field just stores a special value when the bucket is empty. No need to add another field for that. This means the bucket type is just 8 bytes wide.

One thing I will mention about the new hash table implementation is that GH_ITEMS_PER_SEGMENT is, by default, set to 256. This means that's the minimum size for the table. I could drop this down to 64, but that's still quite a bit more than the default size of the dynahash table of 16. I think 16 is a bit on the low side and that it might be worth making it 64 anyway.
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi, the patch has been in WoA since December, waiting for a rebase. I've marked it as returned with feedback. Feel free to re-submit an updated version into the next CF. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Thu, Sep 26, 2019 at 07:11:53AM +, Tsunakawa, Takayuki wrote: > I'm sorry to repeat what I mentioned in my previous mail, but my v2 > patch's approach is based on the database textbook and seems > intuitive. So I attached the rebased version. If you wish to do so, that's fine by me, but I have not dived into the details of the thread much. Please note anyway that the patch does not apply anymore and that it needs a rebase. So for now I have moved the patch to the next CF, waiting on author. -- Michael signature.asc Description: PGP signature
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com] > On 2019-Sep-03, Tsunakawa, Takayuki wrote: > > I don't think it's rejected. It would be a pity (mottainai) to refuse > > this, because it provides significant speedup despite its simple > > modification. > > I don't necessarily disagree with your argumentation, but Travis is > complaining thusly:

I tried to revise David's latest patch (v8) and address Tom's comments in his last mail. But I'm a bit at a loss.

First, to accurately count the maximum number of acquired locks in a transaction, we need to track the maximum number of entries in the hash table and make it available via a new function like hash_get_max_entries(). However, to cover the shared partitioned hash table (which is not necessary for LockMethodLocalHash), we would have to add a spinlock in hashhdr and lock/unlock it when entering and removing entries in the hash table. That would spoil the effort to decrease contention via hashhdr->freelists[].mutex. Do we want to track the maximum number of acquired locks in a global variable in lock.c, not in the hash table?

Second, I couldn't understand the comment about the fill factor well. I can understand that it's not correct to compare the number of hash buckets and the number of locks. But what can we do?

I'm sorry to repeat what I mentioned in my previous mail, but my v2 patch's approach is based on the database textbook and seems intuitive. So I attached the rebased version.

Regards Takayuki Tsunakawa

faster-locallock-scan_v3.patch Description: faster-locallock-scan_v3.patch
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 2019-Sep-03, Tsunakawa, Takayuki wrote: > From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com] > > Hmm ... is this patch rejected, or is somebody still trying to get it to > > committable state? David, you're listed as committer. > > I don't think it's rejected. It would be a pity (mottainai) to refuse > this, because it provides significant speedup despite its simple > modification. I don't necessarily disagree with your argumentation, but Travis is complaining thusly:

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Werror=vla -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -g -O2 -Wall -Werror -I../../../../src/include -I/usr/include/x86_64-linux-gnu -D_GNU_SOURCE -c -o lock.o lock.c
lock.c:486:1: error: conflicting types for ‘TryShrinkLocalLockHash’
 TryShrinkLocalLockHash(long numLocksHeld)
 ^
lock.c:351:20: note: previous declaration of ‘TryShrinkLocalLockHash’ was here
 static inline void TryShrinkLocalLockHash(void);
 ^
: recipe for target 'lock.o' failed

Please fix. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Alvaro Herrera [mailto:alvhe...@2ndquadrant.com] > Hmm ... is this patch rejected, or is somebody still trying to get it to > committable state? David, you're listed as committer. I don't think it's rejected. It would be a pity (mottainai) to refuse this, because it provides significant speedup despite its simple modification. Again, I think the v2 patch is OK. Tom's comment was as follows: [Tom's comment against v2] FWIW, I tried this patch against current HEAD (959d00e9d). Using the test case described by Amit at I do measure an undeniable speedup, close to 35%. However ... I submit that that's a mighty extreme test case. (I had to increase max_locks_per_transaction to get it to run at all.) We should not be using that sort of edge case to drive performance optimization choices. If I reduce the number of partitions in Amit's example from 8192 to something more real-world, like 128, I do still measure a performance gain, but it's ~ 1.5% which is below what I'd consider a reproducible win. I'm accustomed to seeing changes up to 2% in narrow benchmarks like this one, even when "nothing changes" except unrelated code. Trying a standard pgbench test case (pgbench -M prepared -S with one client and an -s 10 database), it seems that the patch is about 0.5% slower than HEAD. Again, that's below the noise threshold, but it's not promising for the net effects of this patch on workloads that aren't specifically about large and prunable partition sets. I'm also fairly concerned about the effects of the patch on sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88 bytes, a 22% increase. That's a lot if you're considering cases with many locks. On the whole I don't think there's an adequate case for committing this patch. I'd also point out that this is hardly the only place where we've seen hash_seq_search on nearly-empty hash tables become a bottleneck. So I'm not thrilled about attacking that with one-table-at-time patches. 
I'd rather see us do something to let hash_seq_search win across the board.

* Extreme test case: Not extreme. Two of our customers, who are considering using PostgreSQL, are using thousands of partitions now. We hit this issue -- a point query gets nearly 20% slower after automatically creating a generic plan. That's the reason for this proposal.

* 0.5% slowdown with pgbench: I think it's below the noise, as Tom said.

* sizeof(LOCALLOCK): As Andres replied to Tom in the immediately following mail, LOCALLOCK was bigger in PG 11.

* Use case is narrow: No. The bloated LockMethodLocalHash affects the performance of the items below as well as transaction commit/abort:
- AtPrepare_Locks() and PostPrepare_Locks(): the hash table is scanned twice in PREPARE!
- LockReleaseSession: advisory lock
- LockReleaseCurrentOwner: ??
- LockReassignCurrentOwner: ??

Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Wed, Aug 14, 2019 at 07:25:10PM +1200, David Rowley wrote: On Thu, 25 Jul 2019 at 05:49, Tom Lane wrote: On the whole, I don't especially like this approach, because of the confusion between peak lock count and end-of-xact lock count. That seems way too likely to cause problems. Thanks for having a look at this. I've not addressed the points you've mentioned due to what you mention above. The only way I can think of so far to resolve that would be to add something to track peak lock usage. The best I can think of to do that, short of adding something to dynahash.c is to check how many locks are held each time we obtain a lock, then if that count is higher than the previous time we checked, then update the maximum locks held, (probably a global variable). That seems pretty horrible to me and adds overhead each time we obtain a lock, which is a pretty performance-critical path. Would it really be a measurable overhead? I mean, we only really need one int counter, and you don't need to do the check on every lock acquisition - you just need to recheck on the first lock release. But maybe I'm underestimating how expensive it is ... Talking about dynahash - doesn't it already track this information? Maybe not directly but surely it has to track the number of entries in the hash table, in order to compute fill factor. Can't we piggy-back on that and track the highest fill-factor for a particular period of time? regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Thu, 25 Jul 2019 at 05:49, Tom Lane wrote: > On the whole, I don't especially like this approach, because of the > confusion between peak lock count and end-of-xact lock count. That > seems way too likely to cause problems. Thanks for having a look at this. I've not addressed the points you've mentioned due to what you mention above. The only way I can think of so far to resolve that would be to add something to track peak lock usage. The best I can think of to do that, short of adding something to dynahash.c, is to check how many locks are held each time we obtain a lock, then, if that count is higher than the previous time we checked, update the maximum locks held (probably a global variable). That seems pretty horrible to me and adds overhead each time we obtain a lock, which is a pretty performance-critical path. I've not tested what Andres mentioned about simplehash instead of dynahash yet. I did a quick scan of simplehash and it looked like SH_START_ITERATE would suffer the same problems as dynahash's hash_seq_search(), albeit perhaps with some more efficient memory lookups, i.e. it still has to skip over empty buckets, which might be costly in a bloated table. For now, I'm out of ideas. If anyone else feels like suggesting something or picking this up, feel free. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Thu, Jul 25, 2019 at 5:49 AM Tom Lane wrote: > David Rowley writes: > > Here's a more polished version with the debug code removed, complete > > with benchmarks. > > A few gripes: > > [gripes] Based on the above, I've set this to "Waiting on Author", in the next CF. -- Thomas Munro https://enterprisedb.com
Re: Speed up transaction completion faster after many relations are accessed in a transaction
David Rowley writes: > Here's a more polished version with the debug code removed, complete > with benchmarks. A few gripes: You're measuring the number of locks held at completion of the transaction, which fails to account for locks transiently taken and released, so that the actual peak usage will be more. I think we mostly only do that for system catalog accesses, so *maybe* the number of extra locks involved isn't very large, but that's a very shaky assumption. I don't especially like the fact that this seems to have a hard-wired (and undocumented) assumption that buckets == entries, ie that the fillfactor of the table is set at 1.0. lock.c has no business knowing that. Perhaps instead of returning the raw bucket count, you could have dynahash.c return bucket count times fillfactor, so that the number is in the same units as the entry count. This: running_avg_locks -= running_avg_locks / 10.0; running_avg_locks += numLocksHeld / 10.0; seems like a weird way to do the calculation. Personally I'd write running_avg_locks += (numLocksHeld - running_avg_locks) / 10.0; which is more the way I'm used to seeing exponential moving averages computed --- at least, it seems clearer to me why this will converge towards the average value of numLocksHeld over time. It also makes it clear that it wouldn't be sane to use two different divisors, whereas your formulation makes it look like maybe they could be set independently. Your compiler might not complain that LockReleaseAll's numLocksHeld is potentially uninitialized, but other people's compilers will. On the whole, I don't especially like this approach, because of the confusion between peak lock count and end-of-xact lock count. That seems way too likely to cause problems. regards, tom lane
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Wed, 24 Jul 2019 at 16:16, David Rowley wrote: > > On Wed, 24 Jul 2019 at 15:05, David Rowley > wrote: > > To be able to reduce the threshold down again we'd need to make a > > hash_get_num_entries(LockMethodLocalHash) call before performing the > > guts of LockReleaseAll(). We could then weight that onto some running > > average counter with a weight of, say... 10, so we react to changes > > fairly quickly, but not instantly. We could then have some sort of > > logic like "rebuild the hash table if running average 4 times less > > than max_bucket" > > > > I've attached a spreadsheet of that idea and the algorithm we could > > use to track the running average. Initially, I've mocked it up a > > series of transactions that use 1000 locks, then at row 123 dropped > > that to 10 locks. If we assume the max_bucket is 1000, then it takes > > until row 136 for the running average to drop below the max_bucket > > count, i.e 13 xacts. There we'd reset there at the init size of 16. If > > the average went up again, then we'd automatically expand the table as > > we do now. To make this work we'd need an additional call to > > hash_get_num_entries(), before we release the locks, so there is more > > overhead. > > Here's a patch with this implemented. I've left a NOTICE in there to > make it easier for people to follow along at home and see when the > lock table is reset. Here's a more polished version with the debug code removed, complete with benchmarks. -- Test 1. Select 1 record from a 140 partitioned table. Tests creating a large number of locks with a fast query. 
create table hp (a int, b int) partition by hash(a); select 'create table hp'||x||' partition of hp for values with (modulus 140, remainder ' || x || ');' from generate_series(0,139)x; create index on hp (b); insert into hp select x,x from generate_series(1, 14) x; analyze hp; select3.sql: select * from hp where b = 1 - Master: $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2098.628975 (excluding connections establishing) tps = 2101.797672 (excluding connections establishing) tps = 2085.317292 (excluding connections establishing) tps = 2094.931999 (excluding connections establishing) tps = 2092.328908 (excluding connections establishing) Patched: $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2101.691459 (excluding connections establishing) tps = 2104.533249 (excluding connections establishing) tps = 2106.499123 (excluding connections establishing) tps = 2104.033459 (excluding connections establishing) tps = 2105.463629 (excluding connections establishing) (I'm surprised there is not more overhead in the additional tracking added to calculate the running average) -- Test 2. Tests a prepared query which will perform a generic plan on the 6th execution then fallback on a custom plan due to it pruning all but one partition. Master suffers from the lock table becoming bloated after locking all partitions when planning the generic plan. 
create table ht (a int primary key, b int, c int) partition by hash (a); select 'create table ht' || x::text || ' partition of ht for values with (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,8191) x; \gexec select.sql: \set p 1 select * from ht where a = :p Master: $ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 10207.780843 (excluding connections establishing) tps = 10205.772688 (excluding connections establishing) tps = 10214.896449 (excluding connections establishing) tps = 10157.572153 (excluding connections establishing) tps = 10147.764477 (excluding connections establishing) Patched: $ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 14677.636570 (excluding connections establishing) tps = 14661.437186 (excluding connections establishing) tps = 14647.202877 (excluding connections establishing) tps = 14784.165753 (excluding connections establishing) tps = 14850.355344 (excluding connections establishing) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services shrink_bloated_locallocktable_v8.patch Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi, On 2019-07-21 21:37:28 +1200, David Rowley wrote: > select.sql: > \set p 1 > select * from ht where a = :p > > Master: > > $ pgbench -n -f select.sql -T 60 -M prepared postgres > tps = 10172.035036 (excluding connections establishing) > tps = 10192.780529 (excluding connections establishing) > tps = 10331.306003 (excluding connections establishing) > > Patched: > > $ pgbench -n -f select.sql -T 60 -M prepared postgres > tps = 15080.765549 (excluding connections establishing) > tps = 14994.404069 (excluding connections establishing) > tps = 14982.923480 (excluding connections establishing) > > That seems fine, 46% faster. > > v6 is attached. > > I plan to push this in a few days unless someone objects. It does seem far less objectionable than the other case. I hate to throw in one more wrench into a topic finally making progress, but: Have either of you considered just replacing the dynahash table with a simplehash style one? Given the obvious speed sensitivity, and the fact that for it (in contrast to the shared lock table) no partitioning is needed, that seems like a good thing to try. It seems quite possible that both the iteration and plain manipulations are going to be faster, due to far less indirections - e.g. the iteration through the array will just be an array walk with a known stride, far easier for the CPU to prefetch. Greetings, Andres Freund
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Wed, 24 Jul 2019 at 15:05, David Rowley wrote: > To be able to reduce the threshold down again we'd need to make a > hash_get_num_entries(LockMethodLocalHash) call before performing the > guts of LockReleaseAll(). We could then weight that onto some running > average counter with a weight of, say... 10, so we react to changes > fairly quickly, but not instantly. We could then have some sort of > logic like "rebuild the hash table if running average 4 times less > than max_bucket" > > I've attached a spreadsheet of that idea and the algorithm we could > use to track the running average. Initially, I've mocked it up a > series of transactions that use 1000 locks, then at row 123 dropped > that to 10 locks. If we assume the max_bucket is 1000, then it takes > until row 136 for the running average to drop below the max_bucket > count, i.e 13 xacts. There we'd reset there at the init size of 16. If > the average went up again, then we'd automatically expand the table as > we do now. To make this work we'd need an additional call to > hash_get_num_entries(), before we release the locks, so there is more > overhead. Here's a patch with this implemented. I've left a NOTICE in there to make it easier for people to follow along at home and see when the lock table is reset. There will be a bit of additional overhead to the reset detection logic over the v7 patch. Namely: additional hash_get_num_entries() call before releasing the locks, and more complex and floating-point maths instead of more simple integer maths in v7. Here's a demo with the debug NOTICE in there to show us what's going on. -- setup create table a (a int) partition by range (a); select 'create table a'||x||' partition of a for values from('||x||') to ('||x+1||');' from generate_Series(1,1000) x; \gexec $ psql postgres NOTICE: max_bucket = 15, threshold = 64.00, running_avg_locks 0.10 Reset? 
No psql (13devel) # \o /dev/null # select * from a where a < 100; NOTICE: max_bucket = 101, threshold = 64.00, running_avg_locks 10.09 Reset? Yes # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 76.324005, running_avg_locks 19.081001 Reset? Yes # select * from a where a < 100; A couple of needless resets there... Maybe we can get rid of those by setting the initial running average up to something higher than 0.0. NOTICE: max_bucket = 99, threshold = 108.691605, running_avg_locks 27.172901 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 137.822449, running_avg_locks 34.455612 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 164.040207, running_avg_locks 41.010052 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 187.636185, running_avg_locks 46.909046 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 208.872559, running_avg_locks 52.218140 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 227.985306, running_avg_locks 56.996326 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 245.186768, running_avg_locks 61.296692 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 260.668091, running_avg_locks 65.167023 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 274.601288, running_avg_locks 68.650322 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 287.141174, running_avg_locks 71.785294 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 298.427063, running_avg_locks 74.606766 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 308.584351, running_avg_locks 77.146088 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 317.725922, running_avg_locks 79.431480 Reset? 
No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 325.953339, running_avg_locks 81.488335 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 333.358002, running_avg_locks 83.339500 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 340.022217, running_avg_locks 85.005554 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 346.019989, running_avg_locks 86.504997 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 351.417999, running_avg_locks 87.854500 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 356.276184, running_avg_locks 89.069046 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 360.648560, running_avg_locks 90.162140 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 364.583710, running_avg_locks 91.145927 Reset? No # select * from a where a < 100; NOTICE: max_bucket = 99, threshold = 368.125336, running_avg_locks 92.031334 Reset? No # select * from a
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Thanks for having a look at this. On Wed, 24 Jul 2019 at 04:13, Tom Lane wrote: > > David Rowley writes: > > I'm pretty happy with v7 now. If anyone has any objections to it, > > please speak up very soon. Otherwise, I plan to push it about this > > time tomorrow. > > I dunno, this seems close to useless in this form. As it stands, > once hash_get_max_bucket has exceeded the threshold, you will > arbitrarily reset the table 1000 transactions later (since the > max bucket is certainly not gonna decrease otherwise). So that's > got two big problems: > > 1. In the assumed-common case where most of the transactions take > few locks, you wasted cycles for 999 transactions. > > 2. You'll reset the table independently of subsequent history, > even if the session's usage is pretty much always over the > threshold. Admittedly, if you do this only once per 1K > transactions, it's probably not a horrible overhead --- but you > can't improve point 1 without making it a bigger overhead. This is true, but I think you might be overestimating just how much effort is wasted with #1. We're only seeing this overhead in small very fast to execute xacts. In the test case in [1], I was getting about 10k TPS unpatched and about 15k patched. This means that, on average, 1 unpatched xact takes 100 microseconds and 1 average patched xact takes 66 microseconds, so the additional time spent doing the hash_seq_search() must be about 34 microseconds. So we'll waste a total of 34 *milliseconds* if we wait for 1000 xacts before we reset the table. With 10k TPS we're going to react to the change in 0.1 seconds. I think you'd struggle to measure that 1 xact is taking 34 microseconds longer without running a few thousand queries. My view is that nobody is ever going to notice that it takes 1k xacts to shrink the table, and I've already shown that the overhead of the shrink every 1k xact is tiny. I mentioned 0.34% in [1] using v6. 
This is likely a bit smaller in v7 due to swapping the order of the if condition to put the less likely case first. Since the overhead of rebuilding the table was 7% when done every xact, it stands to reason that this has become 0.007% doing it every 1k xacts, and that the additional overhead making up that 0.34% is from testing if the reset condition has been met (or noise). That's not something we can remove completely with any solution that resets the hash table. > I did complain about the cost of tracking the stats proposed by > some earlier patches, but I don't think the answer to that is to > not track any stats at all. The proposed hash_get_max_bucket() > function is quite cheap, so I think it wouldn't be out of line to > track the average value of that at transaction end over the > session's lifespan, and reset if the current value is more than > some-multiple of the average. > > The tricky part here is that if some xact kicks that up to > say 64 entries, and we don't choose to reset, then the reading > for subsequent transactions will be 64 even if they use very > few locks. So we probably need to not use a plain average, > but account for that effect somehow. Maybe we could look at > how quickly the number goes up after we reset? > > [ thinks for awhile... ] As a straw-man proposal, I suggest > the following (completely untested) idea: > > * Make the table size threshold variable, not constant. > > * If, at end-of-transaction when the table is empty, > the table bucket count exceeds the threshold, reset > immediately; but if it's been less than K transactions > since the last reset, increase the threshold (by say 10%). > > I think K can be a constant; somewhere between 10 and 100 would > probably work. At process start, we should act as though the last > reset were more than K transactions ago (i.e., don't increase the > threshold at the first reset).
I think the problem with this idea is that once the threshold has been enlarged, there is no way to recover from that and work well again for workloads that require very few locks. If we end up with some large value for the variable threshold, there's no way to decrease it again. All this method stops is the needless hash table resets when the typical case requires many locks. The only way to know if we can reduce the threshold again is to count the locks released during LockReleaseAll(). Some version of the patch did that, and you objected. > The main advantage this has over v7 is that we don't have the > 1000-transaction delay before reset, which ISTM is giving up > much of the benefit of the whole idea. Also, if the session > is consistently using lots of locks, we'll adapt to that after > awhile and not do useless table resets. True, but you neglected to mention the looming and critical drawback, which pretty much makes that idea useless. All we'd need to do is give this a workload that throws that variable threshold up high so that it can't recover. It would be pretty simple then to show that LockReleaseAll() is still slow with
Re: Speed up transaction completion faster after many relations are accessed in a transaction
David Rowley writes: > I've attached v7, which really is v6 with some comments adjusted and > the order of the hash_get_num_entries and hash_get_max_bucket function > calls swapped. I think hash_get_num_entries() will return 0 most of > the time where we're calling it, so it makes sense to put the > condition that's less likely to be true first in the if condition. > I'm pretty happy with v7 now. If anyone has any objections to it, > please speak up very soon. Otherwise, I plan to push it about this > time tomorrow. I dunno, this seems close to useless in this form. As it stands, once hash_get_max_bucket has exceeded the threshold, you will arbitrarily reset the table 1000 transactions later (since the max bucket is certainly not gonna decrease otherwise). So that's got two big problems: 1. In the assumed-common case where most of the transactions take few locks, you wasted cycles for 999 transactions. 2. You'll reset the table independently of subsequent history, even if the session's usage is pretty much always over the threshold. Admittedly, if you do this only once per 1K transactions, it's probably not a horrible overhead --- but you can't improve point 1 without making it a bigger overhead. I did complain about the cost of tracking the stats proposed by some earlier patches, but I don't think the answer to that is to not track any stats at all. The proposed hash_get_max_bucket() function is quite cheap, so I think it wouldn't be out of line to track the average value of that at transaction end over the session's lifespan, and reset if the current value is more than some-multiple of the average. The tricky part here is that if some xact kicks that up to say 64 entries, and we don't choose to reset, then the reading for subsequent transactions will be 64 even if they use very few locks. So we probably need to not use a plain average, but account for that effect somehow. Maybe we could look at how quickly the number goes up after we reset? [ thinks for awhile... 
] As a straw-man proposal, I suggest the following (completely untested) idea: * Make the table size threshold variable, not constant. * If, at end-of-transaction when the table is empty, the table bucket count exceeds the threshold, reset immediately; but if it's been less than K transactions since the last reset, increase the threshold (by say 10%). I think K can be a constant; somewhere between 10 and 100 would probably work. At process start, we should act as though the last reset were more than K transactions ago (i.e., don't increase the threshold at the first reset). The main advantage this has over v7 is that we don't have the 1000-transaction delay before reset, which ISTM is giving up much of the benefit of the whole idea. Also, if the session is consistently using lots of locks, we'll adapt to that after awhile and not do useless table resets. regards, tom lane
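Tom's straw-man heuristic can be sketched roughly as follows. This is only an illustration of the idea with invented names and constants (maybe_reset_table, RESET_COOLDOWN_XACTS, and the concrete values are not from any actual patch), and like the original proposal it is completely untested against PostgreSQL itself:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical constants for illustration only. */
#define INIT_THRESHOLD 16
#define RESET_COOLDOWN_XACTS 50      /* Tom's "K", somewhere between 10 and 100 */

static int shrink_threshold = INIT_THRESHOLD;
/* Act as though the last reset were more than K transactions ago. */
static long xacts_since_reset = RESET_COOLDOWN_XACTS + 1;

/*
 * Called at end of transaction when the local lock table is empty.
 * Returns true if the table should be rebuilt at its initial size.
 */
static bool
maybe_reset_table(int max_bucket)
{
    xacts_since_reset++;
    if (max_bucket <= shrink_threshold)
        return false;
    /* Resetting too frequently: this workload really needs a bigger table. */
    if (xacts_since_reset < RESET_COOLDOWN_XACTS)
        shrink_threshold += shrink_threshold / 10;   /* raise threshold ~10% */
    xacts_since_reset = 0;
    return true;    /* reset immediately */
}
```

Note how the threshold only ever ratchets upward, which is exactly the weakness David points out in his reply.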
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Tue, 23 Jul 2019 at 15:47, Tsunakawa, Takayuki wrote: > OTOH, how about my original patch that is based on the local lock list? I > expect that it won't cause as significant a slowdown in the same test case. If > it's not satisfactory, then v6 is the best to commit. I think we need to move beyond the versions that have been reviewed and rejected. I don't think the reasons for their rejection will have changed. I've attached v7, which really is v6 with some comments adjusted and the order of the hash_get_num_entries and hash_get_max_bucket function calls swapped. I think hash_get_num_entries() will return 0 most of the time where we're calling it, so it makes sense to put the condition that's less likely to be true first in the if condition. I'm pretty happy with v7 now. If anyone has any objections to it, please speak up very soon. Otherwise, I plan to push it about this time tomorrow. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services shrink_bloated_locallocktable_v7.patch Description: Binary data
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > Another counter-argument to this is that there's already an > unexplainable slowdown after you run a query which obtains a large > number of locks in a session or use prepared statements and a > partitioned table with the default plan_cache_mode setting. Are we not > just righting a wrong here? Albeit, possibly 1000 queries later. > > I am, of course, open to other ideas which solve the problem that v5 > has, but failing that, I don't see v6 as all that bad. At least all > the logic is contained in one function. I know Tom wanted to steer > clear of heuristics to reinitialise the table, but most of the stuff > that was in the patch back when he complained was trying to track the > average number of locks over the previous N transactions, and his > concerns were voiced before I showed the 7% performance regression > with unconditionally rebuilding the table. I think I understand what you mean. Sorry, I don't have a better idea. This unexplainability is probably what we should accept to avoid the shocking 7% slowdown. OTOH, how about my original patch that is based on the local lock list? I expect that it won't cause as significant a slowdown in the same test case. If it's not satisfactory, then v6 is the best to commit. Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 22 Jul 2019 at 12:48, Tsunakawa, Takayuki wrote: > > From: David Rowley [mailto:david.row...@2ndquadrant.com] > > I went back to the drawing board on this and I've added some code that > > counts > > the number of times we've seen the table to be oversized and just shrinks > > the table back down on the 1000th time. 6.93% / 1000 is not all that much. > > I'm afraid this kind of hidden behavior would appear mysterious to users. > They may wonder "Why is the same query fast at first in the session (5 or 6 > times of execution), then gets slower for a while, and gets faster again? Is > there something to tune? Am I missing something wrong with my app (e.g. how > to use prepared statements)?" So I prefer v5. Another counter-argument to this is that there's already an unexplainable slowdown after you run a query which obtains a large number of locks in a session or use prepared statements and a partitioned table with the default plan_cache_mode setting. Are we not just righting a wrong here? Albeit, possibly 1000 queries later. I am, of course, open to other ideas which solve the problem that v5 has, but failing that, I don't see v6 as all that bad. At least all the logic is contained in one function. I know Tom wanted to steer clear of heuristics to reinitialise the table, but most of the stuff that was in the patch back when he complained was trying to track the average number of locks over the previous N transactions, and his concerns were voiced before I showed the 7% performance regression with unconditionally rebuilding the table. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 22 Jul 2019 at 16:36, Tsunakawa, Takayuki wrote: > > From: David Rowley [mailto:david.row...@2ndquadrant.com] > > For the use case we've been measuring with partitioned tables and the > > generic plan generation causing a sudden spike in the number of > > obtained locks, then having plan_cache_mode = force_custom_plan will > > cause the lock table not to become bloated. I'm not sure there's > > anything interesting to measure there. > > I meant the difference between the following two cases, where the query only > touches one partition (e.g. SELECT ... WHERE pkey = value): > > * plan_cache_mode=force_custom_plan: LocalLockHash won't bloat. The query > execution time is steady. > > * plan_cache_mode=auto: LocalLockHash bloats on the sixth execution due to > the creation of the generic plan. The generic plan is not adopted because > its cost is high. Later executions of the query will suffer from the bloat > until the 1006th execution when LocalLockHash is shrunk. I measured this again in https://www.postgresql.org/message-id/CAKJS1f_ycJ-6QTKC70pZRYdwsAwUo+t0_CV0eXC=j-b5a2m...@mail.gmail.com where I posted the v6 patch. It's the final results in the email. I didn't measure plan_cache_mode = force_custom_plan. There'd be no lock table bloat in that case and the additional overhead would just be from the hash_get_num_entries() && hash_get_max_bucket() calls, which the first results show next to no overhead for. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > For the use case we've been measuring with partitioned tables and the > generic plan generation causing a sudden spike in the number of > obtained locks, then having plan_cache_mode = force_custom_plan will > cause the lock table not to become bloated. I'm not sure there's > anything interesting to measure there. I meant the difference between the following two cases, where the query only touches one partition (e.g. SELECT ... WHERE pkey = value): * plan_cache_mode=force_custom_plan: LocalLockHash won't bloat. The query execution time is steady. * plan_cache_mode=auto: LocalLockHash bloats on the sixth execution due to the creation of the generic plan. The generic plan is not adopted because its cost is high. Later executions of the query will suffer from the bloat until the 1006th execution when LocalLockHash is shrunk. Depending on the number of transactions and what each transaction does, I thought the difference will be noticeable or not. Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 22 Jul 2019 at 14:21, Tsunakawa, Takayuki wrote: > > From: David Rowley [mailto:david.row...@2ndquadrant.com] > > I personally don't think that's true. The only way you'll notice the > > LockReleaseAll() overhead is to execute very fast queries with a > > bloated lock table. It's pretty hard to notice that a single 0.1ms > > query is slow. You'll need to execute thousands of them before you'll > > be able to measure it, and once you've done that, the lock shrink code > > will have run and the query will be performing optimally again. > > Maybe so. Will the difference be noticeable between plan_cache_mode=auto > (default) and plan_cache_mode=custom? For the use case we've been measuring with partitioned tables and the generic plan generation causing a sudden spike in the number of obtained locks, then having plan_cache_mode = force_custom_plan will cause the lock table not to become bloated. I'm not sure there's anything interesting to measure there. The only additional code that gets executed is the hash_get_num_entries() and possibly hash_get_max_bucket. Maybe it's worth swapping the order of those calls since most of the time the entry will be 0 and the max bucket count threshold won't be exceeded. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > I personally don't think that's true. The only way you'll notice the > LockReleaseAll() overhead is to execute very fast queries with a > bloated lock table. It's pretty hard to notice that a single 0.1ms > query is slow. You'll need to execute thousands of them before you'll > be able to measure it, and once you've done that, the lock shrink code > will have run and the query will be performing optimally again. Maybe so. Will the difference be noticeable between plan_cache_mode=auto (default) and plan_cache_mode=custom? > I voiced my concerns with v5 and I wasn't really willing to push it > with a known performance regression of 7% in a fairly valid case. v6 > does not suffer from that. You're right. We may have to accept the unpredictability this hidden behavior presents to users as a compromise for higher throughput. > > Where else does the extra overhead come from? > > hash_get_num_entries(LockMethodLocalHash) == 0 && > + hash_get_max_bucket(LockMethodLocalHash) > > + LOCKMETHODLOCALHASH_SHRINK_THRESHOLD) > > that's executed every time, not every 1000 times. I see. Thanks. Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 22 Jul 2019 at 12:48, Tsunakawa, Takayuki wrote: > > From: David Rowley [mailto:david.row...@2ndquadrant.com] > > I went back to the drawing board on this and I've added some code that > > counts > > the number of times we've seen the table to be oversized and just shrinks > > the table back down on the 1000th time. 6.93% / 1000 is not all that much. > > I'm afraid this kind of hidden behavior would appear mysterious to users. > They may wonder "Why is the same query fast at first in the session (5 or 6 > times of execution), then gets slower for a while, and gets faster again? Is > there something to tune? Am I missing something wrong with my app (e.g. how > to use prepared statements)?" So I prefer v5. I personally don't think that's true. The only way you'll notice the LockReleaseAll() overhead is to execute very fast queries with a bloated lock table. It's pretty hard to notice that a single 0.1ms query is slow. You'll need to execute thousands of them before you'll be able to measure it, and once you've done that, the lock shrink code will have run and the query will be performing optimally again. I voiced my concerns with v5 and I wasn't really willing to push it with a known performance regression of 7% in a fairly valid case. v6 does not suffer from that. > > Of course, not all the extra overhead might be from rebuilding the table, > > so here's a test with the updated patch. > > Where else does the extra overhead come from? hash_get_num_entries(LockMethodLocalHash) == 0 && + hash_get_max_bucket(LockMethodLocalHash) > + LOCKMETHODLOCALHASH_SHRINK_THRESHOLD) that's executed every time, not every 1000 times. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > I went back to the drawing board on this and I've added some code that counts > the number of times we've seen the table to be oversized and just shrinks > the table back down on the 1000th time. 6.93% / 1000 is not all that much. I'm afraid this kind of hidden behavior would appear mysterious to users. They may wonder "Why is the same query fast at first in the session (5 or 6 times of execution), then gets slower for a while, and gets faster again? Is there something to tune? Am I missing something wrong with my app (e.g. how to use prepared statements)?" So I prefer v5. > Of course, not all the extra overhead might be from rebuilding the table, > so here's a test with the updated patch. Where else does the extra overhead come from? Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Thu, 18 Jul 2019 at 14:53, David Rowley wrote: > Is anyone particularly concerned about the worst-case slowdown here > being about 1.54%? The best case, and arguably a more realistic case > above showed a 34% speedup for the best case. I took a bit more time to test the performance on this. I thought I might have been a bit unfair on the patch by giving it completely empty tables to look at. It just seems too unrealistic to have a large number of completely empty partitions. I decided to come up with a more realistic case where there are 140 partitions but we're performing an index scan to find a record that can exist in only 1 of the 140 partitions. create table hp (a int, b int) partition by hash(a); select 'create table hp'||x||' partition of hp for values with (modulus 140, remainder ' || x || ');' from generate_series(0,139)x; create index on hp (b); insert into hp select x,x from generate_series(1, 14) x; analyze hp; select3.sql: select * from hp where b = 1 Master: $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2124.591367 (excluding connections establishing) tps = 2158.754837 (excluding connections establishing) tps = 2146.348465 (excluding connections establishing) tps = 2148.995800 (excluding connections establishing) tps = 2154.223600 (excluding connections establishing) Patched: $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2002.480729 (excluding connections establishing) tps = 1997.272944 (excluding connections establishing) tps = 1992.527079 (excluding connections establishing) tps = 1995.789367 (excluding connections establishing) tps = 2001.195760 (excluding connections establishing) so it turned out it's even slower, and not by a small amount either! Patched is 6.93% slower on average with this case :-( I went back to the drawing board on this and I've added some code that counts the number of times we've seen the table to be oversized and just shrinks the table back down on the 1000th time. 
6.93% / 1000 is not all that much. Of course, not all the extra overhead might be from rebuilding the table, so here's a test with the updated patch. $ pgbench -n -f select3.sql -T 60 -M prepared postgres tps = 2142.414323 (excluding connections establishing) tps = 2139.666899 (excluding connections establishing) tps = 2138.744789 (excluding connections establishing) tps = 2138.300299 (excluding connections establishing) tps = 2137.346278 (excluding connections establishing) Just a 0.34% drop. Pretty hard to pick that out the noise. Testing the original case that shows the speedup: create table ht (a int primary key, b int, c int) partition by hash (a); select 'create table ht' || x::text || ' partition of ht for values with (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,8191) x; \gexec select.sql: \set p 1 select * from ht where a = :p Master: $ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 10172.035036 (excluding connections establishing) tps = 10192.780529 (excluding connections establishing) tps = 10331.306003 (excluding connections establishing) Patched: $ pgbench -n -f select.sql -T 60 -M prepared postgres tps = 15080.765549 (excluding connections establishing) tps = 14994.404069 (excluding connections establishing) tps = 14982.923480 (excluding connections establishing) That seems fine, 46% faster. v6 is attached. I plan to push this in a few days unless someone objects. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services shrink_bloated_locallocktable_v6.patch Description: Binary data
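The v6 idea described above, counting how many transactions have seen the table oversized and only shrinking on the 1000th, can be sketched like this. The names, constants, and structure here are illustrative only, not the actual patch code:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-ins for the patch's constants. */
#define SHRINK_THRESHOLD 64
#define OVERSIZED_XACTS_BEFORE_SHRINK 1000

static int times_seen_oversized = 0;

/*
 * Called from end-of-transaction lock release.  Returns true when the
 * (now empty) local lock hash table should be rebuilt at its initial
 * size, which happens only on the 1000th oversized occurrence so the
 * rebuild cost is amortized over many transactions.
 */
static bool
should_shrink(long num_entries, int max_bucket)
{
    if (num_entries != 0 || max_bucket <= SHRINK_THRESHOLD)
        return false;
    if (++times_seen_oversized < OVERSIZED_XACTS_BEFORE_SHRINK)
        return false;
    times_seen_oversized = 0;
    return true;
}
```

The cheap `num_entries != 0` check is evaluated first on every call, which is why David later swapped the order of the two conditions for v7.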
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Thu, 27 Jun 2019 at 12:59, Tsunakawa, Takayuki wrote: > > From: David Rowley [mailto:david.row...@2ndquadrant.com] > Thank you, looks good. I find it ready for committer (I noticed the status > is already set so.) Thanks for looking. I've just been looking at this again and I thought I'd better check the performance of the worst case for the patch, where the hash table is rebuilt each query. To do this I first created a single column 70 partition partitioned table ("p") and left it empty. I then checked the performance of: SELECT * FROM p; Having 70 partitions means that the lock table's max bucket goes over the LOCKMETHODLOCALHASH_SHRINK_THRESHOLD which is set to 64 and results in the table being rebuilt each time the query is run. The performance was as follows: 70 partitions: LOCKMETHODLOCALHASH_SHRINK_THRESHOLD = 64 master + shrink_bloated_locallocktable_v5.patch: ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres tps = 8427.053378 (excluding connections establishing) tps = 8583.251821 (excluding connections establishing) tps = 8569.587268 (excluding connections establishing) tps = 8552.988483 (excluding connections establishing) tps = 8527.735108 (excluding connections establishing) master (93907478): ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres tps = 8712.919411 (excluding connections establishing) tps = 8760.190372 (excluding connections establishing) tps = 8755.069470 (excluding connections establishing) tps = 8747.389735 (excluding connections establishing) tps = 8758.275202 (excluding connections establishing) patched is 2.45% slower If I increase the partition count to 140 and put the LOCKMETHODLOCALHASH_SHRINK_THRESHOLD up to 128, then the performance is as follows: master + shrink_bloated_locallocktable_v5.patch: ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres tps = 2548.917856 (excluding connections establishing) tps = 2561.283564 (excluding connections 
establishing) tps = 2549.669870 (excluding connections establishing) tps = 2421.971864 (excluding connections establishing) tps = 2428.983660 (excluding connections establishing) Master (93907478): ubuntu@ip-10-0-0-201:~$ pgbench -n -T 60 -f select1.sql -M prepared postgres tps = 2605.407529 (excluding connections establishing) tps = 2600.691426 (excluding connections establishing) tps = 2594.123983 (excluding connections establishing) tps = 2455.745644 (excluding connections establishing) tps = 2450.061483 (excluding connections establishing) patched is 1.54% slower I'd rather not put the LOCKMETHODLOCALHASH_SHRINK_THRESHOLD up any higher than 128 since it can detract from the improvement we're trying to make with this patch. Now, this case of querying a partitioned table that happens to be completely empty seems a bit unrealistic. Something more realistic might be index scanning all partitions to find a value that only exists in a single partition. Assuming the partitions actually have some records, then that's going to be a more expensive query, so the overhead of rebuilding the table will be less noticeable. A previous version of the patch has already had some heuristics to try to only rebuild the hash table when it's likely beneficial. I'd rather not go exploring in that area again. Is anyone particularly concerned about the worst-case slowdown here being about 1.54%? The best case, and arguably a more realistic case above showed a 34% speedup for the best case. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] v5 is attached. Thank you, looks good. I find it ready for committer (I noticed the status is already set so.) Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 17 Jun 2019 at 15:05, Tsunakawa, Takayuki wrote: > (1) > +#define LOCKMETHODLOCALHASH_SHRINK_SIZE 64 > > How about LOCKMETHODLOCALHASH_SHRINK_THRESHOLD, because this determines the > threshold value to trigger shrinkage? Code in PostgreSQL seems to use the > term threshold. That's probably better. I've renamed it to that. > (2) > +/* Complain if the above are not set to something sane */ > +#if LOCKMETHODLOCALHASH_SHRINK_SIZE < LOCKMETHODLOCALHASH_INIT_SIZE > +#error "invalid LOCKMETHODLOCALHASH_SHRINK_SIZE" > +#endif > > I don't think these are necessary, because these are fixed and not > configurable. FYI, src/include/utils/memutils.h doesn't have #error to test > these macros. Yeah. I was thinking it was overkill when I wrote it, but somehow couldn't bring myself to remove it. Done now. > (3) > + if (hash_get_num_entries(LockMethodLocalHash) == 0 && > + hash_get_max_bucket(LockMethodLocalHash) > > + LOCKMETHODLOCALHASH_SHRINK_SIZE) > + CreateLocalLockHash(); > > I get an impression that Create just creates something where there's nothing. How about Init or Recreate? Renamed to InitLocalLockHash(). v5 is attached. Thank you for the review. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services shrink_bloated_locallocktable_v5.patch Description: Binary data
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > I've revised the patch to add a new constant named > LOCKMETHODLOCALHASH_SHRINK_SIZE. I've set this to 64 for now. Once the hash Thank you, and good performance. The patch passed make check. I'm OK with the current patch, but I have a few comments. Please take them as you see fit (I wouldn't mind if you don't.) (1) +#define LOCKMETHODLOCALHASH_SHRINK_SIZE 64 How about LOCKMETHODLOCALHASH_SHRINK_THRESHOLD, because this determines the threshold value to trigger shrinkage? Code in PostgreSQL seems to use the term threshold. (2) +/* Complain if the above are not set to something sane */ +#if LOCKMETHODLOCALHASH_SHRINK_SIZE < LOCKMETHODLOCALHASH_INIT_SIZE +#error "invalid LOCKMETHODLOCALHASH_SHRINK_SIZE" +#endif I don't think these are necessary, because these are fixed and not configurable. FYI, src/include/utils/memutils.h doesn't have #error to test these macros. #define ALLOCSET_DEFAULT_MINSIZE 0 #define ALLOCSET_DEFAULT_INITSIZE (8 * 1024) #define ALLOCSET_DEFAULT_MAXSIZE (8 * 1024 * 1024) (3) + if (hash_get_num_entries(LockMethodLocalHash) == 0 && + hash_get_max_bucket(LockMethodLocalHash) > + LOCKMETHODLOCALHASH_SHRINK_SIZE) + CreateLocalLockHash(); I get an impression that Create just creates something where there's nothing. How about Init or Recreate? Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 8 Apr 2019 at 04:09, Tom Lane wrote: > Also, I would not define "significantly bloated" as "the table has > grown at all". I think the threshold ought to be at least ~100 > buckets, if we're starting at 16. I've revised the patch to add a new constant named LOCKMETHODLOCALHASH_SHRINK_SIZE. I've set this to 64 for now. Once the hash table grows over that size we shrink it back down to LOCKMETHODLOCALHASH_INIT_SIZE, which I've kept at 16. I'm not opposed to setting it to 128. For this particular benchmark, it won't make any difference as it's only going to affect something that does not quite use 128 locks and has to work with a slightly bloated local lock table. I think hitting 64 locks in a transaction is a good indication that it's not a simple transaction so users are probably unlikely to notice the small slowdown from the hash table reinitialisation. Since quite a bit has changed around partition planning lately, I've taken a fresh set of benchmarks on today's master. I'm using something very close to Amit's benchmark from upthread. I just changed the query so we hit the same partition each time instead of a random one. 
create table ht (a int primary key, b int, c int) partition by hash (a); select 'create table ht' || x::text || ' partition of ht for values with (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0,8191) x; \gexec select.sql: \set p 1 select * from ht where a = :p master @ a193cbec119 + shrink_bloated_locallocktable_v4.patch: plan_cache_mode = 'auto'; ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres tps = 14101.226982 (excluding connections establishing) tps = 14034.250962 (excluding connections establishing) tps = 14107.937755 (excluding connections establishing) plan_cache_mode = 'force_custom_plan'; ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres tps = 14240.366770 (excluding connections establishing) tps = 14272.244886 (excluding connections establishing) tps = 14130.684315 (excluding connections establishing) master @ a193cbec119: plan_cache_mode = 'auto'; ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres tps = 10467.027666 (excluding connections establishing) tps = 10333.700917 (excluding connections establishing) tps = 10633.084426 (excluding connections establishing) plan_cache_mode = 'force_custom_plan'; ubuntu@ip-10-0-0-201:~$ pgbench -n -M prepared -T 60 -f select.sql postgres tps = 13938.083272 (excluding connections establishing) tps = 14143.241802 (excluding connections establishing) tps = 14097.406758 (excluding connections establishing) -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services shrink_bloated_locallocktable_v4.patch Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 2019-04-08 05:46, Tsunakawa, Takayuki wrote: > I'm registering you as another author and me as a reviewer, and marking this > ready for committer. Moved to next commit fest. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > It would be good to get your view on the > shrink_bloated_locallocktable_v3.patch I worked on last night. I was > unable to measure any overhead to solving the problem that way. Thanks, it looks super simple and good. I understand the idea behind your patch to be: * Transactions that touch many partitions and/or tables are a special event and not the norm, and the hash table bloat is an unlucky accident. So it's reasonable to revert the bloated hash table back to the original size. * Repeated transactions that acquire many locks have to enlarge the hash table every time. However, the overhead of hash table expansion should be hidden behind the various other processing (acquiring/releasing locks, reading/writing the relations, accessing the catalogs of those relations). TBH, I think the linked list approach feels more intuitive because the resulting code looks like what it wants to do (efficiently iterate over acquired locks) and is based on the classic book. But your approach seems to put people's concerns to rest. So I'm OK with your patch. I'm registering you as another author and me as a reviewer, and marking this ready for committer. Regards Takayuki Tsunakawa
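For comparison, the "linked list of LOCALLOCK" idea mentioned above boils down to chaining every acquired local lock together so that releasing all locks walks only the locks actually held, not every bucket of a possibly bloated hash table. A minimal standalone sketch (the types and names here are invented for illustration, not PostgreSQL's):

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for a LOCALLOCK entry, with intrusive list links. */
typedef struct LocalLock
{
    int lockid;
    struct LocalLock *prev;
    struct LocalLock *next;
} LocalLock;

static LocalLock *held_locks = NULL;    /* head of the chain of held locks */

/* Push a newly acquired lock onto the chain (called at lock acquisition). */
static void
remember_lock(LocalLock *l)
{
    l->prev = NULL;
    l->next = held_locks;
    if (held_locks)
        held_locks->prev = l;
    held_locks = l;
}

/*
 * Release everything: walk the chain, not the hash table.  Real code would
 * release the lock here; we just unlink and count.  The cost is proportional
 * to the number of locks held, regardless of how large the table has grown.
 */
static int
release_all(void)
{
    int n = 0;
    while (held_locks)
    {
        held_locks = held_locks->next;
        n++;
    }
    return n;
}
```

This is the property the earlier rejected versions aimed for; the cost measured in this thread was the bookkeeping added to every lock acquisition.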
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 8 Apr 2019 at 14:54, Tsunakawa, Takayuki wrote: > > From: 'Andres Freund' [mailto:and...@anarazel.de] > > Did you see that people measured slowdowns? > > Yeah, 0.5% decrease with pgbench -M prepared -S (select-only), which feels > like a somewhat extreme test case. And that might be within noise as was > mentioned. > > If we want to remove even the noise, we may have to think of removing the > LocalLockHash completely. But it doesn't seem feasible... It would be good to get your view on the shrink_bloated_locallocktable_v3.patch I worked on last night. I was unable to measure any overhead to solving the problem that way. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: 'Andres Freund' [mailto:and...@anarazel.de] > On 2019-04-08 02:28:12 +, Tsunakawa, Takayuki wrote: > > I think the linked list of LOCALLOCK approach is natural, simple, and > > good. > > Did you see that people measured slowdowns? Yeah, 0.5% decrease with pgbench -M prepared -S (select-only), which feels like a somewhat extreme test case. And that might be within noise as was mentioned. If we want to remove even the noise, we may have to think of removing the LocalLockHash completely. But it doesn't seem feasible... Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi,

On 2019-04-08 02:28:12 +0000, Tsunakawa, Takayuki wrote:
> I think the linked list of LOCALLOCK approach is natural, simple, and
> good.

Did you see that people measured slowdowns?

Greetings, Andres Freund
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> On the whole I don't think there's an adequate case for committing
> this patch.

From: Andres Freund [mailto:and...@anarazel.de]
> On 2019-04-05 23:03:11 -0400, Tom Lane wrote:
> > If I reduce the number of partitions in Amit's example from 8192
> > to something more real-world, like 128, I do still measure a
> > performance gain, but it's ~ 1.5% which is below what I'd consider
> > a reproducible win. I'm accustomed to seeing changes up to 2%
> > in narrow benchmarks like this one, even when "nothing changes"
> > except unrelated code.
>
> I'm not sure it's actually that narrow these days. With all the
> partitioning improvements happening, the numbers of locks commonly held
> are going to rise. And while 8192 partitions is maybe on the more
> extreme side, it's a workload with only a single table, and plenty of
> workloads touch more than a single partitioned table.

I would feel happy if I could say such a many-partitions use case is narrow or impractical and ignore it, but it's not narrow. Two of our customers are actually requesting such usage: one uses 5,500 partitions and is trying to migrate from a commercial database on Linux, and the other requires 200,000 partitions to migrate from a legacy database on a mainframe. At first, I thought that many partitions indicated a bad application design, but their reasoning sounded valid (or at least I can't insist that it's bad). PostgreSQL is now expected to handle such huge workloads.

From: Andres Freund [mailto:and...@anarazel.de]
> I'm not sure I'm quite that concerned. For one, a good bit of that space
> was up for grabs until the recent reordering of LOCALLOCK and nobody
> complained. But more importantly, I think commonly the amount of locks
> around is fairly constrained, isn't it? We can't really have that many
> concurrently held locks, due to the shared memory space, and the size of
> a LOCALLOCK isn't that big compared to say relcache entries.
> We also probably fairly easily could win some space back - e.g. make
> nLocks 32 bits.

+1

From: Tom Lane [mailto:t...@sss.pgh.pa.us]
> I'd also point out that this is hardly the only place where we've
> seen hash_seq_search on nearly-empty hash tables become a bottleneck.
> So I'm not thrilled about attacking that with one-table-at-time patches.
> I'd rather see us do something to let hash_seq_search win across
> the board.
>
> I spent some time wondering whether we could adjust the data structure
> so that all the live entries in a hashtable are linked into one chain,
> but I don't quite see how to do it without adding another list link to
> struct HASHELEMENT, which seems pretty expensive.

I think the linked list of LOCALLOCK approach is natural, simple, and good. In Jim Gray's classic book "Transaction Processing: Concepts and Techniques", we can find the following sentence in "8.4.5 Lock Manager Internal Logic." The sample implementation code in the book uses a similar linked list to remember and release a transaction's acquired locks.

"All the locks of a transaction are kept in a list so they can be quickly found and released at commit or rollback."

And handling this issue with the LOCALLOCK linked list is more natural than resizing the hash table. We just want to quickly find all grabbed locks, so we use a linked list. A hash table is a mechanism for finding a particular item quickly, so it was simply the wrong structure for iterating over all grabbed locks. Also, the hash table got big because some operation in the session needed it, and subsequent operations in the same session may need it again, so shrinking the hash table wouldn't necessarily save us.

Regards
Takayuki Tsunakawa
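Gray's per-transaction lock list can be sketched in self-contained C. This is an illustration of the data-structure idea only — the names here (LocalLock, acquire_lock, held_locks, etc.) are invented for the sketch and are not PostgreSQL's LOCALLOCK code:

```c
#include <stdlib.h>

/* Hypothetical stand-in for a local lock entry; not PostgreSQL's struct. */
typedef struct LocalLock
{
    int               lock_id;   /* identifies the locked object */
    struct LocalLock *prev;      /* links into the transaction's lock list */
    struct LocalLock *next;
} LocalLock;

/* Head of the doubly-linked list of locks held by the current transaction. */
static LocalLock *held_locks = NULL;

/* Remember a newly acquired lock by pushing it onto the list. */
static LocalLock *
acquire_lock(int lock_id)
{
    LocalLock *l = malloc(sizeof(LocalLock));

    l->lock_id = lock_id;
    l->prev = NULL;
    l->next = held_locks;
    if (held_locks)
        held_locks->prev = l;
    held_locks = l;
    return l;
}

/* Retail release: unlink one lock in O(1), no table scan needed. */
static void
release_lock(LocalLock *l)
{
    if (l->prev)
        l->prev->next = l->next;
    else
        held_locks = l->next;
    if (l->next)
        l->next->prev = l->prev;
    free(l);
}

/* At commit/abort: walk only the locks actually held; returns the count. */
static int
release_all_locks(void)
{
    int n = 0;

    while (held_locks)
    {
        release_lock(held_locks);
        n++;
    }
    return n;
}
```

Retail release is O(1) unlinking, and commit/abort walks only the locks actually held, never an oversized bucket array.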
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 8 Apr 2019 at 04:09, Tom Lane wrote:
> Um ... I don't see where you're destroying the old hash?

In CreateLocalLockHash.

> Also, I entirely dislike wiring in assumptions about hash_seq_search's
> private state structure here. I think it's worth having an explicit
> entry point in dynahash.c to get the current number of buckets.

Okay. Added hash_get_max_bucket().

> Also, I would not define "significantly bloated" as "the table has
> grown at all". I think the threshold ought to be at least ~100
> buckets, if we're starting at 16.

I wouldn't either. I don't think the comment says that. It says there can be slowdowns when it's significantly bloated, and then goes on to say that we just resize when it's bigger than standard.

> Probably we ought to try to gather some evidence to inform the
> choice of cutoff here. Maybe instrument the regression tests to
> see how big the table typically gets?

In partition_prune.sql I see use of a bucket as high as 285 on my machine with:

drop table lp, coll_pruning, rlp, mc3p, mc2p, boolpart, rp, coll_pruning_multi, like_op_noprune, lparted_by_int2, rparted_by_int2;

I've not added any sort of cut-off, though, as I benchmarked it and, surprisingly, I don't see any slowdown in the worst case. So I'm thinking there might not be any point.
alter system set plan_cache_mode = 'force_generic_plan';

create table hp (a int primary key) partition by hash (a);
select 'create table hp' || x::text || ' partition of hp for values with (modulus 32, remainder ' || (x)::text || ');' from generate_series(0,31) x;
\gexec

select.sql:
\set p 1
select * from hp where a = :p

Master:
$ pgbench -n -M prepared -f select.sql -T 60 postgres
tps = 11834.764309 (excluding connections establishing)
tps = 12279.212223 (excluding connections establishing)
tps = 12007.263547 (excluding connections establishing)

Patched:
$ pgbench -n -M prepared -f select.sql -T 60 postgres
tps = 13380.684817 (excluding connections establishing)
tps = 12790.999279 (excluding connections establishing)
tps = 12568.892558 (excluding connections establishing)

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

shrink_bloated_locallocktable_v3.patch Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
David Rowley writes: > Okay. Here's another version with all the average locks code removed > that only recreates the table when it's completely empty. Um ... I don't see where you're destroying the old hash? Also, I entirely dislike wiring in assumptions about hash_seq_search's private state structure here. I think it's worth having an explicit entry point in dynahash.c to get the current number of buckets. Also, I would not define "significantly bloated" as "the table has grown at all". I think the threshold ought to be at least ~100 buckets, if we're starting at 16. Probably we ought to try to gather some evidence to inform the choice of cutoff here. Maybe instrument the regression tests to see how big the table typically gets? regards, tom lane
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 8 Apr 2019 at 03:47, Andres Freund wrote: > Could you benchmark your adversarial case? Which case? I imagine the worst case for v2 is a query that just constantly asks for over 16 locks. Perhaps a prepared plan, so not to add planner overhead. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi, On 2019-04-08 03:40:52 +1200, David Rowley wrote: > On Mon, 8 Apr 2019 at 03:20, Tom Lane wrote: > > > > David Rowley writes: > > > The reason I thought it was a good idea to track some history there > > > was to stop the lock table constantly being shrunk back to the default > > > size every time a simple single table query was executed. > > > > I think that's probably gilding the lily, considering that this whole > > issue is pretty new. There's no evidence that expanding the local > > lock table is a significant drag on queries that need a lot of locks. > > Okay. Here's another version with all the average locks code removed > that only recreates the table when it's completely empty. Could you benchmark your adversarial case? - Andres
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 8 Apr 2019 at 03:20, Tom Lane wrote: > > David Rowley writes: > > The reason I thought it was a good idea to track some history there > > was to stop the lock table constantly being shrunk back to the default > > size every time a simple single table query was executed. > > I think that's probably gilding the lily, considering that this whole > issue is pretty new. There's no evidence that expanding the local > lock table is a significant drag on queries that need a lot of locks. Okay. Here's another version with all the average locks code removed that only recreates the table when it's completely empty. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services shrink_bloated_locallocktable_v2.patch Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
David Rowley writes: > On Mon, 8 Apr 2019 at 02:59, Tom Lane wrote: >> We *should* be using hash_get_num_entries(), but only to verify >> that the table is empty before resetting it. The additional bit >> that is needed is to see whether the number of buckets is large >> enough to justify calling the table bloated. > The reason I thought it was a good idea to track some history there > was to stop the lock table constantly being shrunk back to the default > size every time a simple single table query was executed. I think that's probably gilding the lily, considering that this whole issue is pretty new. There's no evidence that expanding the local lock table is a significant drag on queries that need a lot of locks. regards, tom lane
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 8 Apr 2019 at 02:59, Tom Lane wrote:
> David Rowley writes:
> > hash_get_num_entries() looks cheap enough to me. Can you explain why
> > you think that's too expensive?
>
> What I objected to cost-wise was counting the number of lock
> acquisitions/releases, which seems entirely beside the point.
>
> We *should* be using hash_get_num_entries(), but only to verify
> that the table is empty before resetting it. The additional bit
> that is needed is to see whether the number of buckets is large
> enough to justify calling the table bloated.

The reason I thought it was a good idea to track some history there was to stop the lock table constantly being shrunk back to the default size every time a simple single-table query was executed. For example, a workload repeatedly doing:

SELECT * FROM table_with_lots_of_partitions;
SELECT * FROM non_partitioned_table;

I was worried that obtaining locks on the partitioned table would become a little slower because it would have to expand the hash table each time the query is executed.

> > As cheap as possible sounds good, but I'm confused at why you think
> > the table will always be empty at the end of transaction.
>
> It's conceivable that it won't be, which is why we need a test.
> I'm simply arguing that if it is not, we can just postpone de-bloating
> till it is. Session-level locks are so rarely used that there's no
> need to sweat about that case.

That seems fair. It would certainly simplify the patch.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
David Rowley writes: > On Mon, 8 Apr 2019 at 02:20, Tom Lane wrote: >> I like the concept ... but the particular implementation, not so much. >> It seems way overcomplicated. In the first place, why should we >> add code to copy entries? Just don't do it except when the table >> is empty. In the second, I think we could probably have a far >> cheaper test for how big the table is --- maybe we'd need to >> expose some function in dynahash.c, but the right way here is just >> to see how many buckets there are. I don't like adding statistics >> counting for this, because it's got basically nothing to do with >> what the actual problem is. (If you acquire and release one lock, >> and do that over and over, you don't have a bloat problem no >> matter how many times you do it.) > hash_get_num_entries() looks cheap enough to me. Can you explain why > you think that's too expensive? What I objected to cost-wise was counting the number of lock acquisitions/releases, which seems entirely beside the point. We *should* be using hash_get_num_entries(), but only to verify that the table is empty before resetting it. The additional bit that is needed is to see whether the number of buckets is large enough to justify calling the table bloated. > As cheap as possible sounds good, but I'm confused at why you think > the table will always be empty at the end of transaction. It's conceivable that it won't be, which is why we need a test. I'm simply arguing that if it is not, we can just postpone de-bloating till it is. Session-level locks are so rarely used that there's no need to sweat about that case. regards, tom lane
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 8 Apr 2019 at 02:36, David Rowley wrote:
> > LockMethodLocalHash is special in that it predictably goes to empty
> > at the end of every transaction, so that de-bloating at that point
> > is a workable strategy. I think we'd probably need something more
> > robust if we were trying to fix this generally for all hash tables.
> > But if we're going to go with the one-off hack approach, we should
> > certainly try to keep that hack as simple as possible.
>
> As cheap as possible sounds good, but I'm confused at why you think
> the table will always be empty at the end of transaction. It's my
> understanding and I see from debugging that session level locks remain
> in there. If I don't copy those into the new table they'll be lost.

Or we could just skip the table recreation if any session-level locks remain. That would require calling hash_get_num_entries() on the table again and recreating the table only if there are 0 locks.

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 8 Apr 2019 at 02:20, Tom Lane wrote: > I like the concept ... but the particular implementation, not so much. > It seems way overcomplicated. In the first place, why should we > add code to copy entries? Just don't do it except when the table > is empty. In the second, I think we could probably have a far > cheaper test for how big the table is --- maybe we'd need to > expose some function in dynahash.c, but the right way here is just > to see how many buckets there are. I don't like adding statistics > counting for this, because it's got basically nothing to do with > what the actual problem is. (If you acquire and release one lock, > and do that over and over, you don't have a bloat problem no > matter how many times you do it.) hash_get_num_entries() looks cheap enough to me. Can you explain why you think that's too expensive? > LockMethodLocalHash is special in that it predictably goes to empty > at the end of every transaction, so that de-bloating at that point > is a workable strategy. I think we'd probably need something more > robust if we were trying to fix this generally for all hash tables. > But if we're going to go with the one-off hack approach, we should > certainly try to keep that hack as simple as possible. As cheap as possible sounds good, but I'm confused at why you think the table will always be empty at the end of transaction. It's my understanding and I see from debugging that session level locks remain in there. If I don't copy those into the new table they'll be lost. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
David Rowley writes: > On Sat, 6 Apr 2019 at 16:03, Tom Lane wrote: >> My own thought about how to improve this situation was just to destroy >> and recreate LockMethodLocalHash at transaction end (or start) >> if its size exceeded $some-value. Leaving it permanently bloated seems >> like possibly a bad idea, even if we get rid of all the hash_seq_searches >> on it. > Which I thought was an okay idea. I think the one advantage that > would have over making hash_seq_search() faster for large and mostly > empty tables is that over-sized hash tables are just not very cache > efficient, and if we don't need it to be that large then we should > probably consider making it smaller again. > I've had a go at implementing this and using Amit's benchmark the > performance looks pretty good. I can't detect any slowdown for the > general case. I like the concept ... but the particular implementation, not so much. It seems way overcomplicated. In the first place, why should we add code to copy entries? Just don't do it except when the table is empty. In the second, I think we could probably have a far cheaper test for how big the table is --- maybe we'd need to expose some function in dynahash.c, but the right way here is just to see how many buckets there are. I don't like adding statistics counting for this, because it's got basically nothing to do with what the actual problem is. (If you acquire and release one lock, and do that over and over, you don't have a bloat problem no matter how many times you do it.) LockMethodLocalHash is special in that it predictably goes to empty at the end of every transaction, so that de-bloating at that point is a workable strategy. I think we'd probably need something more robust if we were trying to fix this generally for all hash tables. But if we're going to go with the one-off hack approach, we should certainly try to keep that hack as simple as possible. regards, tom lane
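The strategy Tom describes — de-bloat only when the table has predictably gone empty at transaction end — can be modeled with a self-contained toy that tracks just the sizing state. The fixed doubling rule and all names here are simplifications for illustration, not dynahash's actual code:

```c
#include <stdlib.h>

#define INIT_BUCKETS 16

/* Hypothetical local-lock table; tracks only the sizing state we care about. */
typedef struct LockTable
{
    long nbuckets;   /* current bucket count; grows by doubling */
    long nentries;   /* live entries */
} LockTable;

static LockTable *
create_table(void)
{
    LockTable *t = malloc(sizeof(LockTable));

    t->nbuckets = INIT_BUCKETS;
    t->nentries = 0;
    return t;
}

/* Double the bucket count when the fill factor is exceeded. */
static void
insert_entry(LockTable *t)
{
    if (++t->nentries > t->nbuckets)
        t->nbuckets *= 2;
}

static void
remove_entry(LockTable *t)
{
    t->nentries--;          /* the bucket array is never shrunk in place */
}

/*
 * At transaction end: if the table is empty but has grown past its initial
 * size, throw it away and start over at the default size.  If any entries
 * remain (e.g. session-level locks), just postpone de-bloating.
 */
static LockTable *
debloat_if_empty(LockTable *t)
{
    if (t->nentries == 0 && t->nbuckets > INIT_BUCKETS)
    {
        free(t);
        return create_table();
    }
    return t;
}
```

A cutoff such as Tom's suggested ~100 buckets would simply replace the `> INIT_BUCKETS` test in `debloat_if_empty`.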
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Sat, 6 Apr 2019 at 16:03, Tom Lane wrote: > I'd also point out that this is hardly the only place where we've > seen hash_seq_search on nearly-empty hash tables become a bottleneck. > So I'm not thrilled about attacking that with one-table-at-time patches. > I'd rather see us do something to let hash_seq_search win across > the board. Rewinding back to mid-Feb: You wrote: > My own thought about how to improve this situation was just to destroy > and recreate LockMethodLocalHash at transaction end (or start) > if its size exceeded $some-value. Leaving it permanently bloated seems > like possibly a bad idea, even if we get rid of all the hash_seq_searches > on it. Which I thought was an okay idea. I think the one advantage that would have over making hash_seq_search() faster for large and mostly empty tables is that over-sized hash tables are just not very cache efficient, and if we don't need it to be that large then we should probably consider making it smaller again. I've had a go at implementing this and using Amit's benchmark the performance looks pretty good. I can't detect any slowdown for the general case. 
master:

plan_cache_mode = auto:
$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 9373.698212 (excluding connections establishing)
tps = 9356.993148 (excluding connections establishing)
tps = 9367.579806 (excluding connections establishing)

plan_cache_mode = force_custom_plan:
$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 12863.758185 (excluding connections establishing)
tps = 12787.766054 (excluding connections establishing)
tps = 12817.878940 (excluding connections establishing)

shrink_bloated_locallocktable.patch:

plan_cache_mode = auto:
$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 12756.021211 (excluding connections establishing)
tps = 12800.939518 (excluding connections establishing)
tps = 12804.501977 (excluding connections establishing)

plan_cache_mode = force_custom_plan:
$ pgbench -n -M prepared -T 60 -f select.sql postgres
tps = 12763.448836 (excluding connections establishing)
tps = 12901.673271 (excluding connections establishing)
tps = 12856.512745 (excluding connections establishing)

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services

shrink_bloated_locallocktable.patch Description: Binary data
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 2019-04-06 05:03, Tom Lane wrote: > Trying a standard pgbench test case (pgbench -M prepared -S with > one client and an -s 10 database), it seems that the patch is about > 0.5% slower than HEAD. Again, that's below the noise threshold, > but it's not promising for the net effects of this patch on workloads > that aren't specifically about large and prunable partition sets. In my testing, I've also noticed that it seems to be slightly on the slower side for these simple tests. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Andres Freund writes:
> I wonder if one approach to solve this wouldn't be to just make the
> hashtable drastically smaller. Right now we'll often have lots of
> empty entries that are 72 bytes + dynahash overhead. That's plenty of
> memory that needs to be skipped over. I wonder if we instead should
> have an array of held locallocks, and a hashtable with {hashcode,
> offset_in_array} + custom comparator for lookups.

Well, that's not going to work all that well for retail lock releases; you'll end up with holes in the array, maybe a lot of them.

However, it led me to think of another way we might approach the general hashtable problem: right now, we are not making any use of the fact that the hashtable's entries are laid out in big slabs (arrays). What if we tried to ensure that the live entries are allocated fairly compactly in those arrays, and then implemented hash_seq_search as a scan over the arrays, ignoring the hash bucket structure per se? We'd need a way to reliably tell a live entry from a free entry, but again, there's plenty of space for a flag bit or two.

This might perform poorly if you allocated a bunch of entries, freed most-but-not-all, and then wanted to seqscan the remainder; you'd end up with the same problem I complained of above, that you're iterating over an array that's mostly uninteresting. In principle we could keep count of the live vs. free entries and dynamically decide to scan via the hash bucket structure instead of searching the storage array when the array is too sparse; but that might be overly complicated.

I haven't tried to work this out in detail; it's just a late-night brainstorm. But, again, I'd much rather solve this in dynahash.c than by layering some kind of hack on top of it.

regards, tom lane
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi,

On 2019-04-05 23:03:11 -0400, Tom Lane wrote:
> Peter Eisentraut writes:
> > I can't detect any performance improvement with the patch applied to
> > current master, using the test case from Yoshikazu Imai (2019-03-19).
>
> FWIW, I tried this patch against current HEAD (959d00e9d).
> Using the test case described by Amit at
>
> I do measure an undeniable speedup, close to 35%.
>
> However ... I submit that that's a mighty extreme test case.
> (I had to increase max_locks_per_transaction to get it to run
> at all.) We should not be using that sort of edge case to drive
> performance optimization choices.
>
> If I reduce the number of partitions in Amit's example from 8192
> to something more real-world, like 128, I do still measure a
> performance gain, but it's ~ 1.5% which is below what I'd consider
> a reproducible win. I'm accustomed to seeing changes up to 2%
> in narrow benchmarks like this one, even when "nothing changes"
> except unrelated code.

I'm not sure it's actually that narrow these days. With all the partitioning improvements happening, the numbers of locks commonly held are going to rise. And while 8192 partitions is maybe on the more extreme side, it's a workload with only a single table, and plenty of workloads touch more than a single partitioned table.

> Trying a standard pgbench test case (pgbench -M prepared -S with
> one client and an -s 10 database), it seems that the patch is about
> 0.5% slower than HEAD. Again, that's below the noise threshold,
> but it's not promising for the net effects of this patch on workloads
> that aren't specifically about large and prunable partition sets.

Yea, that's concerning.

> I'm also fairly concerned about the effects of the patch on
> sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88
> bytes, a 22% increase. That's a lot if you're considering cases
> with many locks.

I'm not sure I'm quite that concerned.
For one, a good bit of that space was up for grabs until the recent reordering of LOCALLOCK and nobody complained. But more importantly, I think commonly the number of locks around is fairly constrained, isn't it? We can't really have that many concurrently held locks, due to the shared memory space, and the size of a LOCALLOCK isn't that big compared to, say, relcache entries. We also probably fairly easily could win some space back - e.g. make nLocks 32 bits.

I wonder if one approach to solve this wouldn't be to just make the hashtable drastically smaller. Right now we'll often have lots of empty entries that are 72 bytes + dynahash overhead. That's plenty of memory that needs to be skipped over. I wonder if we instead should have an array of held locallocks, and a hashtable with {hashcode, offset_in_array} + custom comparator for lookups. That'd mean we could either scan the array of locallocks at release (which'd need to skip over entries that have already been released), or we could scan the much smaller hashtable sequentially.

I don't think the above idea is quite there, and I'm tired, but I thought it might still be worth bringing up.

> I spent some time wondering whether we could adjust the data structure
> so that all the live entries in a hashtable are linked into one chain,
> but I don't quite see how to do it without adding another list link to
> struct HASHELEMENT, which seems pretty expensive.

Yea :(

Greetings, Andres Freund
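Andres's array-of-locallocks idea — a dense array walked at release time, plus a small {hashcode, offset} index for lookups — might look roughly like this as standalone C. All names and sizes are hypothetical; a real LOCALLOCK is much larger, and retail release (which, as Tom points out below, would leave holes in the array) is omitted:

```c
#include <string.h>

#define MAX_LOCKS  64            /* sketch assumes fewer locks than this */
#define HASH_SLOTS 128           /* power of two, larger than MAX_LOCKS */

/* Dense array of held locks: release-all just walks locks[0..nlocks-1]. */
static int locks[MAX_LOCKS];     /* lock ids; a real entry would be bigger */
static int nlocks = 0;

/* Small index: hash slot -> offset into locks[], or -1 if empty. */
static int slot[HASH_SLOTS];

static unsigned
hash_id(int id)
{
    return (unsigned) id * 2654435761u;     /* Fibonacci hashing */
}

/* Called at transaction start / after release-all. */
static void
reset_index(void)
{
    memset(slot, -1, sizeof(slot));         /* 0xFF bytes == -1 as int */
    nlocks = 0;
}

static void
acquire(int id)
{
    unsigned h = hash_id(id) & (HASH_SLOTS - 1);

    while (slot[h] != -1)                   /* linear probing */
        h = (h + 1) & (HASH_SLOTS - 1);
    locks[nlocks] = id;
    slot[h] = nlocks++;
}

/* Returns the offset of the lock in locks[], or -1 if not held. */
static int
lookup(int id)
{
    unsigned h = hash_id(id) & (HASH_SLOTS - 1);

    while (slot[h] != -1)
    {
        if (locks[slot[h]] == id)
            return slot[h];
        h = (h + 1) & (HASH_SLOTS - 1);
    }
    return -1;
}
```

The point of the layout is that the per-entry payload lives in a compact array (cheap to scan at commit), while the probe table holds only small {hash, offset} pairs.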
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Peter Eisentraut writes:
> I can't detect any performance improvement with the patch applied to
> current master, using the test case from Yoshikazu Imai (2019-03-19).

FWIW, I tried this patch against current HEAD (959d00e9d). Using the test case described by Amit at

I do measure an undeniable speedup, close to 35%.

However ... I submit that that's a mighty extreme test case. (I had to increase max_locks_per_transaction to get it to run at all.) We should not be using that sort of edge case to drive performance optimization choices.

If I reduce the number of partitions in Amit's example from 8192 to something more real-world, like 128, I do still measure a performance gain, but it's ~ 1.5%, which is below what I'd consider a reproducible win. I'm accustomed to seeing changes up to 2% in narrow benchmarks like this one, even when "nothing changes" except unrelated code.

Trying a standard pgbench test case (pgbench -M prepared -S with one client and an -s 10 database), it seems that the patch is about 0.5% slower than HEAD. Again, that's below the noise threshold, but it's not promising for the net effects of this patch on workloads that aren't specifically about large and prunable partition sets.

I'm also fairly concerned about the effects of the patch on sizeof(LOCALLOCK) --- on a 64-bit machine it goes from 72 to 88 bytes, a 22% increase. That's a lot if you're considering cases with many locks.

On the whole I don't think there's an adequate case for committing this patch.

I'd also point out that this is hardly the only place where we've seen hash_seq_search on nearly-empty hash tables become a bottleneck. So I'm not thrilled about attacking that with one-table-at-time patches. I'd rather see us do something to let hash_seq_search win across the board.

I spent some time wondering whether we could adjust the data structure so that all the live entries in a hashtable are linked into one chain, but I don't quite see how to do it without adding another list link to struct HASHELEMENT, which seems pretty expensive. I'll sketch the idea I had, just in case it triggers a better idea in someone else.

Assuming we are willing to add another pointer to HASHELEMENT, use the two pointers to create a doubly-linked circular list that includes all live entries in the hashtable, with a list header in the hashtable's control area. (Maybe we'd use a dlist for this, but it's not essential.) Require this list to be organized so that all entries that belong to the same hash bucket are consecutive in the list, and make each non-null hash bucket header point to the first entry in the list for its bucket. To allow normal searches to detect when they've run through their bucket, add a flag to HASHELEMENT that is set only in entries that are the first, or perhaps last, of their bucket (so you'd detect end-of-bucket by checking the flag instead of testing for a null pointer). Adding a bool field is free due to alignment considerations, at least on 64-bit machines.

Given this, I think normal hash operations are more-or-less the same cost as before, while hash_seq_search just has to follow the circular list.

I tried to figure out how to do the same thing with a singly-linked instead of doubly-linked list, but it doesn't quite work: if you need to remove the first element of a bucket, you have no cheap way to find its predecessor in the overall list (which belongs to some other bucket, but you don't know which one). Maybe we could just mark such entries dead (there's plenty of room for another flag bit) and plan to clean them up later? But it's not clear how to ensure that they'd get cleaned up in any sort of timely fashion.

Another issue is that probably none of this works for the partitioned hash tables we use for some of the larger shared-memory hashes. But I'm not sure we care about hash_seq_search for those, so maybe we just say those are a different data structure.

regards, tom lane
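The payoff of threading live entries into one chain, as sketched above, can be demonstrated with a stripped-down table. This toy keeps only the circular all-entries list alongside conventional per-bucket chains; it omits the consecutive-per-bucket layout and the first-of-bucket flag, but it shows why a seq scan then costs O(live entries) rather than O(buckets). All names are invented, not dynahash's:

```c
#include <stdlib.h>

#define NBUCKETS 1024            /* deliberately oversized, as in the thread */

typedef struct Elem
{
    int          key;
    struct Elem *chain;          /* per-bucket chain, as exists today */
    struct Elem *all_prev;       /* proposed: circular list of live entries */
    struct Elem *all_next;
} Elem;

static Elem *buckets[NBUCKETS];
static Elem  all_head;           /* list header in the table's control area */

static void
table_init(void)
{
    all_head.all_prev = all_head.all_next = &all_head;
}

static void
table_insert(int key)
{
    Elem *e = calloc(1, sizeof(Elem));
    int   b = (unsigned) key % NBUCKETS;

    e->key = key;
    e->chain = buckets[b];       /* push onto the bucket chain */
    buckets[b] = e;
    /* also link into the circular all-entries list */
    e->all_next = all_head.all_next;
    e->all_prev = &all_head;
    all_head.all_next->all_prev = e;
    all_head.all_next = e;
}

/* Normal point lookup still walks only its bucket chain. */
static int
bucket_lookup(int key)
{
    Elem *e;

    for (e = buckets[(unsigned) key % NBUCKETS]; e; e = e->chain)
        if (e->key == key)
            return 1;
    return 0;
}

/* Seq scan touches only live entries, not all NBUCKETS bucket headers. */
static int
seq_count(void)
{
    Elem *e;
    int   n = 0;

    for (e = all_head.all_next; e != &all_head; e = e->all_next)
        n++;
    return n;
}
```

With a handful of live entries in 1024 buckets, `seq_count` visits just those entries, which is exactly the nearly-empty-table case the thread is worried about.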
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 2019-03-19 10:21, Tsunakawa, Takayuki wrote: > From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com] >> Fixed. > > Rebased on HEAD. Do you need the dlist_foreach_modify() calls? You are not actually modifying the list structure. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
On Fri, Apr 5, 2019 at 0:05 AM, Tsunakawa, Takayuki wrote: > From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com] > > I can't detect any performance improvement with the patch applied to > > current master, using the test case from Yoshikazu Imai (2019-03-19). > > That's strange... Peter, Imai-san, can you compare your test procedures? Just for make sure, I described my test procedures in detail. I install and setup HEAD and patched as follows. [HEAD(413ccaa)] (git pull) ./configure --prefix=/usr/local/pgsql-dev --enable-depend make clean make make install su postgres export PATH=/usr/local/pgsql-dev/bin:$PATH initdb -D /var/lib/pgsql/data-dev vi /var/lib/pgsql/data-dev/postgresql.conf port = 44201 plan_cache_mode = 'auto' or 'force_custom_plan' max_parallel_workers = 0 max_parallel_workers_per_gather = 0 max_locks_per_transaction = 4096 pg_ctl -D /var/lib/pgsql/data-dev start [HEAD(413ccaa) + patch] (git pull) patch -u -p1 < 0002.patch ./configure --prefix=/usr/local/pgsql-locallock --enable-depend make clean make make install su postgres export PATH=/usr/local/pgsql-locallock/bin:$PATH initdb -D /var/lib/pgsql/data-locallock vi /var/lib/pgsql/data-locallock/postgresql.conf port = 44301 plan_cache_mode = 'auto' or 'force_custom_plan' max_parallel_workers = 0 max_parallel_workers_per_gather = 0 max_locks_per_transaction = 4096 pg_ctl -D /var/lib/pgsql/data-locallock start And I tested as follows. 
(creating partitioned table for port 44201)
(creating partitioned table for port 44301)
(creating select4096.sql)

for i in `seq 1 5`; do
  pgbench -n -f select4096.sql -T 60 -M prepared -p 44201 | grep including;
  pgbench -n -f select4096.sql -T 60 -M prepared -p 44301 | grep including;
done

tps = 8146.039546 (including connections establishing)
tps = 9021.340872 (including connections establishing)
tps = 8011.186017 (including connections establishing)
tps = 8926.191054 (including connections establishing)
tps = 8006.769690 (including connections establishing)
tps = 9028.716806 (including connections establishing)
tps = 8057.709961 (including connections establishing)
tps = 9017.713714 (including connections establishing)
tps = 7956.332863 (including connections establishing)
tps = 9126.650533 (including connections establishing)

Thanks
--
Yoshikazu Imai
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Amit-san, Imai-san, From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp] > I was able to detect it as follows. > plan_cache_mode = auto > HEAD: 1915 tps > Patched: 2394 tps > > plan_cache_mode = custom (non-problematic: generic plan is never created) > HEAD: 2402 tps > Patched: 2393 tps Thanks a lot for the very quick confirmation. I'm relieved to still see the good results. Regards Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
On Fri, Apr 5, 2019 at 1:31 AM, Amit Langote wrote: > On 2019/04/05 5:42, Peter Eisentraut wrote: > > On 2019-04-04 06:58, Amit Langote wrote: > >> Also, since the "speed up partition planning" patch went in > >> (428b260f8), it might be possible to see the performance boost even > >> with the partitioning example you cited upthread. > > > > I can't detect any performance improvement with the patch applied to > > current master, using the test case from Yoshikazu Imai (2019-03-19). > > I was able to detect it as follows. > > * partitioned table setup: > > $ cat ht.sql > drop table ht cascade; > create table ht (a int primary key, b int, c int) partition by hash (a); > select 'create table ht' || x::text || ' partition of ht for values with > (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0, > 8191) x; > \gexec > > * pgbench script: > > $ cat select.sql > \set param random(1, 8192) > select * from ht where a = :param > > * pgbench (5 minute run with -M prepared) > > pgbench -n -M prepared -T 300 -f select.sql > > * tps: > > plan_cache_mode = auto > > HEAD: 1915 tps > Patched: 2394 tps > > plan_cache_mode = custom (non-problematic: generic plan is never created) > > HEAD: 2402 tps > Patched: 2393 tps Amit-san, thanks for testing this. I also re-ran my tests (3/19) with HEAD(413ccaa) and HEAD(413ccaa) + patched, and I can still detect the performance difference with plan_cache_mode = auto. Thanks -- Yoshikazu Imai
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 2019/04/05 5:42, Peter Eisentraut wrote: > On 2019-04-04 06:58, Amit Langote wrote: >> Also, since the "speed up partition planning" patch went in (428b260f8), >> it might be possible to see the performance boost even with the >> partitioning example you cited upthread. > > I can't detect any performance improvement with the patch applied to > current master, using the test case from Yoshikazu Imai (2019-03-19). I was able to detect it as follows.

* partitioned table setup:

$ cat ht.sql
drop table ht cascade;
create table ht (a int primary key, b int, c int) partition by hash (a);
select 'create table ht' || x::text || ' partition of ht for values with (modulus 8192, remainder ' || (x)::text || ');' from generate_series(0, 8191) x;
\gexec

* pgbench script:

$ cat select.sql
\set param random(1, 8192)
select * from ht where a = :param

* pgbench (5 minute run with -M prepared):

pgbench -n -M prepared -T 300 -f select.sql

* tps:

plan_cache_mode = auto
  HEAD: 1915 tps
  Patched: 2394 tps

plan_cache_mode = custom (non-problematic: generic plan is never created)
  HEAD: 2402 tps
  Patched: 2393 tps

Thanks, Amit
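[Editorial note: the auto-mode numbers above imply roughly a 25% throughput gain while custom mode is flat within noise. A quick, illustrative check of that arithmetic (numbers copied from the message):]

```python
# tps figures quoted in the message above.
head_auto, patched_auto = 1915, 2394
head_custom, patched_custom = 2402, 2393

auto_gain = (patched_auto - head_auto) / head_auto
custom_gain = (patched_custom - head_custom) / head_custom

print(f"auto:   {auto_gain:+.1%}")    # about +25%: LockReleaseAll() no longer scans a bloated table
print(f"custom: {custom_gain:+.1%}")  # within noise: custom plans never bloat the lock table
```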
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Peter, Imai-san, From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com] > I can't detect any performance improvement with the patch applied to > current master, using the test case from Yoshikazu Imai (2019-03-19). That's strange... Peter, Imai-san, can you compare your test procedures? Peter, can you check and see the performance improvement with David's method on March 26 instead? Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 2019-04-04 06:58, Amit Langote wrote: > Also, since the "speed up partition planning" patch went in (428b260f8), > it might be possible to see the performance boost even with the > partitioning example you cited upthread. I can't detect any performance improvement with the patch applied to current master, using the test case from Yoshikazu Imai (2019-03-19). -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Hi, On 2019/04/04 13:37, Tsunakawa, Takayuki wrote: > Hi Peter, > > From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com] >> I did a bit of performance testing, both a plain pgbench and the >> suggested test case with 4096 partitions. I can't detect any >> performance improvements. In fact, within the noise, it tends to be >> just a bit on the slower side. >> >> So I'd like to kick it back to the patch submitter now and ask for more >> justification and performance analysis. >> >> Perhaps "speeding up planning with partitions" needs to be accepted first? > > David kindly showed how to demonstrate the performance improvement on March > 26, so I changed the status to needs review. I'd appreciate it if you could > continue the final check. Also, since the "speed up partition planning" patch went in (428b260f8), it might be possible to see the performance boost even with the partitioning example you cited upthread. Thanks, Amit
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Peter, From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com] > I did a bit of performance testing, both a plain pgbench and the > suggested test case with 4096 partitions. I can't detect any > performance improvements. In fact, within the noise, it tends to be > just a bit on the slower side. > > So I'd like to kick it back to the patch submitter now and ask for more > justification and performance analysis. > > Perhaps "speeding up planning with partitions" needs to be accepted first? David kindly showed how to demonstrate the performance improvement on March 26, so I changed the status to needs review. I'd appreciate it if you could continue the final check. Regards Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > Here's a benchmark doing that using pgbench's script weight feature. Wow, I didn't know that pgbench has evolved to have such a convenient feature. Thanks for telling me how to utilize it in testing. PostgreSQL is cool! Regards Takayuki Tsunakawa
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Amit Langote [mailto:langote_amit...@lab.ntt.co.jp] > My understanding of what David wrote is that the slowness of bloated hash > table is hard to notice, because planning itself is pretty slow. With the > "speeding up planning with partitions" patch, planning becomes quite fast, > so the bloated hash table overhead and so your patch's benefit is easier > to notice. This patch is clearly helpful, but it's just hard to notice > it > when the other big bottleneck is standing in the way. Ah, I see. I failed to recognize the simple fact that without your patch, EXECUTE on a table with many partitions is slow due to the custom planning time proportional to the number of partitions. Thanks for waking up my sleeping head! Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Tue, 26 Mar 2019 at 21:55, David Rowley wrote: > > On Tue, 26 Mar 2019 at 21:23, Tsunakawa, Takayuki > wrote: > > Thank you David for explaining. Although I may not understand the effect > > of "speeding up planning with partitions" patch, this patch takes effect > > even without it. That is, perform the following in the same session: > > > > 1. SELECT count(*) FROM table; on a table with many partitions. That > > bloats the LocalLockHash. > > 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1; > > 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate > > transaction. Without the patch, each transaction's LockReleaseAll() has to > > scan the bloated large hash table. > > Oh. I think I see what you're saying. Really the table in #2 would > have to be some completely different table that's not partitioned. I > think in that case it should make a difference. Here's a benchmark doing that using pgbench's script weight feature. I've set this up so the query that hits the partitioned table runs once for every 10k times the other script runs. I picked that number so the lock table was expanded fairly early on in the benchmark.
setup:

create table t1 (a int primary key);
create table hp (a int) partition by hash (a);
select 'create table hp'||x|| ' partition of hp for values with (modulus 4096, remainder ' || x || ');' from generate_series(0,4095) x;
\gexec

hp.sql:
select count(*) from hp;

t1.sql:
\set p 1
select a from t1 where a = :p;

Master = c8c885b7a5

Master:
$ pgbench -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@1 postgres
SQL script 2: t1.sql
 - 1057306 transactions (100.0% of total, tps = 17621.756473)
 - 1081905 transactions (100.0% of total, tps = 18021.449914)
 - 1122420 transactions (100.0% of total, tps = 18690.804699)

Master + 0002-speed-up-LOCALLOCK-scan.patch:
$ pgbench -T 60 -M prepared -n -f hp.sql@1 -f t1.sql@1 postgres
SQL script 2: t1.sql
 - 1277014 transactions (100.0% of total, tps = 21283.551615)
 - 1184052 transactions (100.0% of total, tps = 19734.185872)
 - 1188523 transactions (100.0% of total, tps = 19785.835662)

-- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
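[Editorial note: pgbench's "@weight" suffix picks a script for each transaction with probability weight / total_weight. A small Python model of that selection rule, using illustrative weights matching the roughly 1-per-10k ratio David describes (this is a sketch of the idea, not pgbench's C code):]

```python
import random

def pick_script(scripts, rng):
    """Choose a script the way pgbench's @weight suffix does:
    each transaction picks script i with probability weight_i / total."""
    total = sum(w for _, w in scripts)
    r = rng.randrange(total)
    for name, weight in scripts:
        if r < weight:
            return name
        r -= weight

rng = random.Random(42)
# hp.sql (hits all 4096 partitions, bloats the local lock table) is run
# about once per 10,001 transactions; t1.sql is the cheap point query.
scripts = [("hp.sql", 1), ("t1.sql", 10_000)]
counts = {"hp.sql": 0, "t1.sql": 0}
for _ in range(100_000):
    counts[pick_script(scripts, rng)] += 1
print(counts)  # hp.sql is rare (~10 of 100k), but runs early enough to bloat the table
```

The design point is that the rare script bloats the lock table once, after which every cheap t1.sql transaction pays the LockReleaseAll() scan cost, so the tps numbers for t1.sql expose the overhead directly.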
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Tue, 26 Mar 2019 at 21:23, Tsunakawa, Takayuki wrote: > Thank you David for explaining. Although I may not understand the effect of > "speeding up planning with partitions" patch, this patch takes effect even > without it. That is, perform the following in the same session: > > 1. SELECT count(*) FROM table; on a table with many partitions. That bloats > the LocalLockHash. > 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1; > 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate > transaction. Without the patch, each transaction's LockReleaseAll() has to > scan the bloated large hash table. Oh. I think I see what you're saying. Really the table in #2 would have to be some completely different table that's not partitioned. I think in that case it should make a difference. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
Tsunakawa-san, On 2019/03/26 17:21, Tsunakawa, Takayuki wrote: > From: David Rowley [mailto:david.row...@2ndquadrant.com] >> On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut >> wrote: >>> Perhaps "speeding up planning with partitions" needs to be accepted first? >> >> Yeah, I think it likely will require that patch to be able to measure >> the gains from this patch. >> >> If planning a SELECT to a partitioned table with a large number of >> partitions using PREPAREd statements, when we attempt the generic plan >> on the 6th execution, it does cause the local lock table to expand to >> fit all the locks for each partition. This does cause the >> LockReleaseAll() to become slow due to the hash_seq_search having to >> skip over many empty buckets. Since generating a custom plan for a >> partitioned table with many partitions is still slow in master, then I >> very much imagine you'll struggle to see the gains brought by this >> patch. > > Thank you David for explaining. Although I may not understand the effect of > "speeding up planning with partitions" patch, this patch takes effect even > without it. That is, perform the following in the same session: > > 1. SELECT count(*) FROM table; on a table with many partitions. That bloats > the LocalLockHash. > 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1; > 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate > transaction. Without the patch, each transaction's LockReleaseAll() has to > scan the bloated large hash table. My understanding of what David wrote is that the slowness of bloated hash table is hard to notice, because planning itself is pretty slow. With the "speeding up planning with partitions" patch, planning becomes quite fast, so the bloated hash table overhead and so your patch's benefit is easier to notice. This patch is clearly helpful, but it's just hard to notice it when the other big bottleneck is standing in the way. Thanks, Amit
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: David Rowley [mailto:david.row...@2ndquadrant.com] > On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut > wrote: > > Perhaps "speeding up planning with partitions" needs to be accepted first? > > Yeah, I think it likely will require that patch to be able to measure > the gains from this patch. > > If planning a SELECT to a partitioned table with a large number of > partitions using PREPAREd statements, when we attempt the generic plan > on the 6th execution, it does cause the local lock table to expand to > fit all the locks for each partition. This does cause the > LockReleaseAll() to become slow due to the hash_seq_search having to > skip over many empty buckets. Since generating a custom plan for a > partitioned table with many partitions is still slow in master, then I > very much imagine you'll struggle to see the gains brought by this > patch. Thank you David for explaining. Although I may not understand the effect of "speeding up planning with partitions" patch, this patch takes effect even without it. That is, perform the following in the same session: 1. SELECT count(*) FROM table; on a table with many partitions. That bloats the LocalLockHash. 2. PREPARE a point query, e.g., SELECT * FROM table WHERE pkey = $1; 3. EXECUTE the PREPAREd query repeatedly, with each EXECUTE in a separate transaction. Without the patch, each transaction's LockReleaseAll() has to scan the bloated large hash table. Regards Takayuki Tsunakawa
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On Mon, 25 Mar 2019 at 23:44, Peter Eisentraut wrote: > I did a bit of performance testing, both a plain pgbench and the > suggested test case with 4096 partitions. I can't detect any > performance improvements. In fact, within the noise, it tends to be > just a bit on the slower side. > > So I'd like to kick it back to the patch submitter now and ask for more > justification and performance analysis. > > Perhaps "speeding up planning with partitions" needs to be accepted first? Yeah, I think it likely will require that patch to be able to measure the gains from this patch. If planning a SELECT to a partitioned table with a large number of partitions using PREPAREd statements, when we attempt the generic plan on the 6th execution, it does cause the local lock table to expand to fit all the locks for each partition. This does cause the LockReleaseAll() to become slow due to the hash_seq_search having to skip over many empty buckets. Since generating a custom plan for a partitioned table with many partitions is still slow in master, then I very much imagine you'll struggle to see the gains brought by this patch. I did a quick benchmark too and couldn't measure anything:

create table hp (a int) partition by hash (a);
select 'create table hp'||x|| ' partition of hp for values with (modulus 4096, remainder ' || x || ');' from generate_series(0,4095) x;

bench.sql:
\set p_a 13315
select * from hp where a = :p_a;

Master:
$ pgbench -M prepared -n -T 60 -f bench.sql postgres
tps = 31.844468 (excluding connections establishing)
tps = 32.950154 (excluding connections establishing)
tps = 31.534761 (excluding connections establishing)

Patched:
$ pgbench -M prepared -n -T 60 -f bench.sql postgres
tps = 30.099133 (excluding connections establishing)
tps = 32.157328 (excluding connections establishing)
tps = 32.329884 (excluding connections establishing)

The situation won't be any better with plan_cache_mode = force_generic_plan either.
In this case, we'll only plan once but we'll also have to obtain and release a lock for each partition for each execution of the prepared statement. LockReleaseAll() is going to be slow in that case because it actually has to release a lot of locks. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
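[Editorial note: the slowdown David describes — hash_seq_search skipping many empty buckets after the local lock table has grown — can be illustrated with a toy model. This is plain Python, not PostgreSQL's dynahash; the key shared property is that the bucket array grows on demand but never shrinks, so a full sequential scan stays expensive even when only one lock is held:]

```python
class LocalLockTable:
    """Toy stand-in for the backend-local lock hash table."""

    def __init__(self, nbuckets=16):
        self.buckets = [[] for _ in range(nbuckets)]
        self.nentries = 0

    def insert(self, key):
        # Double the bucket array when full; like dynahash, never shrink it.
        if self.nentries >= len(self.buckets):
            old_keys = [k for b in self.buckets for k in b]
            self.buckets = [[] for _ in range(2 * len(self.buckets))]
            self.nentries = 0
            for k in old_keys:
                self._place(k)
        self._place(key)

    def _place(self, key):
        self.buckets[hash(key) % len(self.buckets)].append(key)
        self.nentries += 1

    def release_all(self):
        # Analogue of LockReleaseAll() via hash_seq_search():
        # walk every bucket, occupied or not.
        visited = 0
        for b in self.buckets:
            visited += 1
            b.clear()
        self.nentries = 0
        return visited

t = LocalLockTable()
for rel in range(4096):        # one query touching 4096 partitions
    t.insert(("relation", rel))
t.release_all()

t.insert(("relation", 1))      # a later single-table transaction
print(t.release_all())         # scans 4096 buckets to release one lock
```

The patch discussed in this thread sidesteps this by tracking held LOCALLOCKs in a dlist, so releasing them no longer needs to walk the whole bucket array.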
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 2019-03-19 16:38, Peter Eisentraut wrote: > On 2019-03-19 10:21, Tsunakawa, Takayuki wrote: >> From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com] >>> Fixed. >> >> Rebased on HEAD. > > I have committed the first patch that reorganizes the struct. I'll have > to spend some time evaluating the performance impact of the second > patch, but it seems OK in principle. Performance tests from others welcome. I did a bit of performance testing, both a plain pgbench and the suggested test case with 4096 partitions. I can't detect any performance improvements. In fact, within the noise, it tends to be just a bit on the slower side. So I'd like to kick it back to the patch submitter now and ask for more justification and performance analysis. Perhaps "speeding up planning with partitions" needs to be accepted first? -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Re: Speed up transaction completion faster after many relations are accessed in a transaction
On 2019-03-19 10:21, Tsunakawa, Takayuki wrote: > From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com] >> Fixed. > > Rebased on HEAD. I have committed the first patch that reorganizes the struct. I'll have to spend some time evaluating the performance impact of the second patch, but it seems OK in principle. Performance tests from others welcome. -- Peter Eisentraut http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
RE: Speed up transaction completion faster after many relations are accessed in a transaction
From: Tsunakawa, Takayuki [mailto:tsunakawa.ta...@jp.fujitsu.com] > Fixed. Rebased on HEAD. Regards Takayuki Tsunakawa [Attachments: 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch, 0002-speed-up-LOCALLOCK-scan.patch]
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Tsunakawa-san, Peter, On Tue, Mar 19, 2019 at 7:53 AM, Tsunakawa, Takayuki wrote: > From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com] > > You posted a link to some performance numbers, but I didn't see the > > test setup explained there. I'd like to get some more information on > > the impact of this. Is there an effect with 100 tables, or do you need 10? > > Imai-san, can you tell us the test setup? Maybe I used this test setup[1]. I tested again with those settings for prepared transactions. I used Tsunakawa-san's patch for locallock[2] (which couldn't be applied to current master, so I fixed it) and Amit's v32 patch for speeding up the planner[3].

[settings]
plan_cache_mode = 'auto' or 'force_custom_plan'
max_parallel_workers = 0
max_parallel_workers_per_gather = 0
max_locks_per_transaction = 4096

[partitioning table definitions (with 4096 partitions)]
create table rt (a int, b int, c int) partition by range (a);
\o /dev/null
select 'create table rt' || x::text || ' partition of rt for values from (' || (x)::text || ') to (' || (x+1)::text || ');' from generate_series(1, 4096) x;
\gexec
\o

[select4096.sql]
\set a random(1, 4096)
select a from rt where a = :a;

[pgbench (with 4096 partitions)]
pgbench -n -f select4096.sql -T 60 -M prepared

[results]
        master  locallock  v32    v32+locallock
auto    21.9    22.9       6,834  7,355
custom  19.7    20.0       7,415  7,252

[1] https://www.postgresql.org/message-id/0F97FA9ABBDBE54F91744A9B37151A51256276%40g01jpexmbkw24
[2] https://www.postgresql.org/message-id/0A3221C70F24FB45833433255569204D1FBDFA00%40G01JPEXMBYT05
[3] https://www.postgresql.org/message-id/9feacaf6-ddb3-96dd-5b98-df5e927b1439%40lab.ntt.co.jp
--
Yoshikazu Imai
RE: Speed up transaction completion faster after many relations are accessed in a transaction
Hi Peter, Imai-san, From: Peter Eisentraut [mailto:peter.eisentr...@2ndquadrant.com] > Your changes in LOCALLOCK still refer to PGPROC, from your first version > of the patch. > > I think the reordering of struct members could be done as a separate > preliminary patch. > > Some more documentation in the comment before dlist_head LocalLocks to > explain this whole mechanism would be nice. Fixed. > You posted a link to some performance numbers, but I didn't see the test > setup explained there. I'd like to get some more information on the > impact of this. Is there an effect with 100 tables, or do you need 10? Imai-san, can you tell us the test setup? Regards Takayuki Tsunakawa [Attachments: 0001-reorder-LOCALLOCK-structure-members-to-compact-the-s.patch, 0002-speed-up-LOCALLOCK-scan.patch]