Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows
On 10/14/2014 03:59 PM, MauMau wrote:
> BTW, in LWLockWaitForVar(), the first line of the following code fragment
> is not necessary, because lwWaitLink is set to head immediately. I think
> it would be good to eliminate as much unnecessary code as possible from
> the spinlock section.
>
>     proc->lwWaitLink = NULL;
>     /* waiters are added to the front of the queue */
>     proc->lwWaitLink = lock->head;

Thanks, fixed!

- Heikki

--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows
From: MauMau maumau...@gmail.com
> Thank you very much. I didn't anticipate such a difficult, complicated
> cause. The user agreed to try the patch tonight. I'll report back the
> result as soon as I get it from him.

The test ran successfully without a hang for 24 hours. It was run with
your patch + the following:

> BTW, in LWLockWaitForVar(), the first line of the following code fragment
> is not necessary, because lwWaitLink is set to head immediately. I think
> it would be good to eliminate as much unnecessary code as possible from
> the spinlock section.
>
>     proc->lwWaitLink = NULL;
>     /* waiters are added to the front of the queue */
>     proc->lwWaitLink = lock->head;

Regards
MauMau
Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows
On 10/13/2014 06:57 PM, Heikki Linnakangas wrote:
>> Hmm, we could set releaseOK in LWLockWaitForVar(), though, when it
>> (re-)queues the backend. That would be less invasive, for sure
>> (attached).
>
> I like this better.

Committed this.

- Heikki
Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows
From: Heikki Linnakangas hlinnakan...@vmware.com
> Committed this.

Thank you very much. I didn't anticipate such a difficult, complicated
cause. The user agreed to try the patch tonight. I'll report back the
result as soon as I get it from him.

BTW, in LWLockWaitForVar(), the first line of the following code fragment
is not necessary, because lwWaitLink is set to head immediately. I think
it would be good to eliminate as much unnecessary code as possible from
the spinlock section.

    proc->lwWaitLink = NULL;
    /* waiters are added to the front of the queue */
    proc->lwWaitLink = lock->head;

Regards
MauMau
Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows
On 10/10/2014 05:08 PM, MauMau wrote:
>> From: Craig Ringer cr...@2ndquadrant.com
>> It sounds like they've produced a test case, so they should be able to
>> with a bit of luck. Or even better, send you the test case.
>
> I asked the user about this. It sounds like the relevant test case
> consists of many scripts. He explained to me that the simplified test
> steps are:
>
> 1. initdb
> 2. pg_ctl start
> 3. Create 16 tables. Each of those tables consists of around 10 columns.
> 4. Insert 1000 rows into each of those 16 tables.
> 5. Launch 16 psql sessions concurrently. Each session updates all 1000
>    rows of one table, e.g., session 1 updates table 1, session 2 updates
>    table 2, and so on.
> 6. Repeat step 5 50 times.
>
> This sounds a bit complicated, but I understood that the core part is 16
> concurrent updates, which should lead to contention on xlog insert slots
> and/or spinlocks.

I was able to reproduce this. I reduced wal_buffers to 64kB, and
NUM_XLOGINSERT_LOCKS to 4, to increase the probability of the deadlock,
and ran a test case as above on my laptop for several hours, and it
finally hung. Will investigate...

- Heikki
Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows
On 2014-10-13 17:56:10 +0300, Heikki Linnakangas wrote:
> So the gist of the problem is that LWLockRelease doesn't wake up
> LW_WAIT_UNTIL_FREE waiters when releaseOK == false. It should, because a
> LW_WAIT_UNTIL_FREE waiter is now free to run if the variable has changed
> in value, and it won't steal the lock from the other backend that's
> waiting to get the lock in exclusive mode, anyway.

I'm not a big fan of that change. Right now we don't iterate the waiters
if releaseOK isn't set. Which is good for the normal lwlock code because
it avoids pointer indirections (of stuff likely residing on another CPU).

Wouldn't it be more sensible to reset releaseOK in *UpdateVar()? I might
just miss something here.

> I noticed another potential bug: LWLockAcquireCommon doesn't use a
> volatile pointer when it sets the value of the protected variable:
>
>     /* If there's a variable associated with this lock, initialize it */
>     if (valptr)
>         *valptr = val;
>
>     /* We are done updating shared state of the lock itself. */
>     SpinLockRelease(&lock->mutex);
>
> If the compiler or CPU decides to reorder those two, so that the
> variable is set after releasing the spinlock, things will break.

Good catch. As Robert says, that should be fine with master, but 9.4
obviously needs it.

Greetings,

Andres Freund

--
Andres Freund                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows
On 10/13/2014 06:26 PM, Andres Freund wrote:
>> So the gist of the problem is that LWLockRelease doesn't wake up
>> LW_WAIT_UNTIL_FREE waiters when releaseOK == false. It should, because
>> a LW_WAIT_UNTIL_FREE waiter is now free to run if the variable has
>> changed in value, and it won't steal the lock from the other backend
>> that's waiting to get the lock in exclusive mode, anyway.
>
> I'm not a big fan of that change. Right now we don't iterate the waiters
> if releaseOK isn't set. Which is good for the normal lwlock code because
> it avoids pointer indirections (of stuff likely residing on another CPU).

I can't get excited about that. It's pretty rare for releaseOK to be
false, and when it's true, you iterate the waiters anyway.

> Wouldn't it be more sensible to reset releaseOK in *UpdateVar()? I might
> just miss something here.

That's not enough. There's no LWLockUpdateVar involved in the example I
gave. And LWLockUpdateVar() already wakes up all LW_WAIT_UNTIL_FREE
waiters, regardless of releaseOK.

Hmm, we could set releaseOK in LWLockWaitForVar(), though, when it
(re-)queues the backend. That would be less invasive, for sure (attached).
I like this better.

BTW, attached is a little test program I wrote to reproduce this more
easily. It exercises the LWLock* calls directly. To run, make and
install, and do "CREATE EXTENSION lwlocktest". Then launch three backends
concurrently that run "select lwlocktest(1)", "select lwlocktest(2)" and
"select lwlocktest(3)". It will deadlock within seconds.

- Heikki

diff --git a/src/backend/storage/lmgr/lwlock.c b/src/backend/storage/lmgr/lwlock.c
index 5453549..cee3f08 100644
--- a/src/backend/storage/lmgr/lwlock.c
+++ b/src/backend/storage/lmgr/lwlock.c
@@ -482,6 +482,7 @@ static inline bool
 LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 {
 	volatile LWLock *lock = l;
+	volatile uint64 *valp = valptr;
 	PGPROC	   *proc = MyProc;
 	bool		retry = false;
 	bool		result = true;
@@ -637,8 +638,8 @@ LWLockAcquireCommon(LWLock *l, LWLockMode mode, uint64 *valptr, uint64 val)
 	}
 
 	/* If there's a variable associated with this lock, initialize it */
-	if (valptr)
-		*valptr = val;
+	if (valp)
+		*valp = val;
 
 	/* We are done updating shared state of the lock itself. */
 	SpinLockRelease(&lock->mutex);
@@ -976,6 +977,12 @@ LWLockWaitForVar(LWLock *l, uint64 *valptr, uint64 oldval, uint64 *newval)
 		lock->tail = proc;
 		lock->head = proc;
 
+		/*
+		 * Set releaseOK, to make sure we get woken up as soon as the lock is
+		 * released.
+		 */
+		lock->releaseOK = true;
+
 		/* Can release the mutex now */
 		SpinLockRelease(&lock->mutex);

lwlocktest.tar.gz
Description: application/gzip
Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows
From: Craig Ringer cr...@2ndquadrant.com
> It sounds like they've produced a test case, so they should be able to
> with a bit of luck. Or even better, send you the test case.

I asked the user about this. It sounds like the relevant test case
consists of many scripts. He explained to me that the simplified test
steps are:

1. initdb
2. pg_ctl start
3. Create 16 tables. Each of those tables consists of around 10 columns.
4. Insert 1000 rows into each of those 16 tables.
5. Launch 16 psql sessions concurrently. Each session updates all 1000
   rows of one table, e.g., session 1 updates table 1, session 2 updates
   table 2, and so on.
6. Repeat step 5 50 times.

This sounds a bit complicated, but I understood that the core part is 16
concurrent updates, which should lead to contention on xlog insert slots
and/or spinlocks.

> Your next step here really needs to be to make this reproducible
> against a debug build. Then see if reverting the xlog scalability work
> actually changes the behaviour, given that you hypothesised that it
> could be involved.

Thank you, but that may be labor-intensive and time-consuming. In
addition, the user uses a machine with multiple CPU cores, while I only
have a desktop PC with two CPU cores. So I doubt I can reproduce the
problem on my PC.

I asked the user to change S_UNLOCK to something like the following and
run the test during this weekend (the next Monday is a national holiday
in Japan).

    #define S_UNLOCK(lock)  InterlockedExchange(lock, 0)

FYI, the user reported today that the problem didn't occur when he ran
the same test for 24 hours on 9.3.5. Do you see something relevant in
9.4?

Regards
MauMau
Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows
On 2014-10-10 23:08:34 +0900, MauMau wrote:
>> From: Craig Ringer cr...@2ndquadrant.com
>> It sounds like they've produced a test case, so they should be able to
>> with a bit of luck. Or even better, send you the test case.
>
> I asked the user about this. It sounds like the relevant test case
> consists of many scripts. He explained to me that the simplified test
> steps are:
>
> 1. initdb
> 2. pg_ctl start
> 3. Create 16 tables. Each of those tables consists of around 10 columns.
> 4. Insert 1000 rows into each of those 16 tables.
> 5. Launch 16 psql sessions concurrently. Each session updates all 1000
>    rows of one table, e.g., session 1 updates table 1, session 2 updates
>    table 2, and so on.
> 6. Repeat step 5 50 times.
>
> This sounds a bit complicated, but I understood that the core part is 16
> concurrent updates, which should lead to contention on xlog insert slots
> and/or spinlocks.

Hm. I've run similar loads on Linux for long enough that I'm relatively
sure I'd have seen this. Could you get them to print out the contents of
the lwlock all these processes are waiting for?

>> Your next step here really needs to be to make this reproducible
>> against a debug build. Then see if reverting the xlog scalability work
>> actually changes the behaviour, given that you hypothesised that it
>> could be involved.

I don't think you can trivially revert the xlog scalability stuff.

> Thank you, but that may be labor-intensive and time-consuming. In
> addition, the user uses a machine with multiple CPU cores, while I only
> have a desktop PC with two CPU cores. So I doubt I can reproduce the
> problem on my PC.

Well, it'll also be labor-intensive for the community to debug.

> I asked the user to change S_UNLOCK to something like the following and
> run the test during this weekend (the next Monday is a national holiday
> in Japan).
>
>     #define S_UNLOCK(lock)  InterlockedExchange(lock, 0)

That shouldn't be required. For one, on 9.4 (not 9.5!) spinlock releases
only need to prevent reordering at the CPU level. As x86 is a TSO (total
store order) architecture, that doesn't require doing anything special.
And even if more were required, on MSVC volatile reads/stores act as
acquire/release fences unless you monkey with the compiler settings.

Greetings,

Andres Freund
[HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows
Hello,

One user reported a hang problem with 9.4 beta2 on Windows. The
PostgreSQL is the 64-bit version. I couldn't find the cause, but want to
solve the problem. Could you help with this?

I heard that the user had run 16 concurrent psql sessions which execute
INSERT and UPDATE statements, which is a write-intensive stress test. He
encountered the hang phenomenon twice, one of which occurred several
hours after the start of the test, and the other about an hour after the
test launch.

The user gave me the stack traces, which I attached at the end of this
mail. The problem appears to be related to the xlog insert scaling. But
I can't figure out where the root cause lies --- WAL slot handling,
spinlock on Windows, or PGSemaphoreLock/Unlock on Windows?

The place I suspect is S_UNLOCK(). It doesn't use any memory barrier.
Is this correct on Intel64 processors?

    #define S_UNLOCK(lock)  (*((volatile slock_t *) (lock)) = 0)

The rest of this mail is the stack trace:

`0043e0a8 7ff8`213d12ee : `0002 `0002 `0001 ` : ntdll!ZwWaitForMultipleObjects+0xa
`0043e0b0 0001`401de68e : ` 7ff5`e000 ` `04fb6b40 : KERNELBASE!WaitForMultipleObjectsEx+0xe1
`0043e390 0001`4023cf11 : `02a55500 `1c117410 80605042`36ad2501 0001`405546e0 : postgres!PGSemaphoreLock+0x6e [d:\pginstaller.auto\postgres.windows-x64\src\backend\port\win32_sema.c @ 145]
`0043e3e0 0001`4006203b : `f9017d56 `0022 ` `0400 : postgres!LWLockAcquireCommon+0x121 [d:\pginstaller.auto\postgres.windows-x64\src\backend\storage\lmgr\lwlock.c @ 625]
`0043e430 0001`4002c182 : `0005 ` `004e2f00 ` : postgres!XLogInsert+0x62b [d:\pginstaller.auto\postgres.windows-x64\src\backend\access\transam\xlog.c @ 1110]
`0043e700 0001`400323b6 : ` ` `0a63 `0289de10 : postgres!log_heap_clean+0x102 [d:\pginstaller.auto\postgres.windows-x64\src\backend\access\heap\heapam.c @ 6561]
`0043e7e0 0001`400320e8 : `040ec5c0 `0a63 `0043f340 `040ec5c0 : postgres!heap_page_prune+0x2a6 [d:\pginstaller.auto\postgres.windows-x64\src\backend\access\heap\pruneheap.c @ 261]
`0043f2f0 0001`4002dc40 : `0057ea30 ` ` `028d1810 : postgres!heap_page_prune_opt+0x148 [d:\pginstaller.auto\postgres.windows-x64\src\backend\access\heap\pruneheap.c @ 150]
`0043f340 0001`4002e7da : `028d1800 `0d26 `0005 `0057ea30 : postgres!heapgetpage+0xa0 [d:\pginstaller.auto\postgres.windows-x64\src\backend\access\heap\heapam.c @ 355]
`0043f3e0 0001`4002802c : ` ` ` ` : postgres!heapgettup_pagemode+0x40a [d:\pginstaller.auto\postgres.windows-x64\src\backend\access\heap\heapam.c @ 944]
`0043f460 0001`40126507 : ` `001d ` `001d : postgres!heap_getnext+0x1c [d:\pginstaller.auto\postgres.windows-x64\src\backend\access\heap\heapam.c @ 1478]
`0043f490 0001`401137f5 : `028d05b0 `028d06c0 ` `028a3d30 : postgres!SeqNext+0x27 [d:\pginstaller.auto\postgres.windows-x64\src\backend\executor\nodeseqscan.c @ 76]
`0043f4c0 0001`4010c7b2 : `0058dba0 `028d05b0 ` ` : postgres!ExecScan+0xd5 [d:\pginstaller.auto\postgres.windows-x64\src\backend\executor\execscan.c @ 167]
`0043f520 0001`4012448d : `028d02e0 `028d02d8 `028d02e0 `00585a00 : postgres!ExecProcNode+0xd2 [d:\pginstaller.auto\postgres.windows-x64\src\backend\executor\execprocnode.c @ 400]
`0043f550 0001`4010c772 : `00587bc0 `028d0110 ` `028d0258 : postgres!ExecModifyTable+0x10d [d:\pginstaller.auto\postgres.windows-x64\src\backend\executor\nodemodifytable.c @ 926]
`0043f610 0001`4010bb6d : `028d0110 `00587bc0 ` `0056c740 : postgres!ExecProcNode+0x92 [d:\pginstaller.auto\postgres.windows-x64\src\backend\executor\execprocnode.c @ 377]
`0043f640 0001`401099d8 : `00570ff0 `0051e400 `028d0110 `005831f0 : postgres!ExecutePlan+0x5d [d:\pginstaller.auto\postgres.windows-x64\src\backend\executor\execmain.c @ 1481]
`0043f680 0001`4024f813 : `00570ff0 `0051e468 `0051c530 `005831f0 : postgres!standard_ExecutorRun+0xa8 [d:\pginstaller.auto\postgres.windows-x64\src\backend\executor\execmain.c @ 319]
`0043f6f0 0001`4024ff5a :
Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows
On 10/09/2014 09:47 PM, MauMau wrote:
> I heard that the user had run 16 concurrent psql sessions which execute
> INSERT and UPDATE statements, which is a write-intensive stress test.
> He encountered the hang phenomenon twice, one of which occurred several
> hours after the start of the test, and the other about an hour after
> the test launch.

It'd be interesting and useful to run this test on a debug build of
PostgreSQL, i.e. one compiled against the debug version of the C library
and with full debuginfo, not just minimal .pdb.

How were the stacks captured - what tool?

--
Craig Ringer                       http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows
On 2014-10-09 22:47:48 +0900, MauMau wrote:
> Hello, One user reported a hang problem with 9.4 beta2 on Windows. The
> PostgreSQL is the 64-bit version. I couldn't find the cause, but want
> to solve the problem. Could you help with this?
>
> I heard that the user had run 16 concurrent psql sessions which execute
> INSERT and UPDATE statements, which is a write-intensive stress test.
> He encountered the hang phenomenon twice, one of which occurred several
> hours after the start of the test, and the other about an hour after
> the test launch.
>
> The user gave me the stack traces, which I attached at the end of this
> mail. The problem appears to be related to the xlog insert scaling. But
> I can't figure out where the root cause lies --- WAL slot handling,
> spinlock on Windows, or PGSemaphoreLock/Unlock on Windows?
>
> The place I suspect is S_UNLOCK(). It doesn't use any memory barrier.
> Is this correct on Intel64 processors?

What precisely do you mean by Intel64? 64-bit x86, or Itanium?

Also, what's the precise workload? Can you reproduce the problem?

Greetings,

Andres Freund
Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows
From: Craig Ringer cr...@2ndquadrant.com
> It'd be interesting and useful to run this test on a debug build of
> PostgreSQL, i.e. one compiled against the debug version of the C
> library and with full debuginfo, not just minimal .pdb.

Although I'm not sure the user can do this now, I'll ask him anyway.

> How were the stacks captured - what tool?

According to his mail, WinDbg or userdump.exe. I'll ask him about this.

Regards
MauMau
Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows
From: Andres Freund and...@2ndquadrant.com
> What precisely do you mean by Intel64? 64-bit x86, or Itanium?

64-bit x86, i.e. x86-64.

> Also, what's the precise workload? Can you reproduce the problem?

IIUC, each client inserts 1000 records into one table, then repeatedly
updates all those records. I'll ask him again.

Regards
MauMau
Re: [HACKERS] [9.4 bug] The database server hangs with write-heavy workload on Windows
On 10/10/2014 04:16 AM, MauMau wrote:
>> From: Craig Ringer cr...@2ndquadrant.com
>> It'd be interesting and useful to run this test on a debug build of
>> PostgreSQL, i.e. one compiled against the debug version of the C
>> library and with full debuginfo, not just minimal .pdb.
>
> Although I'm not sure the user can do this now, I'll ask him anyway.

It sounds like they've produced a test case, so they should be able to
with a bit of luck. Or even better, send you the test case.

>> How were the stacks captured - what tool?
>
> According to his mail, WinDbg or userdump.exe. I'll ask him about this.

Thanks. The stack trace looks fairly sane, i.e. there's nothing obviously
out of whack at a glance, but I tend to get more informative traces from
Visual Studio debugging sessions.

Your next step here really needs to be to make this reproducible against
a debug build. Then see if reverting the xlog scalability work actually
changes the behaviour, given that you hypothesised that it could be
involved.

As I said off-list, if you can narrow the test case down to something
that can be reproduced more quickly, you could also git-bisect to seek
the commit at fault. Even if the test case takes an hour, that's still
viable:

    $ git bisect start
    $ git bisect bad
    $ git bisect good REL9_3_RC1
    Bisecting: a merge base must be tested
    [e472b921406407794bab911c64655b8b82375196] Avoid deadlocks during insertion into SP-GiST indexes.
    $ git bisect good
    Bisecting: 1026 revisions left to test after this (roughly 10 steps)
    ...

Thanks to the magic of binary search.

--
Craig Ringer