Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Added to TODO list.

---

Simon Riggs wrote:

On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote:

I wrote: Another thought came to mind: maybe the current data layout for LWLocks is bad. Right now, the spinlock that protects each LWLock data struct is itself part of the struct, and since the structs aren't large (circa 20 bytes), the whole thing is usually all in the same cache line. ... Maybe it'd be better to allocate the spinlocks off by themselves.

Well, this is odd. I made up a patch to do this (attached) and found that it pretty much sucks. Still the 4-way Opteron, previous best (slock-no-cmpb and spin-delay-2):

    1    31s
    2    42s
    4    51s
    8    100s

with lwlock-separate added:

    1    31s
    2    52s
    4    106s
    8    213s

What I had expected to see was a penalty in the single-thread case (due to instructions added to the LWLock functions), but there isn't any. I did not expect to see a factor-of-2 penalty for four threads. I guess what this means is that there's no real problem with losing the cache line while manipulating the LWLock, which is what the patch was intended to prevent. Instead, we're paying for swapping two cache lines (the spinlock and the LWLock) across processors instead of just one line. But that should at worst be a 2x inflation of the time previously spent in LWLockAcquire/Release, which is surely not yet all of the application ;-). Why the heck is this so bad?

(Just going back through the whole thread for completeness.) Should we expect that apparently minor changes in shared data structures might be costing equivalently huge penalties in SMP performance elsewhere?

Yes. That's the advice from Intel and AMD; but we should add that there is potential for improving performance also. The possible problem we were trying to avoid here was false sharing of the cache line by lock requestors of locks whose spin locks were adjacent in memory. Splitting the data structure was just one of the possible ways of avoiding that.
The above shows that the possible solution described above didn't work, but there could be others.

One thing we tried in February was padding out the statically defined locks with dummy lock definitions in the enum. This has the effect of ensuring that the most contested locks are very definitely in their own cache line and not shared with others. That showed a noticeable improvement in performance, probably because it costs very little to implement, even if the code would require some explanatory documentation.

The lwlock structure in include/storage/lwlock.h looks like

typedef enum LWLockId
{
    BufMappingLock,
    BufFreelistLock,
    LockMgrLock,
    OidGenLock,
    XidGenLock,
    ProcArrayLock,
    SInvalLock,
    FreeSpaceLock,
    WALInsertLock,
    WALWriteLock,
    ...

Notice that the heavily contested locks (i.e. the first 3 and the WAL locks) are adjacent to at least one other heavily contested lock, so they are certain to be in the same cache line and therefore to cause false sharing (on all CPU types, not just Intel and AMD; ref: the manufacturers' optimization handbooks). This could be replaced by...
typedef enum LWLockId
{
    BufMappingLock,
    BufMappingLock_Padding1, BufMappingLock_Padding2, BufMappingLock_Padding3,
    BufMappingLock_Padding4, BufMappingLock_Padding5, BufMappingLock_Padding6,
    BufMappingLock_Padding7,
    BufFreelistLock,
    BufFreelistLock_Padding1, BufFreelistLock_Padding2, BufFreelistLock_Padding3,
    BufFreelistLock_Padding4, BufFreelistLock_Padding5, BufFreelistLock_Padding6,
    BufFreelistLock_Padding7,
    LockMgrLock,
    LockMgrLock_Padding1, LockMgrLock_Padding2, LockMgrLock_Padding3,
    LockMgrLock_Padding4, LockMgrLock_Padding5, LockMgrLock_Padding6,
    LockMgrLock_Padding7,
    OidGenLock,
    XidGenLock,
    ProcArrayLock,
    SInvalLock,
    FreeSpaceLock,
    WALInsertLock,
    WALInsertLock_Padding1, WALInsertLock_Padding2, WALInsertLock_Padding3,
    WALInsertLock_Padding4, WALInsertLock_Padding5, WALInsertLock_Padding6,
    WALInsertLock_Padding7,
    WALWriteLock,
    WALWriteLock_Padding1, WALWriteLock_Padding2, WALWriteLock_Padding3,
    WALWriteLock_Padding4, WALWriteLock_Padding5, WALWriteLock_Padding6,
    WALWriteLock_Padding7,
    ...

where the number of padding locks is determined by how many lock structures fit within a 128 byte cache line. This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching.
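The padding arithmetic implied above can be sketched as follows; the 16-byte per-slot stride is an assumption for illustration (the thread only says the raw struct is circa 20 bytes, and the real stride depends on the array element size actually used):

```c
#include <assert.h>

/* With a 128-byte cache line and an assumed 16-byte stride per lock
 * slot, one real lock plus 7 dummy enum entries fills exactly one
 * line, matching the 7 _PaddingN entries per contested lock above. */
#define CACHE_LINE_SIZE  128
#define LWLOCK_SLOT_SIZE 16     /* assumed stride; not from the source */

#define LOCKS_PER_LINE (CACHE_LINE_SIZE / LWLOCK_SLOT_SIZE)
#define PADDING_SLOTS  (LOCKS_PER_LINE - 1)   /* dummies per real lock */
```

A larger stride would need fewer dummies; that is why the padding count has to track the platform's lock size.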
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Bruce Momjian pgman@candle.pha.pa.us wrote:

Added to TODO list.

One thing we tried in February was padding out the statically defined locks with dummy lock definitions in the enum. This has the effect of ensuring that the most contested locks are very definitely in their own cache line and not shared with others. That showed a noticeable improvement in performance, probably because it costs very little to implement, even if the code would require some explanatory documentation.

Has this been done? See the LWLOCK_PADDED_SIZE macro in code.

Regards, Qingqing

---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Qingqing Zhou wrote:

Bruce Momjian pgman@candle.pha.pa.us wrote:

Added to TODO list.

One thing we tried in February was padding out the statically defined locks with dummy lock definitions in the enum. This has the effect of ensuring that the most contested locks are very definitely in their own cache line and not shared with others. That showed a noticeable improvement in performance, probably because it costs very little to implement, even if the code would require some explanatory documentation.

Has this been done? See the LWLOCK_PADDED_SIZE macro in code.

Oh, yes, thanks. I thought it had been done, but I couldn't find anything in the area of the code where he proposed the patch.

--
Bruce Momjian    http://candle.pha.pa.us
EnterpriseDB     http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Qingqing Zhou [EMAIL PROTECTED] writes:

One thing we tried in February was padding out the statically defined locks with dummy lock definitions in the enum. Has this been done? See the LWLOCK_PADDED_SIZE macro in code.

Not really --- that patch was intended to ensure that LWLocks don't unnecessarily cross two cache lines. It doesn't ensure that two different LWLocks aren't sharing a cache line. You could do that by increasing LWLOCK_PADDED_SIZE to the cache line size for your hardware, if you know what that is. I think a more effective answer might be to twiddle the order of enum LWLockId items so that the most heavily used LWLocks aren't close together. Haven't looked into it though.

regards, tom lane
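For reference, the padded-union arrangement Tom describes can be sketched like this; only the LWLOCK_PADDED_SIZE name comes from the actual code, and the LWLock fields here are simplified stand-ins:

```c
#include <assert.h>

/* Simplified stand-in for the real LWLock struct (circa 20-32 bytes);
 * the actual field list in lwlock.c differs. */
typedef struct LWLock
{
    int   mutex;        /* embedded spinlock */
    char  exclusive;    /* # of exclusive holders (0 or 1) */
    int   shared;       /* # of shared holders */
    void *head;         /* head of waiter queue */
    void *tail;         /* tail of waiter queue */
} LWLock;

/* Padding each lock to a power of 2 keeps a lock from straddling two
 * cache lines; raising this to the full cache-line size (e.g. 128)
 * would also keep two locks from sharing a line, as suggested above. */
#define LWLOCK_PADDED_SIZE 128

typedef union LWLockPadded
{
    LWLock lock;
    char   pad[LWLOCK_PADDED_SIZE];
} LWLockPadded;
```

An array of LWLockPadded then places each lock at a 128-byte boundary without touching the lock code itself.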
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane [EMAIL PROTECTED] wrote:

Not really --- that patch was intended to ensure that LWLocks don't unnecessarily cross two cache lines. It doesn't ensure that two different LWLocks aren't sharing a cache line. You could do that by increasing LWLOCK_PADDED_SIZE to the cache line size for your hardware, if you know what that is.

Exactly, this is one way -- if we make LWLOCK_PADDED_SIZE big enough, we can assure one lwlock per cacheline. If so, maybe we should plug a check like LMBench into ./configure to guess the current cacheline size. But this way looks like overkill -- a compromise is to pad only some of the LWLocks big enough but not all (for example, the buffer content lock array).

Regards, Qingqing
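A configure-time probe could also be approximated at run time. A minimal sketch, assuming the glibc-specific sysconf name and falling back to a conservative 128 bytes when the probe is unavailable or fails:

```c
#include <assert.h>
#include <unistd.h>

/* Guess the L1 data-cache line size.  _SC_LEVEL1_DCACHE_LINESIZE is a
 * glibc extension (not POSIX), hence the guard; the 128-byte fallback
 * is an assumed safe upper bound, not a measured value. */
static long
guess_cache_line_size(void)
{
    long sz = -1;
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    sz = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
#endif
    return (sz > 0) ? sz : 128;
}
```

A build-time probe (as suggested) avoids doing this on every start, but either way the result would only feed a compile-time LWLOCK_PADDED_SIZE choice.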
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Thu, 03 Nov 2005 18:29:09 +0000 Simon Riggs [EMAIL PROTECTED] wrote:

On Thu, 2005-11-03 at 08:03 -0800, Mark Wong wrote:

On Tue, 01 Nov 2005 07:32:32 +0000 Simon Riggs [EMAIL PROTECTED] wrote:

Concerned about the awful checkpointing. Can you bump wal_buffers to 8192 just to make sure? That's way too high, but just to prove it. We need to reduce the number of blocks to be written at checkpoint.

    bgwriter_all_maxpages    5 -> 15
    bgwriter_all_percent     0.333
    bgwriter_delay           200
    bgwriter_lru_maxpages    5 -> 7
    bgwriter_lru_percent     1
    shared_buffers           set lower to 10 (which should cause some amusement on-list)

Okay, here goes, all with the same source base w/ the lw.patch:

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/44/
only increased wal_buffers to 8192 from 2048
3242 notpm

That looks to me like a clear negative effect from increasing wal_buffers. Try putting it back down to 1024. Looks like we need to plug that gap.

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/43/
only increased bgwriter_all_maxpages to 15, and bgwriter_lru_maxpages to 7
3019 notpm (but more interesting graph)

Man that sucks. What the heck is happening there? Hackers - if you are watching you should see this graph - it shows some very poor behaviour. I'm not happy with that performance at all; any chance you could re-run that exact same test to see if we can get that repeatably?

I see you have vm.dirty_writeback_centisecs = 0 which pretty much means we aren't ever writing to disk by the pdflush daemons, even when the bgwriter is active. Could we set the bgwriter stuff back to default and try vm.dirty_writeback_centisecs = 500

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/47/
3309 notpm

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/45/
Same as the previously listed run with shared_buffers lowered to 1
2503 notpm

Sorry, that was 100,000 not 10,000.
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/46/
2794 notpm

Looks like we need dates on the log_line_prefix so we can check the logs.

Oops again! I didn't check to make sure I had set this correctly before I ran the last two tests, I'll get on it.

...not sure about the oprofile results. Seems to show CreateLWLocks being as high as xlog_insert, which is mad. Either that shows startup time is excessive, or it means the oprofile timing range is too short. Not sure which.

Yeah, we've seen this before. I think I'll have to try pulling the oprofile cvs code to see if there's any improvement. I've been working with oprofile-0.9.1.

Mark
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Tue, 01 Nov 2005 07:32:32 +0000 Simon Riggs [EMAIL PROTECTED] wrote:

On Mon, 2005-10-31 at 16:10 -0800, Mark Wong wrote:

On Thu, 20 Oct 2005 23:03:47 +0100 Simon Riggs [EMAIL PROTECTED] wrote:

On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote:

This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching. I'm not in a position to try this again now on 8.1beta, but I'd welcome a performance test result from anybody that is. I'll supply a patch against 8.1beta for anyone wanting to test this.

Ok, I've produced a few results on a 4 way (8 core) POWER 5 system, which I've just set up and probably needs a bit of tuning. I don't see much difference but I'm wondering if the cacheline sizes are dramatically different from Intel/AMD processors. I still need to take a closer look to make sure I haven't grossly mistuned anything, but I'll let everyone take a look:

Well, the Power 5 architecture probably has the lowest overall memory delay you can get currently, so in some ways that would negate the effects of the patch. (Cacheline is still 128 bytes, AFAICS.) But it's clear the patch isn't significantly better (like it was with 8.0 when we tried this on the 8-way Itanium in Feb).

cvs 20051013
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/
2501 notpm

cvs 20051013 w/ lw.patch
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/
2519 notpm

Could you re-run with wal_buffers = 32 ? (Without patch) Thanks

Ok, sorry for the delay. I've bumped up the wal_buffers to 2048 and redid the disk layout.
Here's where I'm at now:

cvs 20051013
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/40/
3257 notpm

cvs 20051013 w/ lw.patch
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/42/
3285 notpm

Still not much of a difference with the patch. A quick glance over the iostat data suggests I'm still not i/o bound, but the i/o wait is rather high according to vmstat. Will try to see if there's anything else obvious to get the load up higher.

OK, that's fine. I'm glad there's some gain, but not much yet. I think we should leave out doing any more tests on lw.patch for now.

Concerned about the awful checkpointing. Can you bump wal_buffers to 8192 just to make sure? That's way too high, but just to prove it. We need to reduce the number of blocks to be written at checkpoint.

    bgwriter_all_maxpages    5 -> 15
    bgwriter_all_percent     0.333
    bgwriter_delay           200
    bgwriter_lru_maxpages    5 -> 7
    bgwriter_lru_percent     1
    shared_buffers           set lower to 10 (which should cause some amusement on-list)

Okay, here goes, all with the same source base w/ the lw.patch:

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/44/
only increased wal_buffers to 8192 from 2048
3242 notpm

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/43/
only increased bgwriter_all_maxpages to 15, and bgwriter_lru_maxpages to 7
3019 notpm (but more interesting graph)

http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/45/
Same as the previously listed run with shared_buffers lowered to 1
2503 notpm

Mark
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Thu, 20 Oct 2005 23:03:47 +0100 Simon Riggs [EMAIL PROTECTED] wrote:

On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote:

This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching. I'm not in a position to try this again now on 8.1beta, but I'd welcome a performance test result from anybody that is. I'll supply a patch against 8.1beta for anyone wanting to test this.

Ok, I've produced a few results on a 4 way (8 core) POWER 5 system, which I've just set up and probably needs a bit of tuning. I don't see much difference but I'm wondering if the cacheline sizes are dramatically different from Intel/AMD processors. I still need to take a closer look to make sure I haven't grossly mistuned anything, but I'll let everyone take a look:

Well, the Power 5 architecture probably has the lowest overall memory delay you can get currently, so in some ways that would negate the effects of the patch. (Cacheline is still 128 bytes, AFAICS.) But it's clear the patch isn't significantly better (like it was with 8.0 when we tried this on the 8-way Itanium in Feb).

cvs 20051013
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/
2501 notpm

cvs 20051013 w/ lw.patch
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/
2519 notpm

Could you re-run with wal_buffers = 32 ? (Without patch) Thanks

Ok, sorry for the delay. I've bumped up the wal_buffers to 2048 and redid the disk layout.

Here's where I'm at now:

cvs 20051013
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/40/
3257 notpm

cvs 20051013 w/ lw.patch
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/42/
3285 notpm

Still not much of a difference with the patch.
A quick glance over the iostat data suggests I'm still not i/o bound, but the i/o wait is rather high according to vmstat. Will try to see if there's anything else obvious to get the load up higher.

Mark
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Mon, 2005-10-31 at 16:10 -0800, Mark Wong wrote:

On Thu, 20 Oct 2005 23:03:47 +0100 Simon Riggs [EMAIL PROTECTED] wrote:

On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote:

This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching. I'm not in a position to try this again now on 8.1beta, but I'd welcome a performance test result from anybody that is. I'll supply a patch against 8.1beta for anyone wanting to test this.

Ok, I've produced a few results on a 4 way (8 core) POWER 5 system, which I've just set up and probably needs a bit of tuning. I don't see much difference but I'm wondering if the cacheline sizes are dramatically different from Intel/AMD processors. I still need to take a closer look to make sure I haven't grossly mistuned anything, but I'll let everyone take a look:

Well, the Power 5 architecture probably has the lowest overall memory delay you can get currently, so in some ways that would negate the effects of the patch. (Cacheline is still 128 bytes, AFAICS.) But it's clear the patch isn't significantly better (like it was with 8.0 when we tried this on the 8-way Itanium in Feb).

cvs 20051013
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/
2501 notpm

cvs 20051013 w/ lw.patch
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/
2519 notpm

Could you re-run with wal_buffers = 32 ? (Without patch) Thanks

Ok, sorry for the delay. I've bumped up the wal_buffers to 2048 and redid the disk layout.

Here's where I'm at now:

cvs 20051013
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/40/
3257 notpm

cvs 20051013 w/ lw.patch
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/42/
3285 notpm

Still not much of a difference with the patch.
A quick glance over the iostat data suggests I'm still not i/o bound, but the i/o wait is rather high according to vmstat. Will try to see if there's anything else obvious to get the load up higher.

OK, that's fine. I'm glad there's some gain, but not much yet. I think we should leave out doing any more tests on lw.patch for now.

Concerned about the awful checkpointing. Can you bump wal_buffers to 8192 just to make sure? That's way too high, but just to prove it. We need to reduce the number of blocks to be written at checkpoint.

    bgwriter_all_maxpages    5 -> 15
    bgwriter_all_percent     0.333
    bgwriter_delay           200
    bgwriter_lru_maxpages    5 -> 7
    bgwriter_lru_percent     1
    shared_buffers           set lower to 10 (which should cause some amusement on-list)

Best Regards, Simon Riggs
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Wed, 2005-10-19 at 14:07 -0700, Mark Wong wrote:

This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching. I'm not in a position to try this again now on 8.1beta, but I'd welcome a performance test result from anybody that is. I'll supply a patch against 8.1beta for anyone wanting to test this.

Ok, I've produced a few results on a 4 way (8 core) POWER 5 system, which I've just set up and probably needs a bit of tuning. I don't see much difference but I'm wondering if the cacheline sizes are dramatically different from Intel/AMD processors. I still need to take a closer look to make sure I haven't grossly mistuned anything, but I'll let everyone take a look:

Well, the Power 5 architecture probably has the lowest overall memory delay you can get currently, so in some ways that would negate the effects of the patch. (Cacheline is still 128 bytes, AFAICS.) But it's clear the patch isn't significantly better (like it was with 8.0 when we tried this on the 8-way Itanium in Feb).

cvs 20051013
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/19/
2501 notpm

cvs 20051013 w/ lw.patch
http://www.testing.osdl.org/projects/dbt2dev/results/dev4-014/20/
2519 notpm

Could you re-run with wal_buffers = 32 ? (Without patch) Thanks

Best Regards, Simon Riggs
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Wed, Oct 12, 2005 at 03:07:23PM -0400, Emil Briggs wrote:

where the number of padding locks is determined by how many lock structures fit within a 128 byte cache line. This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching. I'm not in a position to try this again now on 8.1beta, but I'd welcome a performance test result from anybody that is. I'll supply a patch against 8.1beta for anyone wanting to test this.

I don't have an 8 way available right now but I can run tests on a 4 way Opteron if that would be helpful.

Likewise I can test on a 4 way opteron running open solaris.

--
Jim C. Nasby, Sr. Engineering Consultant    [EMAIL PROTECTED]
Pervasive Software    http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf    cell: 512-569-9461
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Wed, 12 Oct 2005 18:44:50 +0100 Simon Riggs [EMAIL PROTECTED] wrote:

On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote:

I wrote: Another thought came to mind: maybe the current data layout for LWLocks is bad. Right now, the spinlock that protects each LWLock data struct is itself part of the struct, and since the structs aren't large (circa 20 bytes), the whole thing is usually all in the same cache line. ... Maybe it'd be better to allocate the spinlocks off by themselves.

Well, this is odd. I made up a patch to do this (attached) and found that it pretty much sucks. Still the 4-way Opteron, previous best (slock-no-cmpb and spin-delay-2):

    1    31s
    2    42s
    4    51s
    8    100s

with lwlock-separate added:

    1    31s
    2    52s
    4    106s
    8    213s

What I had expected to see was a penalty in the single-thread case (due to instructions added to the LWLock functions), but there isn't any. I did not expect to see a factor-of-2 penalty for four threads. I guess what this means is that there's no real problem with losing the cache line while manipulating the LWLock, which is what the patch was intended to prevent. Instead, we're paying for swapping two cache lines (the spinlock and the LWLock) across processors instead of just one line. But that should at worst be a 2x inflation of the time previously spent in LWLockAcquire/Release, which is surely not yet all of the application ;-). Why the heck is this so bad?

(Just going back through the whole thread for completeness.) Should we expect that apparently minor changes in shared data structures might be costing equivalently huge penalties in SMP performance elsewhere?

Yes. That's the advice from Intel and AMD; but we should add that there is potential for improving performance also. The possible problem we were trying to avoid here was false sharing of the cache line by lock requestors of locks whose spin locks were adjacent in memory. Splitting the data structure was just one of the possible ways of avoiding that.
The above shows that the possible solution described above didn't work, but there could be others.

One thing we tried in February was padding out the statically defined locks with dummy lock definitions in the enum. This has the effect of ensuring that the most contested locks are very definitely in their own cache line and not shared with others. That showed a noticeable improvement in performance, probably because it costs very little to implement, even if the code would require some explanatory documentation.

The lwlock structure in include/storage/lwlock.h looks like

typedef enum LWLockId
{
    BufMappingLock,
    BufFreelistLock,
    LockMgrLock,
    OidGenLock,
    XidGenLock,
    ProcArrayLock,
    SInvalLock,
    FreeSpaceLock,
    WALInsertLock,
    WALWriteLock,
    ...

Notice that the heavily contested locks (i.e. the first 3 and the WAL locks) are adjacent to at least one other heavily contested lock, so they are certain to be in the same cache line and therefore to cause false sharing (on all CPU types, not just Intel and AMD; ref: the manufacturers' optimization handbooks). This could be replaced by...
typedef enum LWLockId
{
    BufMappingLock,
    BufMappingLock_Padding1, BufMappingLock_Padding2, BufMappingLock_Padding3,
    BufMappingLock_Padding4, BufMappingLock_Padding5, BufMappingLock_Padding6,
    BufMappingLock_Padding7,
    BufFreelistLock,
    BufFreelistLock_Padding1, BufFreelistLock_Padding2, BufFreelistLock_Padding3,
    BufFreelistLock_Padding4, BufFreelistLock_Padding5, BufFreelistLock_Padding6,
    BufFreelistLock_Padding7,
    LockMgrLock,
    LockMgrLock_Padding1, LockMgrLock_Padding2, LockMgrLock_Padding3,
    LockMgrLock_Padding4, LockMgrLock_Padding5, LockMgrLock_Padding6,
    LockMgrLock_Padding7,
    OidGenLock,
    XidGenLock,
    ProcArrayLock,
    SInvalLock,
    FreeSpaceLock,
    WALInsertLock,
    WALInsertLock_Padding1, WALInsertLock_Padding2, WALInsertLock_Padding3,
    WALInsertLock_Padding4, WALInsertLock_Padding5, WALInsertLock_Padding6,
    WALInsertLock_Padding7,
    WALWriteLock,
    WALWriteLock_Padding1, WALWriteLock_Padding2, WALWriteLock_Padding3,
    WALWriteLock_Padding4, WALWriteLock_Padding5, WALWriteLock_Padding6,
    WALWriteLock_Padding7,
    ...

where the number of padding locks is determined by how many lock structures fit within a 128 byte cache line. This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching.
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Wed, 2005-09-14 at 13:32 -0400, Tom Lane wrote:

I wrote: Another thought came to mind: maybe the current data layout for LWLocks is bad. Right now, the spinlock that protects each LWLock data struct is itself part of the struct, and since the structs aren't large (circa 20 bytes), the whole thing is usually all in the same cache line. ... Maybe it'd be better to allocate the spinlocks off by themselves.

Well, this is odd. I made up a patch to do this (attached) and found that it pretty much sucks. Still the 4-way Opteron, previous best (slock-no-cmpb and spin-delay-2):

    1    31s
    2    42s
    4    51s
    8    100s

with lwlock-separate added:

    1    31s
    2    52s
    4    106s
    8    213s

What I had expected to see was a penalty in the single-thread case (due to instructions added to the LWLock functions), but there isn't any. I did not expect to see a factor-of-2 penalty for four threads. I guess what this means is that there's no real problem with losing the cache line while manipulating the LWLock, which is what the patch was intended to prevent. Instead, we're paying for swapping two cache lines (the spinlock and the LWLock) across processors instead of just one line. But that should at worst be a 2x inflation of the time previously spent in LWLockAcquire/Release, which is surely not yet all of the application ;-). Why the heck is this so bad?

(Just going back through the whole thread for completeness.) Should we expect that apparently minor changes in shared data structures might be costing equivalently huge penalties in SMP performance elsewhere?

Yes. That's the advice from Intel and AMD; but we should add that there is potential for improving performance also. The possible problem we were trying to avoid here was false sharing of the cache line by lock requestors of locks whose spin locks were adjacent in memory. Splitting the data structure was just one of the possible ways of avoiding that. The above shows that the possible solution described above didn't work, but there could be others.
One thing we tried in February was padding out the statically defined locks with dummy lock definitions in the enum. This has the effect of ensuring that the most contested locks are very definitely in their own cache line and not shared with others. That showed a noticeable improvement in performance, probably because it costs very little to implement, even if the code would require some explanatory documentation.

The lwlock structure in include/storage/lwlock.h looks like

typedef enum LWLockId
{
    BufMappingLock,
    BufFreelistLock,
    LockMgrLock,
    OidGenLock,
    XidGenLock,
    ProcArrayLock,
    SInvalLock,
    FreeSpaceLock,
    WALInsertLock,
    WALWriteLock,
    ...

Notice that the heavily contested locks (i.e. the first 3 and the WAL locks) are adjacent to at least one other heavily contested lock, so they are certain to be in the same cache line and therefore to cause false sharing (on all CPU types, not just Intel and AMD; ref: the manufacturers' optimization handbooks). This could be replaced by...

typedef enum LWLockId
{
    BufMappingLock,
    BufMappingLock_Padding1, BufMappingLock_Padding2, BufMappingLock_Padding3,
    BufMappingLock_Padding4, BufMappingLock_Padding5, BufMappingLock_Padding6,
    BufMappingLock_Padding7,
    BufFreelistLock,
    BufFreelistLock_Padding1, BufFreelistLock_Padding2, BufFreelistLock_Padding3,
    BufFreelistLock_Padding4, BufFreelistLock_Padding5, BufFreelistLock_Padding6,
    BufFreelistLock_Padding7,
    LockMgrLock,
    LockMgrLock_Padding1, LockMgrLock_Padding2, LockMgrLock_Padding3,
    LockMgrLock_Padding4, LockMgrLock_Padding5, LockMgrLock_Padding6,
    LockMgrLock_Padding7,
    OidGenLock,
    XidGenLock,
    ProcArrayLock,
    SInvalLock,
    FreeSpaceLock,
    WALInsertLock,
    WALInsertLock_Padding1, WALInsertLock_Padding2, WALInsertLock_Padding3,
    WALInsertLock_Padding4, WALInsertLock_Padding5, WALInsertLock_Padding6,
    WALInsertLock_Padding7,
    WALWriteLock,
    WALWriteLock_Padding1, WALWriteLock_Padding2, WALWriteLock_Padding3,
    WALWriteLock_Padding4, WALWriteLock_Padding5, WALWriteLock_Padding6,
    WALWriteLock_Padding7,
    ...
where the number of padding locks is determined by how many lock structures fit within a 128 byte cache line. This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching. I'm not in a position to try this again now on 8.1beta, but I'd welcome a performance test result from anybody that is.
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
where the number of padding locks is determined by how many lock structures fit within a 128 byte cache line. This isn't exactly elegant coding, but it provides a useful improvement on an 8-way SMP box when run on 8.0 base.

OK, let's be brutal: this looks pretty darn stupid. But it does follow the CPU optimization handbook advice and I did see a noticeable improvement in performance and a reduction in context switching. I'm not in a position to try this again now on 8.1beta, but I'd welcome a performance test result from anybody that is. I'll supply a patch against 8.1beta for anyone wanting to test this.

I don't have an 8 way available right now but I can run tests on a 4 way Opteron if that would be helpful.

Emil
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Is there a TODO here? --- Tom Lane wrote: The test case I just posted shows that our spinlock code, which we had thought largely done, is once again becoming a performance bottleneck. It's time to resurrect some of the ideas we kicked around in early 2002, and didn't pursue because we decided spinlocks weren't our major performance problem. I started investigating this by trying to understand why the futex spinlock patch had helped the Red Hat customer I mentioned, when it hadn't done anything much in last year's experiments. It turns out that this customer was using an 8-way Opteron, and the futex patch helps a lot more on that architecture than others. Some random experimentation eventually turned up part of the reason: the x86 TAS assembly code we are using is pessimal for x86_64. Specifically, s_lock.h uses this for both architectures:

	/* Use a non-locking test before asserting the bus lock */
	__asm__ __volatile__(
		"	cmpb	$0,%1	\n"
		"	jne		1f	\n"
		"	lock			\n"
		"	xchgb	%0,%1	\n"
		"1: \n"

and it turns out that deleting the cmpb and jne makes things go significantly faster on Opterons. So the non-locking test is not a win at all on x86_64. As best I can tell from testing, removing it wins because it causes the rate of spinlock delays to drop significantly. Why this should be is not completely obvious. I speculate that the Opterons are capable of pulling in a copy of the cache line that contains the spinlock and then running 100 iterations of the cmpb test without ever noticing that the processor holding the lock has now released it. Without the cmpb, they are forced by the locking xchgb test to actually look at the real state of the spinlock each time. I kinda suspect that the cmpb test is a no-op or loss on all Intelish processors: it can only be a win if you expect a lot of contention for the spin lock, but in percentage terms we still have a very low conflict rate, so in most executions of the TAS macro, the cmpb is just wasted cycles.
Bottom line: we definitely don't want it for x86_64, and maybe not at all, but we need more research to decide the latter. The second reason that the futex patch is helping is that when a spinlock delay does occur, it allows the delaying process to be awoken almost immediately, rather than delaying 10 msec or more as the existing code does. However, given that we are only expecting the spinlock to be held for a couple dozen instructions, using the kernel futex mechanism is huge overkill --- the in-kernel overhead to manage the futex state is almost certainly several orders of magnitude more than the delay we actually want. I looked into several other methods of doing the spinlock delay instead. I think all of these were suggested at one point or another in our earlier discussions of spinlocks: 1. Use sched_yield() if available: it does just what we want, ie, yield the processor without forcing any useless time delay before we can be rescheduled. This doesn't exist everywhere but it exists in recent Linuxen, so I tried it. It made for a beautiful improvement in my test case numbers: CPU utilization went to 100% and the context swap rate to almost nil. Unfortunately, I also saw fairly frequent stuck spinlock panics when running more queries than there were processors --- this despite increasing NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect. (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 10000 yields without running the guy who's holding the spinlock.) 2. Reduce the length of the select() delays in s_lock. The current code delays in quanta of 10msec, but on recent Linuxen (and I think other platforms too) the scheduling quantum is 1msec, so we can get the processor back faster if we ask for a 1msec delay.
I tried this and it is definitely a win on Linux; it doesn't seem to hurt anything on older systems either, they just round the delay up to 10msec like before. So I think we should do this, even though it's only a partial answer. 3. Modify the spin loop to do a little more work between TAS() tests. In the existing code there are only about half a dozen instructions total in the normal spin loop, most of them very fast, with the result that the spinning processor does its best to monopolize the system bus with locked probe instructions. This is obviously not good, as it'll interfere with the ability of the spinlock holder to complete its work and release the lock. (The bulk of the spinlock uses are for LWLocks, and with the current data structures the LWLock's own state is usually going to be in the same cache line as its spinlock.)
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
The fact that cmpb isn't helping proves that getting the cache line in a read-only fashion does *not* do enough to protect the xchgb in this way. But maybe another locking instruction would. Comments? regards, tom lane Hi, Tom! Perhaps this link will help: 'Replay: Unknown Features of the NetBurst Core' http://www.xbitlabs.com/articles/cpu/display/replay.html Best regards, Alexander Kirpa
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
I have removed this TODO item: * Research use of sched_yield() for spinlock acquisition failure --- Tom Lane wrote: Greg Stark [EMAIL PROTECTED] writes: Marko Kreen marko@l-t.ee writes: (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 10000 yields without running the guy who's holding the spinlock.) Maybe it should try sched_yield once and then use select after that? I tried that, actually, but it didn't seem to offer any particular benefit. (Maybe it would have helped more on older Linux kernels before they changed sched_yield?) I'm feeling even more disenchanted with sched_yield now that Marko pointed out that the behavior was changed recently. Here we have a construct that is not portable cross-platform, does not act as documented in its man page, and the kernel guys feel free to whack its behavior around in major ways without documenting that either. It seems to be a crap-shoot whether it will be useful on any particular machine or not. At least with the select() code we can be reasonably confident we know what will happen. regards, tom lane -- Bruce Momjian | http://candle.pha.pa.us | pgman@candle.pha.pa.us
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Sat, Sep 17, 2005 at 01:40:28AM -0400, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: On Sat, 17 Sep 2005, Tom Lane wrote: It'd be real interesting to see comparable numbers from some non-Linux kernels, particularly commercial systems like Solaris. Did you see the Solaris results I posted? Are you speaking of http://archives.postgresql.org/pgsql-hackers/2005-09/msg00715.php ? That doesn't seem directly relevant to the point, because it's for a 2-CPU machine; so there's no way to run a test case that uses more than one but less than all the processors. In either the one or all cases, performance ought to be pretty stable regardless of whether the kernel understands about any processor asymmetries that may exist in the hardware. Not to mention that I don't know of any asymmetries in a dual SPARC anyway. We really need to test this on comparable hardware, which I guess means we need Solaris/x86 on something with hyperthreading or known NUMA asymmetry. I have access to a 4-way Opteron 852 running Solaris 10. What patches would you like me to test? -- Jim C. Nasby, Sr. Engineering Consultant [EMAIL PROTECTED] Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Sun, 18 Sep 2005, Jim C. Nasby wrote: On Sat, Sep 17, 2005 at 01:40:28AM -0400, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: On Sat, 17 Sep 2005, Tom Lane wrote: It'd be real interesting to see comparable numbers from some non-Linux kernels, particularly commercial systems like Solaris. Did you see the Solaris results I posted? Are you speaking of http://archives.postgresql.org/pgsql-hackers/2005-09/msg00715.php ? That doesn't seem directly relevant to the point, because it's for a 2-CPU machine; so there's no way to run a test case that uses more than one but less than all the processors. In either the one or all cases, performance ought to be pretty stable regardless of whether the kernel understands about any processor asymmetries that may exist in the hardware. Not to mention that I don't know of any asymmetries in a dual SPARC anyway. We really need to test this on comparable hardware, which I guess means we need Solaris/x86 on something with hyperthreading or known NUMA asymmetry. I have access to a 4-way Opteron 852 running Solaris 10. What patches would you like me to test? These ones here: http://archives.postgresql.org/pgsql-hackers/2005-09/msg00566.php Gavin
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
I thought I'd throw SPARC into the equation (SPARC IIIi, in a dual SunFire v250):

vanilla HEAD from ~1 week ago:

bash-3.00$ for i in 1 2 4; do time ./nrun.sh $i; done
real	1m49.037s	user	0m0.008s	sys	0m0.016s
real	2m3.482s	user	0m0.014s	sys	0m0.026s
real	3m54.215s	user	0m0.028s	sys	0m0.046s

With the spin-delay patch:

real	1m50.791s	user	0m0.008s	sys	0m0.016s
real	2m0.553s	user	0m0.015s	sys	0m0.026s
real	3m58.185s	user	0m0.027s	sys	0m0.048s

Padding the LWLock to 64 bytes:

bash-3.00$ for i in 1 2 4; do time ./nrun.sh $i; done
real	1m49.594s	user	0m0.008s	sys	0m0.016s
real	1m59.141s	user	0m0.015s	sys	0m0.026s
real	3m56.236s	user	0m0.027s	sys	0m0.047s

Padded to 128 bytes (I'm not sure about the cache line size):

real	1m50.498s	user	0m0.008s	sys	0m0.016s
real	2m0.169s	user	0m0.015s	sys	0m0.026s
real	3m56.891s	user	0m0.027s	sys	0m0.048s

These results seem suspicious to me when compared to those we're seeing on x86 platforms, because padding the LWLock out to 64 or 128 bytes makes no difference (maybe the cache line is bigger?). The good thing is, the difference between 1 and 2 instances of the test is small -- precisely what we want. Thanks, Gavin
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom, What I think this means is that the kernel is scheduling the 2 processes onto 2 processors chosen-at-random, without awareness of whether those two processors are on the same chip (in the Xeon case) or have closer NUMA affinity (in the Opteron case). That would be consistent with my experience with HT, and the reason why many software vendors recommend disabling it. Not sure about NUMA. -- Josh Berkus Aglio Database Solutions San Francisco
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Josh Berkus josh@agliodbs.com writes: Tom, What I think this means is that the kernel is scheduling the 2 processes onto 2 processors chosen-at-random, without awareness of whether those two processors are on the same chip (in the Xeon case) or have closer NUMA affinity (in the Opteron case). That would be consistent with my experience with HT, and the reason why many software vendors recommend disabling it. Not sure about NUMA. What version of linux was this test with? What you describe was certainly well known for Linux 2.4 and earlier. It was well on its way to becoming established dogma cargo-cult style. However I was under the impression that 2.6 had moved beyond that problem. It would be very interesting to know if 2.6 still suffers from this. Also, those who recommend disabling HT on this basis should realize it didn't penalize the 2 process case the same way. You may be getting a benefit only in the cases where your system isn't heavily loaded. Ie, only when it doesn't really matter. -- greg
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Greg Stark ([EMAIL PROTECTED]) wrote: However I was under the impression that 2.6 had moved beyond that problem. It would be very interesting to know if 2.6 still suffers from this. The tests on the em64t at my place were using 2.6.12. I had thought 2.6 was better about this too, but I don't have another explanation for it. CONFIG_SMT was enabled, etc.. Thanks, Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Stephen Frost [EMAIL PROTECTED] writes: * Greg Stark ([EMAIL PROTECTED]) wrote: However I was under the impression that 2.6 had moved beyond that problem. It would be very interesting to know if 2.6 still suffers from this. The tests on the em64t at my place were using 2.6.12. I had thought 2.6 was better about this too, but I don't have another explanation for it. The 4-way Opteron I've been using at Red Hat is running 2.6.12-1.1398_FC4smp (Fedora Core 4 obviously). Red Hat in particular has been working hard in this area, and I thought that their recent kernels included NUMA fixes that weren't yet accepted upstream (at least not in the stable kernel branches). But it seems there's still a ways to go yet. It'd be real interesting to see comparable numbers from some non-Linux kernels, particularly commercial systems like Solaris. regards, tom lane
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Sat, 17 Sep 2005, Tom Lane wrote: Stephen Frost [EMAIL PROTECTED] writes: * Greg Stark ([EMAIL PROTECTED]) wrote: However I was under the impression that 2.6 had moved beyond that problem. It would be very interesting to know if 2.6 still suffers from this. The tests on the em64t at my place were using 2.6.12. I had thought 2.6 was better about this too, but I don't have another explanation for it. The 4-way Opteron I've been using at Red Hat is running 2.6.12-1.1398_FC4smp (Fedora Core 4 obviously). Red Hat in particular has been working hard in this area, and I thought that their recent kernels included NUMA fixes that weren't yet accepted upstream (at least not in the stable kernel branches). But it seems there's still a ways to go yet. It'd be real interesting to see comparable numbers from some non-Linux kernels, particularly commercial systems like Solaris. Did you see the Solaris results I posted? Gavin
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Gavin Sherry [EMAIL PROTECTED] writes: On Sat, 17 Sep 2005, Tom Lane wrote: It'd be real interesting to see comparable numbers from some non-Linux kernels, particularly commercial systems like Solaris. Did you see the Solaris results I posted? Are you speaking of http://archives.postgresql.org/pgsql-hackers/2005-09/msg00715.php ? That doesn't seem directly relevant to the point, because it's for a 2-CPU machine; so there's no way to run a test case that uses more than one but less than all the processors. In either the one or all cases, performance ought to be pretty stable regardless of whether the kernel understands about any processor asymmetries that may exist in the hardware. Not to mention that I don't know of any asymmetries in a dual SPARC anyway. We really need to test this on comparable hardware, which I guess means we need Solaris/x86 on something with hyperthreading or known NUMA asymmetry. regards, tom lane
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote: I guess what this means is that there's no real problem with losing the cache line while manipulating the LWLock, which is what the patch was intended to prevent. Instead, we're paying for swapping two cache lines (the spinlock and the LWLock) across processors instead of just one line. But that should at worst be a 2x inflation of the time previously spent in LWLockAcquire/Release, which is surely not yet all of the application ;-). Why the heck is this so bad? Should we expect that apparently minor changes in shared data structures might be costing equivalently huge penalties in SMP performance elsewhere? Unfortunately I don't have root on the Opteron and can't run oprofile. But I'd really like to see some oprofile stats from these two cases so we can figure out what in the world is going on here. Can anyone help? I will try the patch here and see if it gives the same result. If so I could try to run with oprofile if you can give me a quick start. Best Regards, Michael Paesold
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
I wrote: I guess what this means is that there's no real problem with losing the cache line while manipulating the LWLock, which is what the patch was intended to prevent. Instead, we're paying for swapping two cache lines (the spinlock and the LWLock) across processors instead of just one line. But that should at worst be a 2x inflation of the time previously spent in LWLockAcquire/Release, which is surely not yet all of the application ;-). Why the heck is this so bad? Should we expect that apparently minor changes in shared data structures might be costing equivalently huge penalties in SMP performance elsewhere? I did some oprofile work and found that the cost seems to be because (1) there's an increase in spinlock contention (more time spent in s_lock), and (2) there's more time spent in LWLockAcquire/Release. I'm not entirely clear about the reason for (1), but (2) apparently is because of the extra cache line swapping, as posited above. In the oprofile trace it's clear that the extra cost comes exactly in the statements that touch fields of the shared LWLock, which are the places where you might have to wait to acquire a cache line. I thought for a bit that the problem might come from having chosen to put a pointer to the spinlock into each LWLock; fetching that pointer implies an additional access to the contended cache line. However, changing the data structure to a simple linear array of spinlocks didn't help. So that whole idea seems to have died a painful death. One other interesting result is that the data structure change neither helped nor hurt on an EM64T machine with 2 physical (4 logical) processors. This is also x86_64, but evidently the caching behavior is totally different from Opterons. One thing that did seem to help a little bit was padding the LWLocks to 32 bytes (by default they are 24 bytes each on x86_64) and ensuring the array starts on a 32-byte boundary. 
This ensures that we won't have any LWLocks crossing cache lines --- contended access to such an LWLock would probably incur the sort of large penalty seen above, because you'd be trading two cache lines back and forth not one. It seems that the important locks are not split that way in CVS tip, because the gain wasn't much, but I wonder whether some effect like this might explain some of the unexplainable performance changes we've noticed in the past (eg, in the dbt2 results). A seemingly unrelated small change in the size of other data structures in shared memory might move things around enough to make a performance-critical lock cross a cache line boundary. On regular x86, the LWLock struct is by chance exactly 16 bytes, so there's no alignment issue. But 64-bit platforms and platforms where slock_t is bigger than char are exposed to this risk. I'm going to go ahead and make that change, since it doesn't seem likely to have any downside. It might be interesting to look into forcing proper alignment of the shared buffer headers as well. regards, tom lane
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Thu, 15 Sep 2005, Tom Lane wrote: One thing that did seem to help a little bit was padding the LWLocks to 32 bytes (by default they are 24 bytes each on x86_64) and ensuring the array starts on a 32-byte boundary. This ensures that we won't have any LWLocks crossing cache lines --- contended access to such an LWLock would probably incur the sort of large penalty seen above, because you'd be trading two cache lines back and forth not one. It seems that the important locks are not split that way in CVS tip, because the gain wasn't much, but I wonder whether some effect like this might explain some of the unexplainable performance changes we've noticed in the past (eg, in the dbt2 results). A seemingly unrelated small change in the size of other data structures in shared memory might move things around enough to make a performance-critical lock cross a cache line boundary. What about padding the LWLock to 64 bytes on these architectures. Both P4 and Opteron have 64 byte cache lines, IIRC. This would ensure that a cacheline doesn't hold two LWLocks. Gavin
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote: I'm going to go ahead and make that change, since it doesn't seem likely to have any downside. It might be interesting to look into forcing proper alignment of the shared buffer headers as well. Just catching up on your mails - all of that sounds good so far. Everything mentioned so far talks about spinlocks in the general sense, rather than with respect to particular locks. The different lock types we have are held for different amounts of time and are also subject to differing amounts of contention. I think it would be useful to put in a capability to tune each class of lock according to the possibility for delay and contention on it:

- Long delay, low contention, e.g. CheckpointLock => waiter should sleep immediately
- Medium delay, high contention, e.g. WALWriteLock => waiter spins if at head of queue, else sleeps immediately
- Short delay, high contention, e.g. BufMappingLock => waiter spins and retries forever, cos the lock is coming soon
- Short delay, low contention, sometimes long waits at end of VACUUM, e.g. FreeSpaceLock => waiter spins, then eventually sleeps

I'm not sure whether you'll agree with my characterisation of these locks, but the main thing I'm trying to get across is that one-size-fits-all isn't the optimal approach. I'm not saying either that we end up with individual characterisations for each lock type, but a few key ones could be catered for differently. Best Regards, Simon Riggs
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Gavin Sherry [EMAIL PROTECTED] writes: What about padding the LWLock to 64 bytes on these architectures. Both P4 and Opteron have 64 byte cache lines, IIRC. This would ensure that a cacheline doesn't hold two LWLocks. I tried that first, actually, but it was a net loss. I guess enlarging the array that much wastes too much cache space. regards, tom lane
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Thu, 15 Sep 2005, Tom Lane wrote: Gavin Sherry [EMAIL PROTECTED] writes: What about padding the LWLock to 64 bytes on these architectures. Both P4 and Opteron have 64 byte cache lines, IIRC. This would ensure that a cacheline doesn't hold two LWLocks. I tried that first, actually, but it was a net loss. I guess enlarging the array that much wastes too much cache space. Interesting. On Xeon (2 phys, 4 log), with LWLock padded to 64 bytes and the cmpb/jump removed I get:

[EMAIL PROTECTED] pgsqlpad]$ for i in 1 2 4; do time ./nrun.sh $i; done
real	0m54.362s	user	0m0.003s	sys	0m0.009s
real	1m9.788s	user	0m0.011s	sys	0m0.013s
real	2m8.870s	user	0m0.016s	sys	0m0.028s

[EMAIL PROTECTED] pgsqlpad]$ for i in 1 2 4; do time ./nrun.sh $i; done
real	0m55.544s	user	0m0.006s	sys	0m0.007s
real	1m9.313s	user	0m0.007s	sys	0m0.018s
real	2m1.769s	user	0m0.017s	sys	0m0.027s

This compares to the following, which is unpadded but has cmpb/jump removed but is otherwise vanilla:

1: 55s	2: 111s	4: 207s

The decrease is small, but it's there. Gavin
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Gregory Maxwell [EMAIL PROTECTED] writes: might be useful to align the structure so it always crosses two lines and measure the performance difference... the delta could be basically attributed to the cache line bouncing since even one additional bounce would overwhelm the other performance effects from the changed alignment. Good idea. I goosed the struct declaration and setup code to arrange that the BufMappingLock's spinlock and the rest of its data were in different cache lines instead of the same one. The results (still on Red Hat's 4-way Opteron):

previous best code (slock-no-cmpb and spin-delay-2):
	1 31s	2 42s	4 51s	8 100s
with LWLock padded to 32 bytes and correctly aligned:
	1 31s	2 41s	4 51s	8 97s
with LWLocks 32 bytes, but deliberately misaligned:
	1 30s	2 50s	4 102s	8 200s

There is no other reason than having to touch multiple cache lines for the second and third cases to be different: the array indexing code should be exactly the same. These last numbers are pretty close to what I got from the separated-spinlock patch:

	1 31s	2 52s	4 106s	8 213s

So it seems there's no doubt that it's the doubled cache traffic that was causing most of the problem there. regards, tom lane
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Gavin Sherry [EMAIL PROTECTED] writes: Interesting. On Xeon (2 phys, 4 log), with LWLock padded to 64 bytes and the cmpb/jump removed I get: [ 1 55s 2 69s 4 128s ] This compares to the following, which is unpadded but has cmpb/jump removed but is otherwise vanilla: 1: 55s 2: 111s 4: 207s Hmm, that's pretty significant. I tried it on a Xeon EM64T (thanks to Stephen Frost for providing access), also 2 phys 4 log, and get:

Yesterday's CVS tip:				1 32s	2 46s	4 88s	8 168s
plus no-cmpb and spindelay2:			1 32s	2 48s	4 100s	8 177s
plus just-committed code to pad LWLock to 32:	1 33s	2 50s	4 98s	8 179s
alter to pad to 64:				1 33s	2 38s	4 108s	8 180s

I don't know what to make of the 2-process time going down while 4-process goes up; that seems just weird. But both numbers are repeatable. regards, tom lane
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On 9/15/05, Tom Lane [EMAIL PROTECTED] wrote: Yesterday's CVS tip: 1 32s 2 46s 4 88s 8 168s plus no-cmpb and spindelay2: 1 32s 2 48s 4 100s 8 177s plus just-committed code to pad LWLock to 32: 1 33s 2 50s 4 98s 8 179s alter to pad to 64: 1 33s 2 38s 4 108s 8 180s I don't know what to make of the 2-process time going down while 4-process goes up; that seems just weird. But both numbers are repeatable. It is odd. In the two-process case there is, assuming random behavior, a 1/2 chance that you've already got the right line, but in the 4-process case only a 1/4 chance (since we're on a 4-way box). This would explain why we don't see as much cost in the intentionally misaligned case. You'd expect a similar pattern of improvement with the 64-byte alignment (some in the two-process case, but more in the 4-process case), but here we see more improvement in the two-way case. If I had to guess I might say that the 64-byte alignment is removing much of the unneeded line bouncing in the two-process case but is at the same time creating more risk of bouncing caused by aliasing. Since two processes have a 1/2 chance, the aliasing isn't a problem so the change is a win, but in the four-process case it's no longer a win because with aliasing there is still a lot of fighting over the cache lines even if you pack well, and the decrease in packing makes odd aliasing somewhat more likely. This might also explain why the misaligned case performed so poorly in the 4-process case, since the misalignment didn't just increase the cost 2x, it also increased the likelihood of a bogus bounce due to aliasing. If this is the case, then it may be possible through very careful memory alignment to make sure that no two high-contention locks that are likely to be contended at once share the same line (through either aliasing or through being directly within the same line).
Then again I could be completely wrong, my understanding of multiprocessor cache coherency is very limited, and I have no clue how cache aliasing fits into it... So the above is just uninformed conjecture.
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Gregory Maxwell [EMAIL PROTECTED] writes: If I had to guess I might say that the 64byte alignment is removing much of the unneeded line bouncing in the two process case but is at the same time creating more risk of bouncing caused by aliasing. It's an idea ... not sure if it's right or not. One thing I have noticed both on Xeon HT (2-physical-4-logical) and on Opteron (4 real processors) is that the 2-process times are significantly more variable than the 1, 4, or 8-process times: repeating the measurements shows variance around the 10% level for 2 processes but only around 1% for the others. What I think this means is that the kernel is scheduling the 2 processes onto 2 processors chosen-at-random, without awareness of whether those two processors are on the same chip (in the Xeon case) or have closer NUMA affinity (in the Opteron case). The other test cases are all symmetric and there's no way for the kernel to blow it too badly, but in the 2-process case it will be very visible whether the kernel understands the hardware or not. And the impression I have (which might be out of date) is that current Linux kernel releases are not very bright about these things. How does that tie into the point at hand about different results for 64-byte vs 32-byte LWLocks? Not sure, but I feel it's related somehow. Anyway, it'd be interesting to get some test results for these things on SMP machines running non-Linux kernels. Also, we ought to be more rigorous about reporting exactly which kernel we are using for any particular test. regards, tom lane
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote: But the cmpb instruction in the 8.0 version of TAS would have done that, and I think we've already established that the cmpb is a loss on most machines (except maybe single-physical-CPU Xeons). Note that this was a regular Pentium 4 system, not a Xeon. Best Regards, Michael Paesold ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
-Original Message- From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Tom Lane Sent: 13 September 2005 23:03 To: Marko Kreen Cc: pgsql-hackers@postgresql.org Subject: Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches I'm starting to think that we might have to succumb to having a compile option optimize for multiprocessor or optimize for single processor. It's pretty hard to see how we'd alter a data structure decision like this on the fly. That would be a major pita for packagers of binary distributions. Regards, Dave. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Stephen Frost [EMAIL PROTECTED] writes: * Tom Lane ([EMAIL PROTECTED]) wrote: I'm starting to think that we might have to succumb to having a compile option optimize for multiprocessor or optimize for single processor. It's pretty hard to see how we'd alter a data structure decision like this on the fly. I'd really hate to see this happen. I suspect you're thinking too hard about this. It's not like some LWLocks would need SMP spinlocks and others uniprocessor locks. And it's not like it's going to change dynamically. Wouldn't it just be a couple: if (multiprocessor) { complex version } else { simple version } Ugly to be sure but it's not going to spread to other areas of the source base and you know the if statement isn't going to be growing any more arms. It's no uglier than doing the same thing with an #ifdef anyways. If you really really want to have two versions and you think using function pointers would impose too big a performance penalty you could play linker games. But that way lies madness. -- greg ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
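[Greg's runtime branch can be sketched in a few lines. This is an illustration using C11 atomics, not PostgreSQL's actual tas(); the flag name and both code paths are assumptions. The "complex version" here is the unlocked pre-test that the thread debates (the cmpb), which only pays off on multiprocessors:]

```c
#include <stdatomic.h>
#include <stdbool.h>

static bool is_multiprocessor = true;   /* would be set once at startup */
static atomic_bool the_lock;

/* returns 0 if we acquired the spinlock, 1 if it was already held */
int tas(void)
{
    if (is_multiprocessor)
    {
        /* unlocked pre-test, analogous to the debated cmpb: a plain
         * read that skips the bus-locked op when the lock is visibly
         * taken, at the cost of an extra branch when it is free */
        if (atomic_load_explicit(&the_lock, memory_order_relaxed))
            return 1;
    }
    return atomic_exchange(&the_lock, true) ? 1 : 0;
}

void s_unlock(void)
{
    atomic_store(&the_lock, false);
}
```

[As Greg says, the branch predictor sees a never-changing condition, so the runtime cost is close to nil compared with an #ifdef.]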
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote: Michael Paesold [EMAIL PROTECTED] writes: To have other data, I have retested the patches on a single-cpu Intel P4 3GHz w/ HT (i.e. 2 virtual cpus), no EM64T. Comparing to the 2,4 dual-Xeon results it's clear that this is in reality only one cpu. While the runtime for N=1 is better than the other system, for N=4 it's already worse. The situation with the patches is quite different, though. Unfortunately.
CVS tip from 2005-09-12:  1: 36s  2: 77s (cpu ~85%)  4: 159s (cpu ~98%)
only slock-no-cmpb:       1: 36s  2: 81s (cpu ~79%)  4: 177s (cpu ~94%)  (doesn't help this time)
Hm. This is the first configuration we've seen in which slock-no-cmpb was a loss. Could you double-check that result? The first tests were compiled with CFLAGS='-O2 -mcpu=pentium4 -march=pentium4'. I had redone the tests with just CFLAGS='-O2' yesterday. The difference was just about a second, but the result from the patch was the same. The results for N=4 and N=8 show the effect more clearly. configure: CFLAGS='-O2' --enable-casserts On RHEL 4.1, gcc (GCC) 3.4.3 20050227 (Red Hat 3.4.3-22.1)
CVS tip from 2005-09-12:  1: 37s  2: 78s  4: 159s  8: 324s
only slock-no-cmpb:       1: 37s  2: 82s (5%)  4: 178s (12%)  8: 362s (12%)
configure: --enable-casserts (Btw. I have always done make clean ; make ; make install between tests.) Best Regards, Michael Paesold I can't see any reasonable way to do runtime switching of the cmpb test --- whatever logic we put in to control it would cost as much or more than the cmpb anyway :-(. I think that has to be a compile-time choice. From my perspective it'd be acceptable to remove the cmpb only for x86_64, since only there does it seem to be a really significant win. On the other hand it seems that removing the cmpb is a net win on most x86 setups too, so maybe we should just do it and accept that there are some cases where it's not perfect. How many test cases do we have so far?
Summary of the effects without the cmpb instruction seems to be:
8-way Opteron: better
Dual/HT Xeon w/o EM64T: better
Dual/HT EM64T: better for N=cpus, worse for N>cpus (Stephen's)
HT P4 w/o EM64T: worse (stronger for N>cpus)
Have I missed other reports that did test the slock-no-cmpb.patch alone? Two of the systems with positive effects are x86_64, one is an older high-end Intel x86 chip. The negative effect is on a low-cost Pentium 4 with only hyperthreading. According to the mentioned thread's title, this was an optimization for hyperthreading, not regular multi-cpus. We could have more data, especially on newer and high-end systems. Could some of you test the slock-no-cmpb.patch? You'll need an otherwise idle system to get repeatable results. http://archives.postgresql.org/pgsql-hackers/2005-09/msg00565.php http://archives.postgresql.org/pgsql-hackers/2005-09/msg00566.php I have re-attached the relevant files from Tom's posts because in the archive it's not clear anymore what should go into which file. See instructions in the first messages above. The patch applies to CVS tip with patch -p1 < slock-no-cmpb.patch Best Regards, Michael Paesold slock-no-cmpb.patch Description: Binary data test_setup.sql Description: Binary data test_run_small.sql Description: Binary data startn.sh Description: Binary data ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Tue, Sep 13, 2005 at 08:14:47PM -0400, Stephen Frost wrote: I suppose another option would be to provide separate packages... Could this be done as a shared library so it's more 'plug-and-play' to switch between the two? I dunno, just trying to think about how to deal with this without making it suck terribly bad for me (as a Debian user and maintainer). Note that the Linux kernel has played with moving spinlock code out of line. Due to the effect of having so many, it ended up that the memory saved by moving the code out of line actually benefited overall. An unconditional call to a function can't be that expensive, surely. However, it would *have* to be in the same compile unit, no shared libraries. ELF imposes some overhead for calls in different compile units, and using function pointers won't save you (see Ulrich Drepper's paper on it). However, to make it flexible you would need a pointer to a function pointer. Fortunately these variables won't change often, so the function pointer and the function itself should be in the cache if used often enough. Finally, the kernel solves the problem by saying: if you compile for uniprocessor, optimize the code away, since a lot of the issues don't apply. My main concern is how you even detect the number of processors in a portable way... Have a nice day, -- Martijn van Oosterhout kleptog@svana.org http://svana.org/kleptog/ Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a tool for doing 5% of the work and then sitting around waiting for someone else to do the other 95% so you can sue them. pgpLWptHgoZDN.pgp Description: PGP signature
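[On Martijn's closing question, processor-count detection: there is no fully portable answer, but sysconf() covers most Unixen. A small sketch; the fallback policy (assume uniprocessor) is our assumption:]

```c
#include <unistd.h>

/* Returns the number of online CPUs, or 1 if it cannot tell.
 * _SC_NPROCESSORS_ONLN is supported by Linux/glibc and the BSDs but is
 * not mandated by POSIX, hence the guard and conservative fallback. */
long detect_ncpus(void)
{
#ifdef _SC_NPROCESSORS_ONLN
    long n = sysconf(_SC_NPROCESSORS_ONLN);
    if (n > 0)
        return n;
#endif
    return 1;   /* assume uniprocessor when detection fails */
}
```

[Note this tells you how many processors are online, not how many the postmaster will actually be allowed to use, so treating it as a hint rather than a hard configuration switch seems safer.]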
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom, et al., Updated, with full recompiles between everything and the new modification: N, runtime:
Tip:              1 31s  2 37s  4 86s  8 159s
no-cmpb:          1 32s  2 43s  4 83s  8 168s
spin:             1 32s  2 51s  4 84s  8 160s
spin+mod:         1 32s  2 51s  4 89s  8 158s
spin+no-cmpb:     1 32s  2 51s  4 87s  8 163s
spin+mod+no-cmpb: 1 32s  2 50s  4 86s  8 161s
Unfortunately, the results don't seem to be terribly consistent between runs anyway. Run 2:
Tip:              1 32s  2 43s  4 87s  8 160s
no-cmpb:          1 31s  2 47s  4 83s  8 167s
spin:             1 32s  2 52s  4 88s  8 154s
spin+no-cmpb:     1 32s  2 51s  4 102s 8 166s
spin+mod:         1 32s  2 53s  4 85s  8 154s
spin+mod+no-cmpb: 1 32s  2 51s  4 91s  8 161s
Hope it helps, Stephen signature.asc Description: Digital signature
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
I wrote: We could ameliorate this if there were a way to acquire ownership of the cache line without necessarily winning the spinlock. I'm imagining that we insert a dummy locked instruction just ahead of the xchgb, which touches the spinlock in such a way as to not change its state. I tried this, using this tas code:

static __inline__ int
tas(volatile slock_t *lock)
{
	register slock_t _res = 1;
	register slock_t _dummy = 0;

	/* Use a locking test before trying to take the spinlock */
	/* xchg implies a LOCK prefix, so no need to say LOCK for it */
	__asm__ __volatile__(
		"	lock			\n"
		"	xaddb	%2,%1	\n"
		"	xchgb	%0,%1	\n"
:		"+q"(_res), "+m"(*lock), "+q"(_dummy)
:
:		"memory", "cc");
	return (int) _res;
}

At least on Opteron, it's a loser. The previous best results (with slock-no-cmpb and spin-delay patches) were
1 31s  2 42s  4 51s  8 100s
and with this instead of slock-no-cmpb,
1 33s  2 45s  4 55s  8 104s
The xadd may indeed be helping in terms of protecting the xchg from end-of-timeslice --- the rate of select() delays is really tiny, one every few seconds, which is better than I saw before. But the extra cost of the extra locked operation isn't getting repaid overall. Feel free to try it on other hardware, but it doesn't look promising. BTW, I also determined that on that 4-way Opteron box, the integer modulo idea doesn't make any difference --- that is, spin-delay and what Michael called spin-delay-2 are the same speed. I think I had tried the modulo before adding the variable spin delay, and it did help in that configuration; but most likely, it was just helping by stretching out the amount of time spent looping before entering the kernel. So we can drop that idea too. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
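[The dummy-locked-instruction idea above can be expressed portably with C11 atomics: a fetch_add of zero is a read-modify-write that leaves the lock unchanged but forces the CPU to take exclusive ownership of the cache line before the real exchange. A sketch for illustration only; this is our analogue of Tom's x86 asm, not PostgreSQL code, and it carries the same two-locked-ops cost that made the experiment a loser:]

```c
#include <stdatomic.h>

/* returns 0 if the lock was acquired, 1 if it was already held */
int tas_with_dummy(volatile atomic_char *lock)
{
    /* dummy locked op: adds 0, leaving the lock state unchanged, but
     * forcing this CPU to gain exclusive ownership of the cache line
     * (the portable counterpart of the "lock xaddb" above) */
    atomic_fetch_add(lock, 0);

    /* the real test-and-set, counterpart of the xchgb */
    return atomic_exchange(lock, 1) ? 1 : 0;
}
```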
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
I wrote: Another thought came to mind: maybe the current data layout for LWLocks is bad. Right now, the spinlock that protects each LWLock data struct is itself part of the struct, and since the structs aren't large (circa 20 bytes), the whole thing is usually all in the same cache line. ... Maybe it'd be better to allocate the spinlocks off by themselves. Well, this is odd. I made up a patch to do this (attached) and found that it pretty much sucks. Still the 4-way Opteron, previous best (slock-no-cmpb and spin-delay-2):
1 31s  2 42s  4 51s  8 100s
with lwlock-separate added:
1 31s  2 52s  4 106s  8 213s
What I had expected to see was a penalty in the single-thread case (due to instructions added to the LWLock functions), but there isn't any. I did not expect to see a factor-of-2 penalty for four threads. I guess what this means is that there's no real problem with losing the cache line while manipulating the LWLock, which is what the patch was intended to prevent. Instead, we're paying for swapping two cache lines (the spinlock and the LWLock) across processors instead of just one line. But that should at worst be a 2x inflation of the time previously spent in LWLockAcquire/Release, which is surely not yet all of the application ;-). Why the heck is this so bad? Should we expect that apparently minor changes in shared data structures might be costing equivalently huge penalties in SMP performance elsewhere? Unfortunately I don't have root on the Opteron and can't run oprofile. But I'd really like to see some oprofile stats from these two cases so we can figure out what in the world is going on here. Can anyone help? regards, tom lane

*** src/backend/storage/lmgr/lwlock.c.orig	Sat Aug 20 19:26:24 2005
--- src/backend/storage/lmgr/lwlock.c	Wed Sep 14 11:50:09 2005
***************
*** 27,35 ****
  #include "storage/spin.h"

  typedef struct LWLock
  {
! 	slock_t		mutex;			/* Protects LWLock and queue of PGPROCs */
  	bool		releaseOK;		/* T if ok to release waiters */
  	char		exclusive;		/* # of exclusive holders (0 or 1) */
  	int			shared;			/* # of shared holders (0..MaxBackends) */
--- 27,44 ----
  #include "storage/spin.h"

+ /*
+  * Each LWLock has an associated spinlock, which we consider as locking the
+  * content of the LWLock struct as well as the associated queue of waiting
+  * PGPROCs.  To minimize cache-line contention on multiprocessors, the
+  * spinlocks are kept physically separate from their LWLocks --- we don't
+  * want a spinning processor to interfere with access to the LWLock for
+  * the processor holding the spinlock.
+  */
+ 
  typedef struct LWLock
  {
! 	slock_t    *mutex;			/* Protects LWLock and queue of PGPROCs */
  	bool		releaseOK;		/* T if ok to release waiters */
  	char		exclusive;		/* # of exclusive holders (0 or 1) */
  	int			shared;			/* # of shared holders (0..MaxBackends) */
***************
*** 39,44 ****
--- 48,65 ----
  } LWLock;

  /*
+  * We allocate CACHE_LINE_SIZE bytes for the fixed LWLocks' spinlocks.
+  * We assume that the fixed locks are few enough and heavily used enough
+  * that it's worth trying to put each one's spinlock in its own cache line.
+  * The dynamic locks are assumed to be many and less heavily hit, so they
+  * get just MAXALIGN space apiece.  (Packing spinlocks tightly sounds like a
+  * bad idea on machines where slock_t is just a byte ... even for lesser-used
+  * spinlocks, there would be enough locks per cache line to create a
+  * contention issue.)
+  */
+ #define CACHE_LINE_SIZE		64
+ 
+ /*
   * This points to the array of LWLocks in shared memory. Backends inherit
   * the pointer by fork from the postmaster. LWLockIds are indexes into
   * the array.
***************
*** 135,144 ****
  	Size		size;
  	int			numLocks = NumLWLocks();

! 	/* Allocate the LWLocks plus space for shared allocation counter. */
  	size = mul_size(numLocks, sizeof(LWLock));
! 	size = add_size(size, 2 * sizeof(int));

  	return size;
  }
--- 156,174 ----
  	Size		size;
  	int			numLocks = NumLWLocks();

! 	/* Space for the array of LWLock structs. */
  	size = mul_size(numLocks, sizeof(LWLock));
! 	/* Space for dynamic allocation counter and MAXALIGN padding. */
! 	size = add_size(size, 2 * sizeof(int) + MAXIMUM_ALIGNOF);
! 
! 	/* Space for the spinlocks --- see notes for CACHE_LINE_SIZE. */
! 	size = add_size(size,
! 					mul_size((int) NumFixedLWLocks,
! 							 Max(sizeof(slock_t), CACHE_LINE_SIZE)));
! 
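[The layout the patch aims for, one heavily-used spinlock alone per cache line so that spinning on one lock cannot bounce the line holding its neighbours, can be sketched with a padded struct. The constant, struct name and array size are illustrative assumptions, not the patch's actual declarations:]

```c
#include <stdalign.h>

#define CACHE_LINE_SIZE 64      /* assumption: 64-byte lines, as on x86 */

typedef struct
{
    volatile char lock;                           /* the slock_t itself */
    char padding[CACHE_LINE_SIZE - sizeof(char)]; /* keep neighbours out */
} PaddedSpinlock;

/* one padded spinlock per fixed LWLock; dynamic locks could be packed
 * more tightly (MAXALIGN apiece), as the patch's comment describes */
static alignas(CACHE_LINE_SIZE) PaddedSpinlock fixed_spinlocks[16];
```

[As Tom's numbers go on to show, padding alone is not automatically a win: separating the spinlock from its LWLock means two lines bounce between processors instead of one.]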
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote: I wrote: We could ameliorate this if there were a way to acquire ownership of the cache line without necessarily winning the spinlock. I'm imagining that we insert a dummy locked instruction just ahead of the xchgb, which touches the spinlock in such a way as to not change its state. I tried this, using this tas code: ... I have tried it on the P4 w/ HT. CVS tip 1: 37s 2: 78s 4: 159s 8: 324 only slock-no-cmpb 1: 37s 2: 82s 4: 178s 8: 362 only dummy-locking 1: 44s 2: 96s 4: 197s ... I guess there is no reason to try any more. Best Regards, Michael Paesold ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: The second reason that the futex patch is helping is that when a spinlock delay does occur, it allows the delaying process to be awoken almost immediately, rather than delaying 10 msec or more as the existing code does. However, given that we are only expecting the spinlock to be held for a couple dozen instructions, using the kernel futex mechanism is huge overkill --- the in-kernel overhead to manage the futex state is almost certainly several orders of magnitude more than the delay we actually want. Why do you think so? AFAIK in the uncontended case there will be no kernel access, only an atomic inc/dec. In the contended case you'll want a task switch anyway, so the futex managing should not matter. Also this mechanism is specifically optimized for inter-process locking, so I don't think you can get a more efficient mechanism out of the side-effects of generic syscalls. If you don't want Linux-specific locking in core code, then it's another matter... 1. Use sched_yield() if available: it does just what we want, ie, yield the processor without forcing any useless time delay before we can be rescheduled. This doesn't exist everywhere but it exists in recent Linuxen, so I tried it. It made for a beautiful improvement in my test case numbers: CPU utilization went to 100% and the context swap rate to almost nil. Unfortunately, I also saw fairly frequent stuck spinlock panics when running more queries than there were processors --- this despite increasing NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect. (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 10000 yields without running the guy who's holding the spinlock.) This is intended behaviour of sched_yield.
http://lwn.net/Articles/31462/ http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2 -- marko ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
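[Marko's point about the uncontended path staying in user space can be made concrete with a minimal futex sketch. Linux-only, and the wrapper names are ours; a real futex lock also needs waiter accounting and a wake on release, which is omitted here. The key property: acquiring a free lock is one compare-and-swap with no syscall, and the kernel is entered only to sleep on contention:]

```c
#include <errno.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* sleeps only if *addr still equals expected; otherwise the kernel
 * returns immediately with EAGAIN, which closes the lost-wakeup race */
static long futex_wait(atomic_int *addr, int expected)
{
    return syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

/* wakes up to nwaiters processes sleeping on addr; returns count woken */
static long futex_wake(atomic_int *addr, int nwaiters)
{
    return syscall(SYS_futex, addr, FUTEX_WAKE, nwaiters, NULL, NULL, 0);
}

/* uncontended acquire: one compare-and-swap, no kernel entry at all;
 * returns 0 on success, 1 if the lock is held (caller would futex_wait) */
int futex_trylock(atomic_int *lock)
{
    int expected = 0;
    return atomic_compare_exchange_strong(lock, &expected, 1) ? 0 : 1;
}
```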
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
I wrote: I'll do tomorrow morning (CEST, i.e. in about 11 hours). These are the tests with the change: if ((--spins % MAX_SPINS_PER_DELAY) == 0) to if (--spins == 0) I have called the resulting patch (spin-delay + this change) spin-delay-2.
again with only slock-no-cmpb applied:      1: 55s  4: 119s (cpu ~97%)
with only spin-delay-2 applied:             1: 56s  4: 125s (cpu ~99.5%)
with slock-no-cmpb and spin-delay-2 applied: 1: 56s  4: 125s (cpu ~100%)
(Note that the cpu averages are not calculated but estimated by looking at vmstat.) Yesterday's results:
CVS tip from 2005-09-12 ~16:00:         1: 57s  2: 82s  4: 124s  8: 237s
with only slock-no-cmpb.patch applied:  1: 55s  2: 79s  4: 119s  8: 229s
with only spin-delay.patch applied:     1: 56s  2: 79s  4: 124s  8: 235s
with both patches applied:              1: 55s  2: 78s  4: 124s  8: 235s
To have other data, I have retested the patches on a single-cpu Intel P4 3GHz w/ HT (i.e. 2 virtual cpus), no EM64T. Comparing to the 2,4 dual-Xeon results it's clear that this is in reality only one cpu. While the runtime for N=1 is better than the other system, for N=4 it's already worse. The situation with the patches is quite different, though. Unfortunately.
CVS tip from 2005-09-12:         1: 36s  2: 77s (cpu ~85%)   4: 159s (cpu ~98%)
only slock-no-cmpb:              1: 36s  2: 81s (cpu ~79%)   4: 177s (cpu ~94%)  (doesn't help this time)
only spin-delay:                 1: 36s  2: 86s (cpu =100%)  4: 157s (cpu =100%) (bad runtime for N=2 (repeatable), cpu not doing real work here?)
slock-no-cmpb and spin-delay:    1: 36s  2: 106s (cpu =100%) 4: 192s (cpu =100%) (it gets worse)
only spin-delay-2:               1: 36s  2: 85s (cpu =100%)  4: 160s (cpu =100%) (quite the same as spin-delay)
slock-no-cmpb and spin-delay-2:  1: 36s  2: 109s (cpu =100%) 4: 198s (cpu =100%) (worse again)
CS rate was low (20-50) for all tests, increasing for N>2, which has to be expected. For this system I see no case for applying these patches. Is there a portable way to detect the CPU we are running on? Do you have any other idea how to implement the delays?
Best Regards, Michael Paesold ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Marko Kreen marko@l-t.ee writes: (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 10000 yields without running the guy who's holding the spinlock.) Maybe it should try sched_yield once and then use select after that? -- greg ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Marko Kreen marko@l-t.ee writes: On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: However, given that we are only expecting the spinlock to be held for a couple dozen instructions, using the kernel futex mechanism is huge overkill --- the in-kernel overhead to manage the futex state is almost certainly several orders of magnitude more than the delay we actually want. Why do you think so? AFAIK in the uncontended case there will be no kernel access, only an atomic inc/dec. In the uncontended case, we never even enter s_lock() and so the entire mechanism of yielding is irrelevant. The problem that's being exposed by these test cases is that on multiprocessors, you can see a significant rate of spinlock contention (order of 100 events/second, which is still a tiny fraction of the number of TAS calls) and our existing mechanism for dealing with contention is just not efficient enough. In the contended case you'll want a task switch anyway, so the futex managing should not matter. No, we DON'T want a task switch. That's the entire point: in a multiprocessor, it's a good bet that the spinlock is held by a task running on another processor, and doing a task switch will take orders of magnitude longer than just spinning until the lock is released. You should yield only after spinning long enough to make it a strong probability that the spinlock is held by a process that's lost the CPU and needs to be rescheduled. If you don't want Linux-specific locking in core code, then it's another matter... Well, it's true, we don't particularly want a one-platform solution, but if it did what we wanted we might hold our noses and use it anyway. (I think, BTW, that using futexes at the spinlock level is misguided; what would be interesting would be to see if we could substitute for both LWLock and spinlock logic with one futex-based module.)
I also saw fairly frequent stuck spinlock panics when running more queries than there were processors --- this despite increasing NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect. (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 10000 yields without running the guy who's holding the spinlock.) This is intended behaviour of sched_yield. http://lwn.net/Articles/31462/ http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2 No; that page still says specifically So a process calling sched_yield() now must wait until all other runnable processes in the system have used up their time slices before it will get the processor again. I can prove that that is NOT what happens, at least not on a multi-CPU Opteron with current FC4 kernel. However, if the newer kernels penalize a process calling sched_yield as heavily as this page claims, then it's not what we want anyway ... regards, tom lane ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Marko Kreen marko@l-t.ee writes: On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: The second reason that the futex patch is helping is that when a spinlock delay does occur, it allows the delaying process to be awoken almost immediately, rather than delaying 10 msec or more as the existing code does. However, given that we are only expecting the spinlock to be held for a couple dozen instructions, using the kernel futex mechanism is huge overkill --- the in-kernel overhead to manage the futex state is almost certainly several orders of magnitude more than the delay we actually want. Why do you think so? AFAIK in the uncontended case there will be no kernel access, only an atomic inc/dec. In the contended case you'll want a task switch anyway, so the futex managing should not matter. Also this mechanism is specifically optimized for inter-process locking, so I don't think you can get a more efficient mechanism out of the side-effects of generic syscalls. If you don't want Linux-specific locking in core code, then it's another matter... How does the futex-using code compare with the new patches on this new benchmark test? -- greg ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Greg Stark [EMAIL PROTECTED] writes: Marko Kreen marko@l-t.ee writes: (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 10000 yields without running the guy who's holding the spinlock.) Maybe it should try sched_yield once and then use select after that? I tried that, actually, but it didn't seem to offer any particular benefit. (Maybe it would have helped more on older Linux kernels before they changed sched_yield?) I'm feeling even more disenchanted with sched_yield now that Marko pointed out that the behavior was changed recently. Here we have a construct that is not portable cross-platform, does not act as documented in its man page, and the kernel guys feel free to whack its behavior around in major ways without documenting that either. It seems to be a crap-shoot whether it will be useful on any particular machine or not. At least with the select() code we can be reasonably confident we know what will happen. regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane [EMAIL PROTECTED] writes: On contented case you'll want task switch anyway, so the futex managing should not matter. No, we DON'T want a task switch. That's the entire point: in a multiprocessor, it's a good bet that the spinlock is held by a task running on another processor, and doing a task switch will take orders of magnitude longer than just spinning until the lock is released. You should yield only after spinning long enough to make it a strong probability that the spinlock is held by a process that's lost the CPU and needs to be rescheduled. Does the futex code make any attempt to record the CPU of the process grabbing the lock? Clearly it wouldn't be a guarantee of anything but if it's only used for short-lived spinlocks while acquiring longer lived locks then maybe? No; that page still says specifically So a process calling sched_yield() now must wait until all other runnable processes in the system have used up their time slices before it will get the processor again. I can prove that that is NOT what happens, at least not on a multi-CPU Opteron with current FC4 kernel. However, if the newer kernels penalize a process calling sched_yield as heavily as this page claims, then it's not what we want anyway ... Well it would be no worse than select or any other random i/o syscall. It seems to me what you've found is an outright bug in the linux scheduler. Perhaps posting it to linux-kernel would be worthwhile. -- greg ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom, All: It seems to me what you've found is an outright bug in the linux scheduler. Perhaps posting it to linux-kernel would be worthwhile. For people using this on Linux 2.6, which scheduler are you using? Deadline is the recommended one for databases, and does offer significant (+5-8%) benefits in some heavy-contention environments, at least in OSDL tests. -- Josh Berkus Aglio Database Solutions San Francisco ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Josh Berkus josh@agliodbs.com writes: Tom, All: It seems to me what you've found is an outright bug in the linux scheduler. Perhaps posting it to linux-kernel would be worthwhile. For people using this on Linux 2.6, which scheduler are you using? Deadline is the recommended one for databases, and does offer significant (+5-8%) benefits in some heavy-contention environments, at least in OSDL tests. I thought 'deadline' was an I/O scheduler, not a CPU scheduler? -Doug ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Greg Stark [EMAIL PROTECTED] writes: Tom Lane [EMAIL PROTECTED] writes: No; that page still says specifically So a process calling sched_yield() now must wait until all other runnable processes in the system have used up their time slices before it will get the processor again. I can prove that that is NOT what happens, at least not on a multi-CPU Opteron with current FC4 kernel. However, if the newer kernels penalize a process calling sched_yield as heavily as this page claims, then it's not what we want anyway ... Well it would be no worse than select or any other random i/o syscall. It seems to me what you've found is an outright bug in the linux scheduler. Perhaps posting it to linux-kernel would be worthwhile. People have complained on l-k several times about the 2.6 sched_yield() behavior; the response has basically been if you rely on any particular sched_yield() behavior for synchronization, your app is broken--it's not a synchronization primitive. -Doug ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Douglas McNaught [EMAIL PROTECTED] writes: It seems to me what you've found is an outright bug in the linux scheduler. Perhaps posting it to linux-kernel would be worthwhile. People have complained on l-k several times about the 2.6 sched_yield() behavior; the response has basically been if you rely on any particular sched_yield() behavior for synchronization, your app is broken--it's not a synchronization primitive. They're not talking about this. They're talking about applications that spin on sched_yield() and expect it to reduce cpu load as if the process were calling sleep(). What Tom found was that some processes are never scheduled when sched_yield is called. There's no reason that should be happening. -- greg ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Greg Stark [EMAIL PROTECTED] writes: What Tom found was that some processes are never scheduled when sched_yield is called. There's no reason that should be happening. Yeah, that would probably be a bug... -Doug ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Tue, 13 Sep 2005 12:21:45 -0400 Douglas McNaught [EMAIL PROTECTED] wrote: Josh Berkus josh@agliodbs.com writes: Tom, All: It seems to me what you've found is an outright bug in the linux scheduler. Perhaps posting it to linux-kernel would be worthwhile. For people using this on Linux 2.6, which scheduler are you using? Deadline is the recommended one for databases, and does offer significant (+5-8%) benefits in some heavy-contention environments, at least in OSDL tests. I thought 'deadline' was an I/O scheduler, not a CPU scheduler? Yes, that's correct. That's an i/o elevator algorithm. Mark ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Douglas McNaught [EMAIL PROTECTED] writes: Greg Stark [EMAIL PROTECTED] writes: What Tom found was that some processes are never scheduled when sched_yield is called. There's no reason that should be happening. Yeah, that would probably be a bug... I suspect the kernel hackers might consider it deliberate in a NUMA-aware kernel. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Tom Lane ([EMAIL PROTECTED]) wrote: I'm feeling even more disenchanted with sched_yield now that Marko pointed out that the behavior was changed recently. Here we have a To be fair, I'm not entirely sure 'recently' is quite the right word. It sounds like it changed during the 2.5 development cycle, which was between major kernel releases, and looks like it was around Dec 2003. Thanks, Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Michael Paesold [EMAIL PROTECTED] writes: To have other data, I have retested the patches on a single-cpu Intel P4 3GHz w/ HT (i.e. 2 virtual cpus), no EM64T. Comparing to the 2.4 GHz dual-Xeon results it's clear that this is in reality only one cpu. While the runtime for N=1 is better than the other system, for N=4 it's already worse. The situation with the patches is quite different, though. Unfortunately. CVS tip from 2005-09-12: 1: 36s 2: 77s (cpu ~85%) 4: 159s (cpu ~98%) only slock-no-cmpb: 1: 36s 2: 81s (cpu ~79%) 4: 177s (cpu ~94%) (doesn't help this time) Hm. This is the first configuration we've seen in which slock-no-cmpb was a loss. Could you double-check that result? Is there a portable way to detect the CPU we are running on? Do you have any other idea how to implement the delays? I can't see any reasonable way to do runtime switching of the cmpb test --- whatever logic we put in to control it would cost as much or more than the cmpb anyway :-(. I think that has to be a compile-time choice. From my perspective it'd be acceptable to remove the cmpb only for x86_64, since only there does it seem to be a really significant win. On the other hand it seems that removing the cmpb is a net win on most x86 setups too, so maybe we should just do it and accept that there are some cases where it's not perfect. Looking back at the original discussion, I see that the cmpb was added with little discussion and no specific performance testing, so it shouldn't be presumed to have a lot of weight behind it: http://archives.postgresql.org/pgsql-patches/2003-12/msg00352.php The other thing that's going on in the patch is dynamic adjustment of spins_per_delay, and that could certainly be made much smarter than it is now. For instance, it'd cost almost nothing to make the lower and upper bounds for spins_per_delay be variables instead of constants. We could set the bounds during postmaster startup based on some criteria yet to be determined.
regards, tom lane ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
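The variable-bounds idea at the end of the message above could be sketched roughly as follows. This is a hypothetical illustration, not PostgreSQL's actual code: the names, constants, and the adjustment heuristic are all invented; the real logic lives in s_lock.c.

```c
/* Hypothetical sketch of runtime-adjustable spins_per_delay bounds.
 * All names and constants here are invented for illustration. */
static int min_spins_per_delay = 10;    /* could be set once at postmaster startup */
static int max_spins_per_delay = 1000;  /* instead of being compile-time constants */

static int spins_per_delay = 100;

/* Called after each spinlock acquisition: if we never had to sleep,
 * spinning is paying off, so spin a bit longer next time; if we did
 * sleep, back off quickly.  Clamp to the runtime-settable bounds. */
static void
update_spins_per_delay(int had_to_sleep)
{
    if (had_to_sleep)
        spins_per_delay -= 10;
    else
        spins_per_delay += 1;

    if (spins_per_delay < min_spins_per_delay)
        spins_per_delay = min_spins_per_delay;
    if (spins_per_delay > max_spins_per_delay)
        spins_per_delay = max_spins_per_delay;
}
```

The only interesting change is that min_/max_spins_per_delay are ordinary variables a startup routine could set (e.g. from the detected CPU count), rather than constants baked in at compile time.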
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Tue, Sep 13, 2005 at 10:10:13AM -0400, Tom Lane wrote: Marko Kreen marko@l-t.ee writes: On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: However, given that we are only expecting the spinlock to be held for a couple dozen instructions, using the kernel futex mechanism is huge overkill --- the in-kernel overhead to manage the futex state is almost certainly several orders of magnitude more than the delay we actually want. Why do you think so? AFAIK in the uncontended case there will be no kernel access, only atomic inc/dec. In the uncontended case, we never even enter s_lock() and so the entire mechanism of yielding is irrelevant. The problem that's being exposed by these test cases is that on multiprocessors, you can see a significant rate of spinlock contention (order of 100 events/second, which is still a tiny fraction of the number of TAS calls) and our existing mechanism for dealing with contention is just not efficient enough. In the contended case you'll want a task switch anyway, so the futex managing should not matter. No, we DON'T want a task switch. That's the entire point: in a multiprocessor, it's a good bet that the spinlock is held by a task running on another processor, and doing a task switch will take orders of magnitude longer than just spinning until the lock is released. You should yield only after spinning long enough to make it a strong probability that the spinlock is held by a process that's lost the CPU and needs to be rescheduled. Hmm. I guess this could be separated into 2 cases: 1. Light load - both lock owner and lock requester won't get scheduled while busy (owner in critical section, waiter spinning.) 2. Big load - either or both of them gets scheduled while busy. (waiter is scheduled by OS or voluntarily by e.g. calling select()) So my impression is that currently you optimize for #1 at the expense of #2, while with futexes you'd optimize for #2 at the expense of #1.
Additionally I'm pretty convinced that futexes give you the most efficient implementation for #2, as the kernel knows which processes are waiting on a particular lock so it can make the best scheduling decisions. If you don't want Linux-specific locking in core code, then it's another matter... Well, it's true, we don't particularly want a one-platform solution, but if it did what we wanted we might hold our noses and use it anyway. (I think, BTW, that using futexes at the spinlock level is misguided; what would be interesting would be to see if we could substitute for both LWLock and spinlock logic with one futex-based module.) Use pthreads ;) I also saw fairly frequent stuck spinlock panics when running more queries than there were processors --- this despite increasing NUM_DELAYS to 10000 in s_lock.c. So I don't trust sched_yield anymore. Whatever it's doing in Linux 2.6 isn't what you'd expect. (I speculate that it's set up to only yield the processor to other processes already affiliated to that processor. In any case, it is definitely capable of getting through 10000 yields without running the guy who's holding the spinlock.) This is intended behaviour of sched_yield. http://lwn.net/Articles/31462/ http://marc.theaimsgroup.com/?l=linux-kernel&m=112432727428224&w=2 No; that page still says specifically So a process calling sched_yield() now must wait until all other runnable processes in the system have used up their time slices before it will get the processor again. I can prove that that is NOT what happens, at least not on a multi-CPU Opteron with current FC4 kernel. However, if the newer kernels penalize a process calling sched_yield as heavily as this page claims, then it's not what we want anyway ... My fault. As I saw that there was a problem with sched_yield, I said I bet this is because of a behaviour change and only skimmed the exact details. But the point that sched_yield is not meant for such usage still stands.
About fast yielding, the comment on sys_sched_yield() says: * sys_sched_yield - yield the current processor to other threads. * * this function yields the current CPU by moving the calling thread * to the expired array. If there are no other threads running on this * CPU then this function will return. So there is just nothing else to schedule on that CPU. -- marko ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
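Marko's futex suggestion can be sketched along the lines of Ulrich Drepper's "Futexes Are Tricky" mutex. This is a hedged, Linux-only illustration using GCC __atomic builtins and the raw futex syscall; it is not proposed PostgreSQL code, and all names here are invented. The point it demonstrates is the one made above: the uncontended paths never enter the kernel.

```c
/* Linux-only sketch of a futex-based lock (after Drepper's mutex).
 * Lock states: 0 = free, 1 = held, 2 = held with waiters. */
#define _GNU_SOURCE
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static int
cmpxchg(int *p, int oldv, int newv)
{
    int expected = oldv;
    __atomic_compare_exchange_n(p, &expected, newv, 0,
                                __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
    return expected;            /* old value: == oldv iff the swap happened */
}

static void
futex_lock(int *f)
{
    int c = cmpxchg(f, 0, 1);   /* fast path: 0 -> 1, no kernel entry */
    if (c != 0)
    {
        do {
            /* mark the lock contended, then sleep until woken */
            if (c == 2 || cmpxchg(f, 1, 2) != 0)
                syscall(SYS_futex, f, FUTEX_WAIT, 2, NULL, NULL, 0);
        } while ((c = cmpxchg(f, 0, 2)) != 0);
    }
}

static void
futex_unlock(int *f)
{
    /* fast path: no waiters (state 1) means no syscall at all */
    if (__atomic_exchange_n(f, 0, __ATOMIC_RELEASE) == 2)
        syscall(SYS_futex, f, FUTEX_WAKE, 1, NULL, NULL, 0);
}
```

This is the trade-off discussed above: the kernel only gets involved once the lock is actually contended, at which point it can pick which waiter to run, at the cost of a full sleep/wake cycle where PostgreSQL's spinlock would prefer to keep spinning.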
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Marko Kreen marko@l-t.ee writes: Hmm. I guess this could be separated into 2 cases: 1. Light load - both lock owner and lock requester wont get scheduled while busy (owner in critical section, waiter spinning.) 2. Big load - either or both of them gets scheduled while busy. (waiter is scheduled by OS or voluntarily by eg. calling select()) Don't forget that the coding rules for our spinlocks say that you mustn't hold any such lock for more than a couple dozen instructions, and certainly any kernel call while holding the lock is Right Out. There is *no* case where the holder of a spinlock is going to voluntarily give up the CPU. The design intention was that the odds of losing the CPU while holding a spinlock would be negligibly small, simply because we don't hold it very long. About fast yielding, comment on sys_sched_yield() says: * sys_sched_yield - yield the current processor to other threads. * * this function yields the current CPU by moving the calling thread * to the expired array. If there are no other threads running on this * CPU then this function will return. Mph. So that's pretty much exactly what I suspected... I just had a thought: it seems that the reason we are seeing a significant issue here is that on SMP machines, the cost of trading exclusively-owned cache lines back and forth between processors is so high that the TAS instructions (specifically the xchgb, in the x86 cases) represent a significant fraction of backend execution time all by themselves. (We know this is the case due to oprofile results, see discussions from last April.) What that means is that there's a fair chance of a process losing its timeslice immediately after the xchgb. Which is precisely the scenario we do not want, if the process successfully acquired the spinlock by means of the xchgb. We could ameliorate this if there were a way to acquire ownership of the cache line without necessarily winning the spinlock. 
I'm imagining that we insert a dummy locked instruction just ahead of the xchgb, which touches the spinlock in such a way as to not change its state. (xchgb won't do for this, but maybe one of the other lockable instructions will.) We do the xchgb just after this one. The idea is that if we don't own the cache line, the first instruction causes it to be faulted into the processor's cache, and if our timeslice expires while that is happening, we lose the processor without having acquired the spinlock. This assumes that once we've got the cache line, the xchgb that actually does the work can get executed with not much extra time spent and only low probability of someone else stealing the cache line back first. The fact that cmpb isn't helping proves that getting the cache line in a read-only fashion does *not* do enough to protect the xchgb in this way. But maybe another locking instruction would. Comments? regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
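Tom's dummy-locked-instruction idea can be rendered portably with GCC builtins rather than the inline asm he describes (in real TAS asm it would be something like a locked add of zero on the lock byte just before the xchgb). The function name and the slock_t typedef here are illustrative, not PostgreSQL's actual TAS:

```c
/* Hypothetical illustration of the "grab the cache line first" idea.
 * The locked add of zero is a dummy atomic write: it forces the cache
 * line into exclusive state without changing the lock's value, so if
 * the process loses its timeslice right there, it has not yet acquired
 * the spinlock.  The real test-and-set follows immediately. */
typedef volatile unsigned char slock_t;

static int
tas_with_line_prefetch(slock_t *lock)
{
    /* dummy locked no-op write: faults the line in exclusively */
    __sync_fetch_and_add(lock, 0);

    /* the actual acquisition attempt (the xchgb equivalent) */
    return __sync_lock_test_and_set(lock, 1) != 0;  /* nonzero = already held */
}
```

The contrast with the cmpb prefetch is exactly the point of the message above: cmpb pulls the line in shared (read-only) state, which evidently isn't enough, while a locked write pulls it in exclusive state.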
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
I wrote: We could ameliorate this if there were a way to acquire ownership of the cache line without necessarily winning the spinlock. Another thought came to mind: maybe the current data layout for LWLocks is bad. Right now, the spinlock that protects each LWLock data struct is itself part of the struct, and since the structs aren't large (circa 20 bytes), the whole thing is usually all in the same cache line. This had seemed like a good idea at the time, on the theory that once you'd obtained the spinlock you'd also have pulled in the LWLock contents. But that's only optimal under an assumption of low contention. If there's high contention for the spinlock, then another processor spinning on the lock will be continuously taking away ownership of the cache line and thus slowing down the guy who's got the lock and is trying to examine/update the LWLock contents. Maybe it'd be better to allocate the spinlocks off by themselves. Then, spinning processors would not affect the processor that's updating the LWLock; only when it finishes doing that and needs to clear the spinlock will it have to contend with the spinners for the cache line containing the spinlock. This would add an instruction or so to LWLockAcquire and LWLockRelease, and would be of no benefit on uniprocessors, but it might be worth doing for multiprocessors. Another patch to test ... I'm starting to think that we might have to succumb to having a compile option optimize for multiprocessor or optimize for single processor. It's pretty hard to see how we'd alter a data structure decision like this on the fly. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
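The two layouts under discussion can be sketched as below. Field names only approximate the 8.1-era LWLock, and the padding scheme is an assumption about how "allocate the spinlocks off by themselves" might look; it is not the actual patch.

```c
/* Sketch of the two LWLock layouts being compared (illustrative only). */
#include <stddef.h>

#define CACHE_LINE_SIZE 64

typedef unsigned char slock_t;

/* Current layout: the protecting spinlock is embedded in the struct,
 * so the lock word and the state it protects share a cache line. */
typedef struct LWLockEmbedded
{
    slock_t mutex;          /* protects the fields below */
    char    exclusive;      /* # of exclusive holders (0 or 1) */
    int     shared;         /* # of shared holders */
    void   *head;           /* head of wait queue */
    void   *tail;           /* tail of wait queue */
} LWLockEmbedded;

/* Proposed layout: spinlocks live in their own padded array, so that
 * spinning processors hammer a different cache line than the one the
 * lock holder is updating. */
typedef struct PaddedSlock
{
    slock_t mutex;
    char    pad[CACHE_LINE_SIZE - sizeof(slock_t)];
} PaddedSlock;

typedef struct LWLockSeparate
{
    PaddedSlock *mutex;     /* points into the separate spinlock array */
    char    exclusive;
    int     shared;
    void   *head;
    void   *tail;
} LWLockSeparate;
```

The cost Tom mentions is visible here: LWLockAcquire/Release must chase one extra pointer, and releasing still ends with a write to the spinlock's cache line, which is where the patch turned out to lose.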
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Tue, 13 Sep 2005 Tom Lane wrote: I wrote: We could ameliorate this if there were a way to acquire ownership of the cache line without necessarily winning the spinlock. Another thought came to mind: maybe the current data layout for LWLocks is bad. Right now, the spinlock that protects each LWLock data struct is itself part of the struct, and since the structs aren't large (circa 20 bytes), the whole thing is usually all in the same cache line. This had seemed like a good idea at the time, on the theory that once you'd obtained the spinlock you'd also have pulled in the LWLock contents. But that's only optimal under an assumption of low contention. If there's high contention for the spinlock, then another processor spinning on the lock will be continuously taking away ownership of the cache line and thus slowing down the guy who's got the lock and is trying to examine/update the LWLock contents. If this were the case, perhaps fetching the spin lock with read-only permission first should have helped. Modern processors have store buffers so that a store miss does not slow down the processor. Therefore, if processor A has the spin lock and is examining and updating the LWLock on the same cache line, then as long as processor B doesn't try to write the same cache line, processor A won't be slowed down. What would happen is that multiple writes coalesce in the store buffer and are flushed to memory when processor A regains write permission. I still think your first scenario is possible. That's when, just as processor B finally gets the lock, its time slice runs out. This sounds to me likely to cause a chain of context switches. Maybe it'd be better to allocate the spinlocks off by themselves. Then, spinning processors would not affect the processor that's updating the LWLock; only when it finishes doing that and needs to clear the spinlock will it have to contend with the spinners for the cache line containing the spinlock.
This would add an instruction or so to LWLockAcquire and LWLockRelease, and would be of no benefit on uniprocessors, but it might be worth doing for multiprocessors. Another patch to test ... I'm starting to think that we might have to succumb to having a compile option optimize for multiprocessor or optimize for single processor. It's pretty hard to see how we'd alter a data structure decision like this on the fly. regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Tom Lane ([EMAIL PROTECTED]) wrote: I'm starting to think that we might have to succumb to having a compile option optimize for multiprocessor or optimize for single processor. It's pretty hard to see how we'd alter a data structure decision like this on the fly. I'd really hate to see this happen. In this case I don't think the change you're proposing would have much of an impact on a uniprocessor machine. Having separate compile-time options for uniprocessor and multiprocessor would open the gates for potentially other changes which *would* have a more serious impact on one or the other when compiled for the opposite. I think this would be a serious problem for binary distributions and correspondingly their users. Happy to test these changes, btw. Also, I'm redoing my tests from the other patches with all the various combinations; I'm a bit concerned that the recompiles I did (which were just 'make', not 'make clean', 'make') for them previously didn't actually recompile everything necessary. Sorry if the results differ. Hope to have that all done this evening. Thanks, Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Tue, 13 Sep 2005, Stephen Frost wrote: * Tom Lane ([EMAIL PROTECTED]) wrote: I'm starting to think that we might have to succumb to having a compile option optimize for multiprocessor or optimize for single processor. It's pretty hard to see how we'd alter a data structure decision like this on the fly. I'd really hate to see this happen. In this case I don't think the change you're proposing would have much of an impact on a uniprocessor machine. Having separate compile-time options for uniprocessor and multiprocessor would open the gates for potentially other changes which *would* have a more serious impact on one or the other when compiled for the opposite. I think this would be a serious problem for binary distributions and correspondingly their users. It does make it painful for distribution/package maintainers but I think the potential benefits for single/multi-CPU architectures are high. It means that our lock intrinsic on uniprocessors can just be a lock/delay loop without any spinning -- which is a complete waste of time if you only have one CPU. Doing this at runtime involves some pretty significant uglification of low-level code, I think. Gavin ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Gavin Sherry ([EMAIL PROTECTED]) wrote: It does make it painful for distribution/package maintainers but I think the potential benefits for single/multi-CPU architectures are high. It means that our lock intrinsic on uniprocessors can just be a lock/delay loop without any spinning -- which is a complete waste of time if you only have one CPU. Doing this at runtime involves some pretty significant uglification of low-level code, I think. I suspect distributors would go for the multi-cpu setup (especially if a uniprocessor build is *broken* for multiprocessor) and then in a lot of cases you end up not actually getting any benefit. I'm afraid you'd also end up having to tell a lot of people who complain to recompile, who will then complain back to their distributors, etc. I suppose another option would be to provide separate packages... Could this be done as a shared library so it's more 'plug-and-play' to switch between the two? I dunno, just trying to think about how to deal with this without making it suck terribly bad for me (as a Debian user and maintainer). Certainly makes me wish there was a way to do it which made the kernel handle the uniprocessor/multiprocessor question and did locking as appropriate for those. Ah, well. Thanks, Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Min Xu (Hsu) [EMAIL PROTECTED] writes: ...If this were the case, perhaps first fetch the spin lock with read-only permission should have helped. But the cmpb instruction in the 8.0 version of TAS would have done that, and I think we've already established that the cmpb is a loss on most machines (except maybe single-physical-CPU Xeons). I suggested in my other message that it might help to grab write permission on the cache line before actually trying to acquire the spin lock --- but I don't see how getting read-only permission first will help. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Stephen Frost [EMAIL PROTECTED] writes: I suspect distributors would go for the multi-cpu setup (especially if a uniprocessor build is *broken* for multiprocessor) and then in a lot of cases you end up not actually getting any benefit. I'm afraid you'd also end up having to tell alot of people who complain to recompile, who will then complain back to their distributors, etc. Yeah. Being in charge of Red Hat's packaging of PG, I feel that pain as keenly as anybody ... and I *know* RH will not be interested in shipping two different packages. If we go this way, the RH distributions will use the --optimize-multi switch, because that's where the money is. The bottom line here is that we will have to make some compromises: if we want one-size-fits-all code, it will not be optimal for every single architecture. If we don't do one-size-fits-all, then we will pay for it in various other ways. regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Tue, 13 Sep 2005 Tom Lane wrote: Min Xu (Hsu) [EMAIL PROTECTED] writes: ...If this were the case, perhaps first fetch the spin lock with read-only permission should have helped. But the cmpb instruction in the 8.0 version of TAS would have done that, and I think we've already established that the cmpb is a loss on most machines (except maybe single-physical-CPU Xeons). I suggested in my other message that it might help to grab write permission on the cache line before actually trying to acquire the spin lock --- but I don't see how getting read-only permission first will help. Yes, I agree. What I was trying to say was that if the second scenario you hypothesize were true, i.e. if fetching a writeable line before actually trying to acquire the spin lock would cause another processor to slow down its execution inside the critical section, then fetching read-only lines would have helped. As you said, however, experimental results show that fetching read-only lines didn't help, which led me to wonder whether the second scenario you described was really happening. ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Min Xu (Hsu) [EMAIL PROTECTED] writes: ... As you said, however, experimental results shows fetching read-only lines didn't help, which led me wonder whether the second scenario your described was really happening. I don't know --- we haven't tried it. I do intend to work up some patches for these ideas, and then we can test them. Or, if you happen to be wide awake at the moment, feel free to take a shot at it ... I'm out of steam and won't get to it before tomorrow... regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Tue, 13 Sep 2005 Tom Lane wrote: Min Xu (Hsu) [EMAIL PROTECTED] writes: ... As you said, however, experimental results shows fetching read-only lines didn't help, which led me wonder whether the second scenario your described was really happening. I don't know --- we haven't tried it. I do intend to work up some patches for these ideas, and then we can test them. I see. I look forward to it. Or, if you happen to be wide awake at the moment, feel free to take a shot at it ... I'm out of steam and won't get to it before tomorrow... I am a newbie here. Please don't let my comments discourage you from doing anything. I very much enjoyed following your discussion and analysis of this problem. ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote: Comments and testing invited. I have tested the patches on a dual Xeon 2.4 GHz w/ HT (no EM64T). (Configured with CFLAGS='-O2 -mcpu=pentium4 -march=pentium4' --enable-cassert). The results were pretty stable (around 0.2 seconds). I would not trust the numbers for N=2; Linux, at least 2.4, is not good at avoiding scheduling two running processes on two different HTs of the same core. Those values also had the most variance (> 1s). All other measures were quite stable over several runs. CVS tip from 2005-09-12 ~16:00: 1: 57s 2: 82s 4: 124s 8: 237s with only slock-no-cmpb.patch applied: 1: 55s 2: 79s 4: 119s 8: 229s with only spin-delay.patch applied: 1: 56s 2: 79s 4: 124s 8: 235s with both patches applied: 1: 55s 2: 78s 4: 124s 8: 235s compare to 7.4.8 on the same machine ;-) 1: 92s 2: 235s 4: 474s 8: did not try ... It seems to me the slock-no-cmpb is a win in any case. The spin-delay patch does not really help much on this machine. That seems to match Stephen Frost's results with EM64T, if I read them correctly. The cs rate is about 150 on CVS tip without patches and below 100 with the patches (all three cases). With 7.4.8 it's 23-28 with N>1. 8.1 is clearly the winner here. Great work, Tom. I hope some more data helps. Best Regards, Michael Paesold ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Michael Paesold [EMAIL PROTECTED] writes: It seems to me the slock-no-cmpb is a win in any case. The spin-delay patch does not really help much on this machine. That seems to match Stephen Frost's results with EM64T, if I read them correctly. Yeah, it's interesting that you both see slock-no-cmpb as saving some cycles and the second patch as giving them back. I wonder whether the integer-modulo-to-slow-the-loop trick is counterproductive on your machines. You both seem to be using hardware that will recognize rep;nop and maybe that's all that's needed. I probably should have broken down the spindelay patch into multiple components. But it's only a small change --- could you try simplifying the patched line if ((--spins % MAX_SPINS_PER_DELAY) == 0) to if (--spins == 0) and see how the patch does that way? regards, tom lane ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
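The two loop variants Tom is comparing can be seen in a stripped-down model of the delay loop. This is a toy, not the real s_lock(): the constant names are stand-ins for the patch's MAX_SPINS_PER_DELAY, and the simulated lock release stands in for another process clearing the lock.

```c
/* Toy model of the s_lock() delay loop, to show the line under test. */
#define SPINS_PER_DELAY 100
#define TOTAL_SPINS     1000

static int lock_word  = 1;      /* 1 = held by the (simulated) owner */
static int reads_left = 250;

/* simulate the holder releasing the lock after 250 polls */
static int
lock_is_held(void)
{
    if (--reads_left <= 0)
        lock_word = 0;
    return lock_word;
}

/* returns how many times we would have called pg_usleep()/sched_yield() */
static int
spin_with_modulo(void)
{
    int spins  = TOTAL_SPINS;
    int delays = 0;

    while (lock_is_held())
    {
        /* the patched line: delay once every SPINS_PER_DELAY iterations.
         * The simplification being proposed is just "if (--spins == 0)"
         * with spins reset from SPINS_PER_DELAY after each delay. */
        if ((--spins % SPINS_PER_DELAY) == 0)
            delays++;
    }
    return delays;
}
```

The modulo form keeps one large countdown and takes a delay on every multiple of SPINS_PER_DELAY; the simplified form counts a small counter straight down, which is cheaper per iteration since it avoids the integer divide, which is exactly why removing the modulo might matter on CPUs where rep;nop already throttles the loop.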
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote: I probably should have broken down the spindelay patch into multiple components. But it's only a small change --- could you try simplifying the patched line if ((--spins % MAX_SPINS_PER_DELAY) == 0) to if (--spins == 0) and see how the patch does that way? I'll do tomorrow morning (CEST, i.e. in about 11 hours). Best Regards, Michael Paesold ---(end of broadcast)--- TIP 3: Have you checked our extensive FAQ? http://www.postgresql.org/docs/faq
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
I wrote: ... and see how the patch does that way? BTW, please do look at vmstat 1 while running the test case corresponding to your number of processors. It's hard to tell from the runtime alone whether the patch is fully accomplishing its goal of reducing wasted cycles. If you see user CPU percent go to 100 and context swaps drop to a low value, though, you know it's working. regards, tom lane ---(end of broadcast)--- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: I kinda suspect that the cmpb test is a no-op or loss on all Intelish processors: it can only be a win if you expect a lot of contention for the spin lock, but in percentage terms we still have a very low conflict rate, so in most executions of the TAS macro, the cmpb is just wasted cycles. Bottom line: we definitely don't want it for x86_64, and maybe not at all, but we need more research to decide the latter. I think an important question is whether this is x86_64 in general, or Opteron-specific. It could be that it's not the same on Intel's EM64Ts. One reason this might behave differently for Opterons is that they're cc-NUMA instead of normal SMP. Something else to consider is the OS you're using. I've been told that Linux isn't that good at NUMA and FreeBSD might be better. Kurt ---(end of broadcast)--- TIP 4: Have you searched our list archives? http://archives.postgresql.org
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Kurt Roeckx [EMAIL PROTECTED] writes: On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: I kinda suspect that the cmpb test is a no-op or loss on all Intelish processors: I think an important question is whether this is x86_64 in general, or Opteron-specific. It could be that it's not the same on Intel's EM64Ts. Good point --- anyone have one to try? Something else to consider is the OS you're using. I've been told that Linux isn't that good at NUMA and FreeBSD might be better. It's hard to see how the OS could affect behavior at the level of processor cache operations --- unless they did something truly spectacularly stupid, like mark main memory non-cacheable. regards, tom lane ---(end of broadcast)--- TIP 5: don't forget to increase your free space map settings
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Tom Lane ([EMAIL PROTECTED]) wrote: Kurt Roeckx [EMAIL PROTECTED] writes: On Sun, Sep 11, 2005 at 05:59:49PM -0400, Tom Lane wrote: I kinda suspect that the cmpb test is a no-op or loss on all Intelish processors: I think an important question is whether this is x86_64 in general, or Opteron-specific. It could be that it's not the same on Intel's EM64Ts. Good point --- anyone have one to try? I've got one I can test on. I need to upgrade the kernel and some other things on it though (it's running 2.6.8 atm, and an older Debian/unstable which I should probably bring up to current). I'll work on it starting now and post results once I get some. Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Tom Lane ([EMAIL PROTECTED]) wrote: My proposal therefore is to do #2, #3, and #4, and to modify the TAS assembly code at least on x86_64. Together, these changes bring my test case on a 4-way Opteron from N, runtime: 1 36s 2 61s 4 105s 8 198s to N, runtime: 1 35s 2 48s 4 58s 8 105s On em64t (2 proc + 2 HT, 3.4 GHz, 4G, 2.6.12), unpatched: N, runtime: 1 31s 2 47s 4 86s 8 159s and with the patches: N, runtime: 1 32s 2 53s 4 90s 8 169s CPU utilization is definitely higher when running with the patch though. Hope this helps, happy to do additional testing if necessary. Thanks, Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Stephen Frost [EMAIL PROTECTED] writes: em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12: N, runtime: 1 31s 2 47s 4 86s 8 159s N, runtime: 1 32s 2 53s 4 90s 8 169s Er, which (or both) of the two patches did you apply here? regards, tom lane ---(end of broadcast)--- TIP 6: explain analyze is your friend
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Tom Lane ([EMAIL PROTECTED]) wrote: Stephen Frost [EMAIL PROTECTED] writes: em64t, 2 proc + 2 HT, 3.4ghz, 4G, 2.6.12: N, runtime: 1 31s 2 47s 4 86s 8 159s N, runtime: 1 32s 2 53s 4 90s 8 169s Er, which (or both) of the two patches did you apply here? Applied both, sorry that wasn't clear. Thanks, Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Stephen Frost [EMAIL PROTECTED] writes: * Tom Lane ([EMAIL PROTECTED]) wrote: Er, which (or both) of the two patches did you apply here? Applied both, sorry that wasn't clear. Thanks. If you've got the time, could you try the two patches separately and see what you get? regards, tom lane ---(end of broadcast)--- TIP 2: Don't 'kill -9' the postmaster
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane [EMAIL PROTECTED] wrote: My proposal therefore is to do #2, #3, and #4, and to modify the TAS assembly code at least on x86_64. Together, these changes bring my test case on a 4-way Opteron from Some changes are based on tests and heuristics, so could we fold them into the configure script so that the choice is made automatically? Regards, Qingqing ---(end of broadcast)--- TIP 1: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Tom Lane ([EMAIL PROTECTED]) wrote:

    Thanks.  If you've got the time, could you try the two patches
    separately and see what you get?

Sure.

CVS HEAD:

  N  runtime
  1  31s
  2  47s
  4  86s
  8  159s

With just slock-no-cmpb.patch:

  N  runtime
  1  32s
  2  39s
  4  82s
  8  167s

With just spin-delay.patch:

  N  runtime
  1  32s
  2  52s
  4  94s
  8  164s

With both:

  N  runtime
  1  32s
  2  53s
  4  90s
  8  169s

Hope that helps,
Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane [EMAIL PROTECTED] writes:

    Something else to consider is the OS you're using.  I've been told
    that Linux isn't that good in NUMA and FreeBSD might be better.

    It's hard to see how the OS could affect behavior at the level of
    processor cache operations --- unless they did something truly
    spectacularly stupid, like mark main memory non-cacheable.

Well, it could schedule processes onto processors in ways that force less-than-optimal memory usage patterns.  But maybe you should tell the Altix folk, with their 32-processor, 384GB NUMA machines, what you've been told about Linux not being that good at NUMA.  Really, these kinds of cargo-cult anecdotes are pretty pointless.

-- greg
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
* Stephen Frost ([EMAIL PROTECTED]) wrote:

    Thanks.  If you've got the time, could you try the two patches
    separately and see what you get?

    Sure. [...]

Just to be clear: this was from a completely default 'make install' using the Debian configure options from 8.0.3 (which aren't particularly interesting, really: nls, integer-datetimes, debug, disable-rpath, tcl, perl, python, pam, krb5, openssl, gnu-ld, enable-thread-safety).  If it'd be useful for the tests I can adjust whichever parameters are appropriate (work_mem, cache_size, etc.).  There's absolutely nothing else going on on the box except for these tests, a few shells, top, etc.

Enjoy,
Stephen
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Qingqing Zhou [EMAIL PROTECTED] writes:

    Some changes are based on tests and heuristics, so can we make them
    into the configure script so the choice could be made automatically?

It's a bit premature to propose that, when we don't yet know whether the suggested changes are a win or a loss under any conditions, let alone how to tell the difference automatically.  Right now we are in find-out-the-facts mode ...

regards, tom lane
Re: [HACKERS] Spinlocks, yet again: analysis and proposed patches
Tom Lane wrote:

    I attach two proposed patches: the first removes the cmpb/jne from
    the x86 TAS assembly code, and the second one does the s_lock changes
    enumerated as points #2, #3, #4.  The first one in particular needs
    more testing to see if it hurts performance on any non-Opteron x86
    chips.  (If so, we'd just make it conditional to x86_64.)

2x PIII, 1G, 2G, FreeBSD 6.0Beta4.

8.1beta1 (2005-08-28):

  N  runtime
  1  85s
  2  139s
  4  220s

8.1beta1 (2005-08-28) + patch 1 (s_lock.h only):

  N  runtime
  1  89s
  2  137s
  4  229s

8.1beta1 (2005-08-28) + patch 2:

  N  runtime
  1  84s
  2  108s
  4  214s

Observe the interesting little speed improvement for patch 2 with 2 processes (it seems to be repeatable).  Let me know if you want to see vmstat output for any of these.

regards,
Mark