Re: [HACKERS] measuring spinning
On Mon, Jun 18, 2012 at 9:36 PM, Robert Haas <robertmh...@gmail.com> wrote:
> On Sat, Jun 16, 2012 at 6:25 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>>> Well, this fell through the cracks, because I forgot to add it to the
>>> January CommitFest.  Here it is again, rebased.
>>
>> This applies and builds cleanly and passes make check (under
>> enable-cassert).  No tests or docs are needed for a patch of this
>> nature.  It does what it says, and we want it.
>>
>> I wondered if the change in the return signature of s_lock would have
>> an effect on performance.  So I've run a series of pgbench -T 30 -P
>> -c8 -j8, at a scale of 30 which fits in shared_buffers, using an
>> Amazon c1.xlarge (8 cores).  I ran both HEAD and HEAD+patch (without
>> LWLOCK_STATS in both cases), in random order.  The patch was 0.37%
>> slower, averaging 298483 selects per second patched versus 299582 for
>> HEAD.  The difference is probably real (p-value 0.042, one-sided) but
>> is also pretty much negligible, and could just be due to where the
>> executable code falls in the cache lines, which can move around with
>> other changes to the code.
>
> I found this a bit surprising, so I retested on the IBM POWER7 box at
> concurrencies from 1 to 64 and found some test runs faster and some
> test runs slower - maybe a few more faster than slower.  I could do a
> more exact analysis, but I'm inclined to think that if there is an
> effect here, it's probably just in the noise, and that the real effect
> you measured was, as you say, the result of cache-line shifting or
> some other uninteresting artifact that could just as easily move back
> the other way on some other commit.  Really, s_lock should not be
> getting called often enough to matter much, and ignoring the return
> value of that function shouldn't cost anyone much.
>
>> Two suggestions:
>>
>> In your original email you say "number of pg_usleep() calls that are
>> required to acquire each LWLock", but nothing in the code says this.
>> Just reading lwlock.c, I would naively assume it is reporting the
>> number of TAS spins, not the number of spin-delays (and in fact that
>> is what I did assume until I read your email more carefully).  A
>> comment somewhere in lwlock.c would be helpful.
>
> Yeah, or maybe I should just change the printout to say "spindelay"
> instead of "spin", and rename the variables likewise.
>
>> Also in lwlock.c,
>>
>>   if (sh_acquire_counts[i] || ex_acquire_counts[i] || block_counts[i] || spin_counts[i])
>>
>> I don't think we can have spins (or blocks, for that matter) unless
>> we have some acquires to have caused them, so the last two tests in
>> that line seem to be noise.
>
> Perhaps so, but I think it's probably safe and clear to just follow
> the existing code pattern.
>
>> Since my suggestions are minor, should I go ahead and mark this ready
>> for committer?
>
> If you think it should be committed, yes.

So Jeff did that, and I've now committed the patch.  Thanks for the review.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
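For readers following along, here is a sketch of the per-lock report loop being discussed, with the "spin" column renamed to "spindelay" as Robert suggests.  The function signature, array names, and exact output format are illustrative assumptions, not code copied from lwlock.c:

#include <stdio.h>

/*
 * Illustrative only: print one line per LWLock that this backend
 * actually touched.  As Jeff notes, a lock cannot have blocks or
 * spindelays without acquires, so the last two tests are redundant;
 * they are kept here only to mirror the existing code pattern.
 */
static void
print_lwlock_stats(int numLocks, int pid,
                   const unsigned *sh_acquire_counts,
                   const unsigned *ex_acquire_counts,
                   const unsigned *block_counts,
                   const unsigned *spin_delay_counts)
{
    for (int i = 0; i < numLocks; i++)
    {
        if (sh_acquire_counts[i] || ex_acquire_counts[i] ||
            block_counts[i] || spin_delay_counts[i])
            fprintf(stderr,
                    "PID %d lwlock %d: shacq %u exacq %u blk %u spindelay %u\n",
                    pid, i,
                    sh_acquire_counts[i], ex_acquire_counts[i],
                    block_counts[i], spin_delay_counts[i]);
    }
}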
Re: [HACKERS] measuring spinning
On Sat, Jun 16, 2012 at 6:25 PM, Jeff Janes <jeff.ja...@gmail.com> wrote:
>> Well, this fell through the cracks, because I forgot to add it to the
>> January CommitFest.  Here it is again, rebased.
>
> This applies and builds cleanly and passes make check (under
> enable-cassert).  No tests or docs are needed for a patch of this
> nature.  It does what it says, and we want it.
>
> I wondered if the change in the return signature of s_lock would have
> an effect on performance.  So I've run a series of pgbench -T 30 -P
> -c8 -j8, at a scale of 30 which fits in shared_buffers, using an
> Amazon c1.xlarge (8 cores).  I ran both HEAD and HEAD+patch (without
> LWLOCK_STATS in both cases), in random order.  The patch was 0.37%
> slower, averaging 298483 selects per second patched versus 299582 for
> HEAD.  The difference is probably real (p-value 0.042, one-sided) but
> is also pretty much negligible, and could just be due to where the
> executable code falls in the cache lines, which can move around with
> other changes to the code.

I found this a bit surprising, so I retested on the IBM POWER7 box at
concurrencies from 1 to 64 and found some test runs faster and some test
runs slower - maybe a few more faster than slower.  I could do a more
exact analysis, but I'm inclined to think that if there is an effect
here, it's probably just in the noise, and that the real effect you
measured was, as you say, the result of cache-line shifting or some other
uninteresting artifact that could just as easily move back the other way
on some other commit.  Really, s_lock should not be getting called often
enough to matter much, and ignoring the return value of that function
shouldn't cost anyone much.

> Two suggestions:
>
> In your original email you say "number of pg_usleep() calls that are
> required to acquire each LWLock", but nothing in the code says this.
> Just reading lwlock.c, I would naively assume it is reporting the
> number of TAS spins, not the number of spin-delays (and in fact that
> is what I did assume until I read your email more carefully).  A
> comment somewhere in lwlock.c would be helpful.

Yeah, or maybe I should just change the printout to say "spindelay"
instead of "spin", and rename the variables likewise.

> Also in lwlock.c,
>
>   if (sh_acquire_counts[i] || ex_acquire_counts[i] || block_counts[i] || spin_counts[i])
>
> I don't think we can have spins (or blocks, for that matter) unless we
> have some acquires to have caused them, so the last two tests in that
> line seem to be noise.

Perhaps so, but I think it's probably safe and clear to just follow the
existing code pattern.

> Since my suggestions are minor, should I go ahead and mark this ready
> for committer?

If you think it should be committed, yes.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] measuring spinning
On Thu, Jun 14, 2012 at 2:39 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Wed, Jan 11, 2012 at 8:48 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> I've had cause, a few times this development cycle, to want to
>> measure the amount of spinning on each lwlock in the system.  To that
>> end, I've found the attached patch useful.  Note that if you don't
>> define LWLOCK_STATS, this changes nothing except that the return
>> value from s_lock becomes int rather than void.  If you do define
>> LWLOCK_STATS, then LWLockAcquire() counts the number of pg_usleep()
>> calls that are required to acquire each LWLock, in addition to the
>> other statistics.  Since this has come up for me a few times now, I'd
>> like to propose including it in core.
>
> Well, this fell through the cracks, because I forgot to add it to the
> January CommitFest.  Here it is again, rebased.

This applies and builds cleanly and passes make check (under
enable-cassert).  No tests or docs are needed for a patch of this nature.
It does what it says, and we want it.

I wondered if the change in the return signature of s_lock would have an
effect on performance.  So I've run a series of pgbench -T 30 -P -c8 -j8,
at a scale of 30 which fits in shared_buffers, using an Amazon c1.xlarge
(8 cores).  I ran both HEAD and HEAD+patch (without LWLOCK_STATS in both
cases), in random order.  The patch was 0.37% slower, averaging 298483
selects per second patched versus 299582 for HEAD.  The difference is
probably real (p-value 0.042, one-sided) but is also pretty much
negligible, and could just be due to where the executable code falls in
the cache lines, which can move around with other changes to the code.

Two suggestions:

In your original email you say "number of pg_usleep() calls that are
required to acquire each LWLock", but nothing in the code says this.
Just reading lwlock.c, I would naively assume it is reporting the number
of TAS spins, not the number of spin-delays (and in fact that is what I
did assume until I read your email more carefully).  A comment somewhere
in lwlock.c would be helpful.

Also in lwlock.c,

  if (sh_acquire_counts[i] || ex_acquire_counts[i] || block_counts[i] || spin_counts[i])

I don't think we can have spins (or blocks, for that matter) unless we
have some acquires to have caused them, so the last two tests in that
line seem to be noise.

Since my suggestions are minor, should I go ahead and mark this ready for
committer?

Thanks,

Jeff
Re: [HACKERS] measuring spinning
On Thu, Jun 14, 2012 at 4:39 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> On Wed, Jan 11, 2012 at 8:48 PM, Robert Haas <robertmh...@gmail.com> wrote:
>> I've had cause, a few times this development cycle, to want to
>> measure the amount of spinning on each lwlock in the system.  To that
>> end, I've found the attached patch useful.  Note that if you don't
>> define LWLOCK_STATS, this changes nothing except that the return
>> value from s_lock becomes int rather than void.  If you do define
>> LWLOCK_STATS, then LWLockAcquire() counts the number of pg_usleep()
>> calls that are required to acquire each LWLock, in addition to the
>> other statistics.  Since this has come up for me a few times now, I'd
>> like to propose including it in core.
>
> Well, this fell through the cracks, because I forgot to add it to the
> January CommitFest.  Here it is again, rebased.

+1.  It might be too awkward to add, but it would be nice to be able to
fetch the number of spins as well as the number of delays.  (Aside: it's
a bit confusing that in s_lock.c 'delay' is used both for the hardware
sleep and for the yielding sleep.)

merlin
Re: [HACKERS] measuring spinning
On Fri, Jun 15, 2012 at 9:43 AM, Merlin Moncure <mmonc...@gmail.com> wrote:
> On Thu, Jun 14, 2012 at 4:39 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> On Wed, Jan 11, 2012 at 8:48 PM, Robert Haas <robertmh...@gmail.com> wrote:
>>> I've had cause, a few times this development cycle, to want to
>>> measure the amount of spinning on each lwlock in the system.  To
>>> that end, I've found the attached patch useful.  Note that if you
>>> don't define LWLOCK_STATS, this changes nothing except that the
>>> return value from s_lock becomes int rather than void.  If you do
>>> define LWLOCK_STATS, then LWLockAcquire() counts the number of
>>> pg_usleep() calls that are required to acquire each LWLock, in
>>> addition to the other statistics.  Since this has come up for me a
>>> few times now, I'd like to propose including it in core.
>>
>> Well, this fell through the cracks, because I forgot to add it to the
>> January CommitFest.  Here it is again, rebased.
>
> +1.  It might be too awkward to add, but it would be nice to be able
> to fetch the number of spins as well as the number of delays.  (Aside:
> it's a bit confusing that in s_lock.c 'delay' is used both for the
> hardware sleep and for the yielding sleep.)

Yeah, I'm inclined to keep it simple for now.  We can always change it
again later if someone comes up with something better.  I suspect that
the delay count tracks "contention bad enough that you should be worried"
pretty well, but of course I might be wrong.  The only thing I know for
sure is that I've found this useful in my own testing, and therefore
others testing with LWLOCK_STATS might also find it useful.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
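A hypothetical sketch of what Merlin's suggestion might look like - counting raw TAS spins alongside sleep-based delays.  This was not adopted (Robert kept only the delay count), and every name here (spin_stats, acquire_with_stats, the 1000-spin threshold) is an assumption for illustration rather than anything in s_lock.c:

#include <stdatomic.h>

/* Per-lock counters for both kinds of waiting. */
typedef struct spin_stats
{
    long    spins;      /* failed test-and-set attempts */
    long    delays;     /* sleeps taken after spinning too long */
} spin_stats;

static void
acquire_with_stats(atomic_flag *lock, spin_stats *st)
{
    int     spins_since_delay = 0;

    while (atomic_flag_test_and_set(lock))
    {
        st->spins++;                        /* count every failed TAS */
        if (++spins_since_delay >= 1000)    /* arbitrary threshold for the sketch */
        {
            st->delays++;                   /* this is what the committed patch reports */
            spins_since_delay = 0;
            /* a real implementation would sleep here, e.g. pg_usleep() */
        }
    }
}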
Re: [HACKERS] measuring spinning
On Wed, Jan 11, 2012 at 8:48 PM, Robert Haas <robertmh...@gmail.com> wrote:
> I've had cause, a few times this development cycle, to want to measure
> the amount of spinning on each lwlock in the system.  To that end,
> I've found the attached patch useful.  Note that if you don't define
> LWLOCK_STATS, this changes nothing except that the return value from
> s_lock becomes int rather than void.  If you do define LWLOCK_STATS,
> then LWLockAcquire() counts the number of pg_usleep() calls that are
> required to acquire each LWLock, in addition to the other statistics.
> Since this has come up for me a few times now, I'd like to propose
> including it in core.

Well, this fell through the cracks, because I forgot to add it to the
January CommitFest.  Here it is again, rebased.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

spindebug-v2.patch
Description: Binary data
Re: [HACKERS] measuring spinning
On Thu, Jan 12, 2012 at 1:48 AM, Robert Haas <robertmh...@gmail.com> wrote:
> Just to whet your appetite, here are the top spinners on a 32-client
> SELECT-only test on a 32-core Itanium server.  All the locks not shown
> below have two orders of magnitude less of a problem than these do.

Please can you repeat the test, focusing on minutes 10-30 of a 30-minute
test run.  That removes much of the noise induced during cache priming.

My suggested size of database is one that is 80% of the size of RAM, with
shared_buffers set to 40% of RAM or 2GB, whichever is higher.  That
represents the common case where people know how big their data is and
purchase RAM accordingly, then set shared_buffers in line with current
wisdom.

--
Simon Riggs                     http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] measuring spinning
On Thu, Jan 12, 2012 at 4:00 AM, Simon Riggs <si...@2ndquadrant.com> wrote:
> Please can you repeat the test, focusing on minutes 10-30 of a
> 30-minute test run.  That removes much of the noise induced during
> cache priming.
>
> My suggested size of database is one that is 80% of the size of RAM,
> with shared_buffers set to 40% of RAM or 2GB, whichever is higher.
> That represents the common case where people know how big their data
> is and purchase RAM accordingly, then set shared_buffers in line with
> current wisdom.

OK.  First, let me remember to attach the patch this time, since, as Thom
pointed out, I didn't do that before.

Second, those guidelines sound quite different from what I've heard
before, which is 25% of RAM up to a *maximum* of 8GB.  This machine has
383GB of RAM, so 40% of memory would be shared_buffers = 153GB, which is
a lot bigger than I've heard recommended in the past.  So I could
potentially do that test, but I'm a bit surprised by the proposed
configuration.

So far I have been focusing mostly on pgbench at scale factor 100,
because that fits into 8GB of shared_buffers.  Using a larger working set
that does not fit in shared_buffers will expose new bottlenecks, but I've
found it's most useful to me to try to isolate the effect of different
bottlenecks, which means running the simplest workload that demonstrates
at least one meaningful problem.  Anyway, please let me know your
thoughts on this.

On x86, we are apparently getting to the point where things scale pretty
well.  On Nate Boley's 32-core machine (which is still busy with other
work, so I can't do any more tests there right now), pgbench on unlogged
tables at scale factor 100 scales nearly linearly all the way out to 32
clients.  On permanent tables, it is bottlenecked by WALInsertLock and
throughput is roughly halved.  On the other hand, on Itanium, we don't
scale even on a read-only workload.  It is not yet clear to me why, but
part of the problem may be that spinlock contention seems to be much more
damaging on Itanium than it is on x86, and we have way too much of it.

I ran some 30-minute tests, and here are the top spinners.  I have not
isolated the first 10 minutes from the remainder of the tests, although
that could possibly be done.  I think I'd need to start the server, run
pgbench for 10 minutes, add a marker line to the log, restart pgbench for
another 20 minutes, and then only process the log data after the marker
point.  Doable, but I haven't done it yet, and I'm not convinced the
results would be much different.  I think latency would be a more
interesting thing to graph against time, but I haven't gotten there yet
either.
SELECT-only test:

lwlock 34: shacq 29192894 exacq 34021 blk 47 spin 6468
lwlock 45: shacq 29701770 exacq 34235 blk 42 spin 7393
lwlock 40: shacq 30353555 exacq 35257 blk 45 spin 7585
lwlock 33: shacq 34585153 exacq 35175 blk 61 spin 9450
lwlock 41: shacq 36603119 exacq 35135 blk 50 spin 10231
lwlock 48: shacq 41095856 exacq 35196 blk 55 spin 10723
lwlock 42: shacq 36566360 exacq 34619 blk 65 spin 11171
lwlock 47: shacq 40809540 exacq 34976 blk 77 spin 12168
lwlock 43: shacq 38972147 exacq 35256 blk 48 spin 12253
lwlock 35: shacq 41278265 exacq 35171 blk 71 spin 13023
lwlock 44: shacq 42915161 exacq 35296 blk 54 spin 13704
lwlock 46: shacq 45259281 exacq 35113 blk 69 spin 14633
lwlock 39: shacq 40758171 exacq 35174 blk 52 spin 15125
lwlock 38: shacq 44954521 exacq 34629 blk 70 spin 17291
lwlock 37: shacq 45250155 exacq 34834 blk 69 spin 17422
lwlock 328934: shacq 205772443 exacq 0 blk 0 spin 372594
lwlock 4: shacq 205775132 exacq 356 blk 162 spin 439076
lwlock 36: shacq 246651001 exacq 34587 blk 392 spin 508843

Unlogged tables:

lwlock 397680: shacq 27253037 exacq 18 blk 6 spin 7130
lwlock 328724: shacq 28593420 exacq 136 blk 17 spin 8417
lwlock 7: shacq 0 exacq 27803235 blk 713207 spin 10802
lwlock 3: shacq 430 exacq 27801696 blk 519648 spin 11206
lwlock 328942: shacq 56285945 exacq 0 blk 0 spin 39198
lwlock 46: shacq 6309 exacq 47931 blk 1677 spin 39419
lwlock 45: shacq 63003036 exacq 48607 blk 1669 spin 39541
lwlock 40: shacq 65456415 exacq 48770 blk 1751 spin 44276
lwlock 33: shacq 68438336 exacq 48371 blk 1882 spin 45955
lwlock 35: shacq 68170010 exacq 48193 blk 1869 spin 47024
lwlock 39: shacq 65045489 exacq 47748 blk 1798 spin 47246
lwlock 41: shacq 67985476 exacq 48075 blk 1778 spin 48409
lwlock 37: shacq 69042800 exacq 48529 blk 1854 spin 49465
lwlock 34: shacq 71838042 exacq 48021 blk 1816 spin 51078
lwlock 38: shacq 69249910 exacq 48673 blk 1861 spin 51443
lwlock 36: shacq 84116654 exacq 48646 blk 1832 spin 52778
lwlock 44: shacq 74827084 exacq 48039 blk 1934 spin 54133
lwlock 42: shacq 76580702 exacq 47829 blk 2045 spin 60652
lwlock 48: shacq 95540241 exacq 48460 blk 2295 spin 69160
lwlock 47: shacq 104968331 exacq 48252 blk 2998 spin 101368
lwlock 11: shacq 103820073 exacq 28055301 blk 5097966 spin 138136
lwlock 43: shacq 128230139 exacq 48940 blk 3785 spin 168432
lwlock 4: shacq
[HACKERS] measuring spinning
I've had cause, a few times this development cycle, to want to measure
the amount of spinning on each lwlock in the system.  To that end, I've
found the attached patch useful.  Note that if you don't define
LWLOCK_STATS, this changes nothing except that the return value from
s_lock becomes int rather than void.  If you do define LWLOCK_STATS, then
LWLockAcquire() counts the number of pg_usleep() calls that are required
to acquire each LWLock, in addition to the other statistics.  Since this
has come up for me a few times now, I'd like to propose including it in
core.

Just to whet your appetite, here are the top spinners on a 32-client
SELECT-only test on a 32-core Itanium server.  All the locks not shown
below have two orders of magnitude less of a problem than these do.

lwlock 48: shacq 6042357 exacq 34590 blk 53 spin 1288
lwlock 42: shacq 5014729 exacq 34942 blk 58 spin 1321
lwlock 43: shacq 5448771 exacq 34725 blk 44 spin 1608
lwlock 44: shacq 6420992 exacq 34980 blk 67 spin 1713
lwlock 35: shacq 6353111 exacq 34256 blk 59 spin 1784
lwlock 38: shacq 6052801 exacq 34913 blk 70 spin 1801
lwlock 46: shacq 6401413 exacq 34698 blk 58 spin 1879
lwlock 36: shacq 6051887 exacq 35309 blk 73 spin 1915
lwlock 45: shacq 6812817 exacq 35170 blk 59 spin 1924
lwlock 33: shacq 6793666 exacq 35009 blk 59 spin 1955
lwlock 34: shacq 6395994 exacq 34941 blk 58 spin 2019
lwlock 40: shacq 6388136 exacq 34579 blk 51 spin 2054
lwlock 37: shacq 7250574 exacq 35242 blk 73 spin 2409
lwlock 39: shacq 7109729 exacq 34892 blk 65 spin 2632
lwlock 47: shacq 8243963 exacq 35256 blk 88 spin 3018
lwlock 328936: shacq 33992167 exacq 0 blk 0 spin 59816
lwlock 4: shacq 33994583 exacq 168 blk 40 spin 74018
lwlock 41: shacq 40098590 exacq 35001 blk 440 spin 81324
grand total: shacq 361775238 exacq 172965995 blk 2315 spin 245342

This is a 5-minute test run.  If I had to take a shot in the dark, the
buffer mapping lock in the #1 spot is the one protecting the root page of
pgbench_accounts_pkey, and the high-numbered lock in position #3 is the
content lock on the buffer itself.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
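To make the mechanism described above concrete, here is a minimal standalone sketch of the idea - not the actual patch, which lives in s_lock.c and lwlock.c: the lock-acquire routine returns how many times it had to sleep, and the caller folds that into a per-lock counter.  All names here (my_s_lock, spin_delay_counts, the 1000-iteration threshold) are illustrative assumptions.

#include <stdatomic.h>
#include <stdio.h>
#include <unistd.h>

#define NUM_LOCKS 4

static atomic_flag locks[NUM_LOCKS];            /* stand-ins for spinlocks */
static long spin_delay_counts[NUM_LOCKS];       /* sleeps taken per lock */

/* Returns the number of sleeps needed to get the lock (0 if uncontended). */
static int
my_s_lock(atomic_flag *lock)
{
    int delays = 0;
    int spins = 0;

    while (atomic_flag_test_and_set(lock))
    {
        if (++spins >= 1000)    /* after spinning a while, sleep and count it */
        {
            usleep(1000);
            delays++;
            spins = 0;
        }
    }
    return delays;
}

int
main(void)
{
    int id = 0;

    spin_delay_counts[id] += my_s_lock(&locks[id]);     /* acquire + record */
    atomic_flag_clear(&locks[id]);                      /* release */

    printf("lock %d: spindelay %ld\n", id, spin_delay_counts[id]);
    return 0;
}

The point of the design is that when stats are disabled the only change is an ignored int return value, which matches the "changes nothing" claim in the proposal.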
Re: [HACKERS] measuring spinning
On 12 January 2012 01:48, Robert Haas <robertmh...@gmail.com> wrote:
> I've had cause, a few times this development cycle, to want to measure
> the amount of spinning on each lwlock in the system.  To that end,
> I've found the attached patch useful.  Note that if you don't define
> LWLOCK_STATS, this changes nothing except that the return value from
> s_lock becomes int rather than void.  If you do define LWLOCK_STATS,
> then LWLockAcquire() counts the number of pg_usleep() calls that are
> required to acquire each LWLock, in addition to the other statistics.
> Since this has come up for me a few times now, I'd like to propose
> including it in core.
>
> Just to whet your appetite, here are the top spinners on a 32-client
> SELECT-only test on a 32-core Itanium server.  All the locks not shown
> below have two orders of magnitude less of a problem than these do.
>
> lwlock 48: shacq 6042357 exacq 34590 blk 53 spin 1288
> lwlock 42: shacq 5014729 exacq 34942 blk 58 spin 1321
> lwlock 43: shacq 5448771 exacq 34725 blk 44 spin 1608
> lwlock 44: shacq 6420992 exacq 34980 blk 67 spin 1713
> lwlock 35: shacq 6353111 exacq 34256 blk 59 spin 1784
> lwlock 38: shacq 6052801 exacq 34913 blk 70 spin 1801
> lwlock 46: shacq 6401413 exacq 34698 blk 58 spin 1879
> lwlock 36: shacq 6051887 exacq 35309 blk 73 spin 1915
> lwlock 45: shacq 6812817 exacq 35170 blk 59 spin 1924
> lwlock 33: shacq 6793666 exacq 35009 blk 59 spin 1955
> lwlock 34: shacq 6395994 exacq 34941 blk 58 spin 2019
> lwlock 40: shacq 6388136 exacq 34579 blk 51 spin 2054
> lwlock 37: shacq 7250574 exacq 35242 blk 73 spin 2409
> lwlock 39: shacq 7109729 exacq 34892 blk 65 spin 2632
> lwlock 47: shacq 8243963 exacq 35256 blk 88 spin 3018
> lwlock 328936: shacq 33992167 exacq 0 blk 0 spin 59816
> lwlock 4: shacq 33994583 exacq 168 blk 40 spin 74018
> lwlock 41: shacq 40098590 exacq 35001 blk 440 spin 81324
> grand total: shacq 361775238 exacq 172965995 blk 2315 spin 245342
>
> This is a 5-minute test run.  If I had to take a shot in the dark, the
> buffer mapping lock in the #1 spot is the one protecting the root page
> of pgbench_accounts_pkey, and the high-numbered lock in position #3 is
> the content lock on the buffer itself.

Patch missing.

--
Thom