Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
On Wed, Feb 10, 2016 at 10:55:10AM -0500, Tom Lane wrote:
> Interestingly, we seem to have managed to greatly reduce the "other"
> time (which I presume is basically mdpostckpt unlinking) since 9.2.
> The worst case observed in HEAD is about 100s:
>
> regression=# select ts,write,sync,total-write-sync as other, total from
> chkpts order by total-write-sync desc limit 10;
>              ts             |  write  |  sync   |  other  |  total
> ----------------------------+---------+---------+---------+---------
>  2016-02-09 08:27:04.365-08 |  14.479 | 298.043 | 100.586 | 413.108
>  2016-02-09 06:15:22.477-08 |  75.099 |  26.590 |  90.452 | 192.141
>  2016-02-09 16:31:34.661-08 | 254.462 | 186.246 |  51.541 | 492.249
>  2016-02-09 15:25:01.586-08 |   0.015 |   2.985 |  47.822 |  50.822
>  2016-02-09 11:40:13.565-08 | 287.327 | 314.218 |  44.059 | 645.604
>  2016-02-09 08:56:51.339-08 | 135.464 |  31.242 |  39.131 | 205.837
>  2016-02-09 20:23:52.797-08 |   1.309 |  12.155 |  36.350 |  49.814
>  2016-02-09 00:51:41.157-08 | 134.502 | 103.482 |  34.110 | 272.094
>  2016-02-09 19:28:33.154-08 |   0.575 |  20.434 |  33.803 |  54.812
>  2016-02-09 03:47:02.57-08  |   0.066 |  37.475 |  33.119 |  70.660
> (10 rows)
>
> but 9.2's worst cases greatly exceed that:
>
> regression=# select ts,write,sync,total-write-sync as other, total from
> chkpts92 order by total-write-sync desc limit 10;
>              ts             |  write  |  sync   |  other  |  total
> ----------------------------+---------+---------+---------+---------
>  2016-02-09 03:27:12.904-08 |  48.761 | 124.453 | 299.877 | 473.091
>  2016-02-09 16:31:25.657-08 |   3.860 | 184.042 | 248.488 | 436.390
>  2016-02-08 23:33:36.16-08  |  66.904 | 397.653 | 229.020 | 693.577
>  2016-02-09 11:39:57.061-08 |   2.471 | 303.924 | 201.923 | 508.318
>  2016-02-09 07:37:44.48-08  |   3.317 |   1.494 | 158.159 | 162.970
>  2016-02-09 04:32:49.668-08 |  11.328 | 292.310 | 148.088 | 451.726
>  2016-02-09 12:37:43.52-08  |  17.620 | 333.983 | 141.172 | 492.775
>  2016-02-09 08:27:04.407-08 |  12.019 | 300.510 | 102.855 | 415.384
>  2016-02-09 18:17:32.193-08 |  90.370 |  25.369 |  63.626 | 179.365
>  2016-02-09 11:02:14.977-08 |   2.713 |   2.306 |  38.581 |  43.600
> (10 rows)
>
> Not real sure where that improvement came from. We could hardly
> be unlinking fewer files overall, and the bottleneck seems to be
> in the kernel, so what changed? The *average*, as opposed to worst
> case, checkpoint time has definitely grown since 9.2:

Good question; I don't know. I would have expected 9.2's xlog-driven
checkpoints to create more stability than HEAD's time-driven checkpoints.

On the AIX animals, I have now set PGCTLTIMEOUT=900 and removed the
"-t 120" arguments.
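For illustration, once the PGCTLTIMEOUT patch shown later in this thread is
applied, raising an animal's timeout reduces to exporting one variable. This
is only a sketch; exactly where the export belongs depends on how the
animal's environment is configured:

    # Raise pg_ctl's default wait on a slow animal via the environment
    # (hypothetical setup; the 900s value matches what was used on AIX).
    export PGCTLTIMEOUT=900
    pg_ctl -D "$PGDATA" -w -m fast stop   # waits up to 900s, no -t needed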
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
On 02/10/2016 12:53 PM, Tom Lane wrote:
> Andrew Dunstan writes:
>> Yeah. It's faintly possible that a kernel upgrade will help.
>> Another data point. I have another RPi2B that is running Debian Wheezy
>> rather than the Fedora remix. I'm running the same test on it we ran
>> yesterday on axolotl. It seems to be running without having the same
>> problems. So maybe the particular kernel port to ARM is what's tripping
>> us up.
> I'd bet against it being a port-specific problem; it sounds more like a
> filesystem performance-tuning issue, which could easily change from one
> kernel version to another. What are the kernel version numbers exactly?

axolotl: Linux pirouette 3.18.13-501.20150510gitf36e19f.sc20.armv7hl.bcm2709 #1 SMP PREEMPT
debian:  Linux piratic 3.18.7-v7+ #755 SMP PREEMPT

cheers

andrew
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
Andrew Dunstan writes:
>> Yeah. It's faintly possible that a kernel upgrade will help.
> Another data point. I have another RPi2B that is running Debian Wheezy
> rather than the Fedora remix. I'm running the same test on it we ran
> yesterday on axolotl. It seems to be running without having the same
> problems. So maybe the particular kernel port to ARM is what's tripping
> us up.

I'd bet against it being a port-specific problem; it sounds more like a
filesystem performance-tuning issue, which could easily change from one
kernel version to another. What are the kernel version numbers exactly?

regards, tom lane
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
On 02/09/2016 11:21 PM, Andrew Dunstan wrote:
>> The idea I was toying with is that previous filesystem activity (making
>> the temp install, the server's never-fsync'd writes, etc) has built up a
>> bunch of dirty kernel buffers, and at some point the kernel goes nuts
>> writing all that data. So the issues we're seeing would come and go
>> depending on the timing of that I/O spike. I'm not sure how to prove
>> such a theory from here.
> Yeah. It's faintly possible that a kernel upgrade will help.

Another data point. I have another RPi2B that is running Debian Wheezy
rather than the Fedora remix. I'm running the same test on it we ran
yesterday on axolotl. It seems to be running without having the same
problems. So maybe the particular kernel port to ARM is what's tripping
us up.

cheers

andrew
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
On 2016-02-09 22:27:07 -0500, Tom Lane wrote:
> The idea I was toying with is that previous filesystem activity (making
> the temp install, the server's never-fsync'd writes, etc) has built up a
> bunch of dirty kernel buffers, and at some point the kernel goes nuts
> writing all that data. So the issues we're seeing would come and go
> depending on the timing of that I/O spike. I'm not sure how to prove
> such a theory from here.

It'd be interesting to monitor the output of

    $ grep -E '^(Dirty|Writeback):' /proc/meminfo

at least on Linux. It's terribly easy to get the kernel into a state
where it has so much data needing to be written back that an immediate
checkpoint takes pretty much forever.

If I understand the code correctly, once a buffer has been placed into
'writeback', it'll be more-or-less processed in order. That can happen
e.g. because those buffers were written to more than 30s ago. If there
are later buffers that also need to be written back (e.g. due to an
fsync()), you'll often wait for the earlier ones.

Andres
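A minimal sampler along those lines might look like this (a sketch;
meminfo.log is a hypothetical output file, and /proc/meminfo is
Linux-specific):

    # Record dirty/writeback counters once per second during a test run.
    while sleep 1; do
        printf '%s ' "$(date +%T)"
        grep -E '^(Dirty|Writeback):' /proc/meminfo | tr '\n' ' '
        echo
    done >> meminfo.log

Correlating spikes in these counters with checkpoint timings in the server
log would support or refute the dirty-buffer-backlog theory.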
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
Noah Misch writes:
>>> That's reasonable. If you would like higher-fidelity data, I can run loops of
>>> "pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and I could run
>>> that for HEAD and 9.2 simultaneously. A day of logs from that should show
>>> clearly if HEAD is systematically worse than 9.2.
>> That sounds like a fine plan, please do it.
> Log files:
> HEAD: https://drive.google.com/uc?export=download&id=0B9IURs2-_2ZMakl2TjFHUlpvc1k
> 92: https://drive.google.com/uc?export=download&id=0B9IURs2-_2ZMYVZtY3VqcjBFX1k

Thanks for doing this. I did a bit of analysis of these logs. There is
no case in the HEAD log where 'lock files all released' comes out more
than 2.221 seconds after 'shutdown checkpoint complete'. So it doesn't
appear that stats writing is ever a big problem for your machine.

The checkpoints themselves, though ... here are the slowest few
checkpoints on HEAD:

regression=# select ts,write,sync,total-write-sync as other, total from
chkpts order by total desc limit 10;
             ts             |  write  |  sync   |  other  |  total
----------------------------+---------+---------+---------+---------
 2016-02-09 03:33:14.671-08 | 287.691 | 355.528 |  12.509 | 655.728
 2016-02-09 11:40:13.565-08 | 287.327 | 314.218 |  44.059 | 645.604
 2016-02-09 12:36:49.753-08 | 149.652 | 481.736 |  13.094 | 644.482
 2016-02-09 04:32:53.708-08 | 214.896 | 304.102 |  17.929 | 536.927
 2016-02-09 16:31:34.661-08 | 254.462 | 186.246 |  51.541 | 492.249
 2016-02-09 08:27:04.365-08 |  14.479 | 298.043 | 100.586 | 413.108
 2016-02-09 07:37:57.999-08 | 280.960 |  10.408 |  14.287 | 305.655
 2016-02-09 07:06:30.121-08 | 155.331 |  93.433 |  23.814 | 272.578
 2016-02-09 17:45:34.422-08 | 149.471 | 104.185 |  18.598 | 272.254
 2016-02-09 00:51:41.157-08 | 134.502 | 103.482 |  34.110 | 272.094
(10 rows)

and the same for 9.2:

regression=# select ts,write,sync,total-write-sync as other, total from
chkpts92 order by total desc limit 10;
             ts             |  write  |  sync   |  other  |  total
----------------------------+---------+---------+---------+---------
 2016-02-08 23:33:36.16-08  |  66.904 | 397.653 | 229.020 | 693.577
 2016-02-09 11:39:57.061-08 |   2.471 | 303.924 | 201.923 | 508.318
 2016-02-09 12:37:43.52-08  |  17.620 | 333.983 | 141.172 | 492.775
 2016-02-09 03:27:12.904-08 |  48.761 | 124.453 | 299.877 | 473.091
 2016-02-09 04:32:49.668-08 |  11.328 | 292.310 | 148.088 | 451.726
 2016-02-09 16:31:25.657-08 |   3.860 | 184.042 | 248.488 | 436.390
 2016-02-09 08:27:04.407-08 |  12.019 | 300.510 | 102.855 | 415.384
 2016-02-09 03:32:30.718-08 | 296.907 |  15.260 |   5.644 | 317.811
 2016-02-09 15:31:41.691-08 |   0.385 | 307.996 |   0.155 | 308.536
 2016-02-09 12:23:10.92-08  |  93.634 | 161.581 |  19.342 | 274.557
(10 rows)

It looks like you're going to have to set PGCTLTIMEOUT somewhere north
of 10 minutes to keep these animals from failing under load.

Interestingly, we seem to have managed to greatly reduce the "other"
time (which I presume is basically mdpostckpt unlinking) since 9.2.
The worst case observed in HEAD is about 100s:

regression=# select ts,write,sync,total-write-sync as other, total from
chkpts order by total-write-sync desc limit 10;
             ts             |  write  |  sync   |  other  |  total
----------------------------+---------+---------+---------+---------
 2016-02-09 08:27:04.365-08 |  14.479 | 298.043 | 100.586 | 413.108
 2016-02-09 06:15:22.477-08 |  75.099 |  26.590 |  90.452 | 192.141
 2016-02-09 16:31:34.661-08 | 254.462 | 186.246 |  51.541 | 492.249
 2016-02-09 15:25:01.586-08 |   0.015 |   2.985 |  47.822 |  50.822
 2016-02-09 11:40:13.565-08 | 287.327 | 314.218 |  44.059 | 645.604
 2016-02-09 08:56:51.339-08 | 135.464 |  31.242 |  39.131 | 205.837
 2016-02-09 20:23:52.797-08 |   1.309 |  12.155 |  36.350 |  49.814
 2016-02-09 00:51:41.157-08 | 134.502 | 103.482 |  34.110 | 272.094
 2016-02-09 19:28:33.154-08 |   0.575 |  20.434 |  33.803 |  54.812
 2016-02-09 03:47:02.57-08  |   0.066 |  37.475 |  33.119 |  70.660
(10 rows)

but 9.2's worst cases greatly exceed that:

regression=# select ts,write,sync,total-write-sync as other, total from
chkpts92 order by total-write-sync desc limit 10;
             ts             |  write  |  sync   |  other  |  total
----------------------------+---------+---------+---------+---------
 2016-02-09 03:27:12.904-08 |  48.761 | 124.453 | 299.877 | 473.091
 2016-02-09 16:31:25.657-08 |   3.860 | 184.042 | 248.488 | 436.390
 2016-02-08 23:33:36.16-08  |  66.904 | 397.653 | 229.020 | 693.577
 2016-02-09 11:39:57.061-08 |   2.471 | 303.924 | 201.923 | 508.318
 2016-02-09 07:37:44.48-08  |   3.317 |   1.494 | 158.159 | 162.970
 2016-02-09 04:32:49.668-08 |  11.328 | 292.310 | 148.088 | 451.726
 2016-02-09 12:37:43.52-08  |  17.620 | 333.983 | 141.172 | 492.775
 2016-02-09 08:27:04.407-08 |  12.019 | 300.510 | 102.855 | 415.384
 2016-02-09 18:17:32.193-08 |  90.370 |  25.369 |  63.626 | 179.365
 2016-02-09 11:02:14.977-08 |   2.713 |   2.306 |  38.581 |  43.600
(10 rows)

Not real sure where that improvement came from. We could hardly
be unlinking fewer files overall, and the bottleneck seems to be
in the kernel, so what changed? The *average*, as opposed to worst
case, checkpoint time has definitely grown since 9.2:
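How the chkpts tables were built is not shown in this thread; one plausible
way, given the log_line_prefix = '%p %m ' used for these runs, is to scrape
the "checkpoint complete" lines into CSV and load that into a table (a
sketch; the file names and the sed field positions are assumptions):

    # Extract (ts, write, sync, total) from each "checkpoint complete" line.
    # Assumes each line starts with a pid and a two-token timestamp ('%p %m ').
    grep 'checkpoint complete:' postmaster.log |
    sed -E 's/^[0-9]+ ([^ ]+ [^ ]+).*write=([0-9.]+) s, sync=([0-9.]+) s, total=([0-9.]+) s.*/\1,\2,\3,\4/' > chkpts.csv

    psql -X -c "CREATE TABLE chkpts (ts timestamptz, write float8, sync float8, total float8)"
    psql -X -c "\copy chkpts FROM 'chkpts.csv' CSV"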
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
On Mon, Feb 08, 2016 at 10:55:24PM -0500, Tom Lane wrote:
> Noah Misch writes:
> > On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
> >> We've seen variants
> >> on this theme on half a dozen machines just in the past week --- and it
> >> seems to mostly happen in 9.5 and HEAD, which is fishy.
>
> > It has been affecting only the four AIX animals, which do share hardware.
> > (Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)
>
> Certainly your AIX critters have shown this a bunch, but here's another
> current example:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23

Oops; I did not consider Monday's results before asserting that.

> > That's reasonable. If you would like higher-fidelity data, I can run loops of
> > "pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and I could run
> > that for HEAD and 9.2 simultaneously. A day of logs from that should show
> > clearly if HEAD is systematically worse than 9.2.
>
> That sounds like a fine plan, please do it.

Log files:
HEAD: https://drive.google.com/uc?export=download&id=0B9IURs2-_2ZMakl2TjFHUlpvc1k
92: https://drive.google.com/uc?export=download&id=0B9IURs2-_2ZMYVZtY3VqcjBFX1k

While I didn't study those logs in detail, a few things jumped out. Since
9.2, we've raised the default shared_buffers from 32MB to 128MB, and we've
replaced checkpoint_segments=3 with max_wal_size=1GB. Both changes
encourage bulkier checkpoints. The 9.2 test runs get one xlog-driven
checkpoint before the shutdown checkpoint, while HEAD gets one time-driven
checkpoint. Also, the HEAD suite just tests more things. Here's
pg_stat_bgwriter afterward:

HEAD:
 checkpoints_timed     | 156
 checkpoints_req       | 799
 checkpoint_write_time | 16035847
 checkpoint_sync_time  | 6555396
 buffers_checkpoint    | 744131
 buffers_clean         | 0
 maxwritten_clean      | 0
 buffers_backend       | 3023444
 buffers_backend_fsync | 0
 buffers_alloc         | 1777010
 stats_reset           | 2016-02-08 21:04:24.499607-08

9.2:
 checkpoints_timed     | 39
 checkpoints_req       | 1369
 checkpoint_write_time | 14875776
 checkpoint_sync_time  | 8397536
 buffers_checkpoint    | 396272
 buffers_clean         | 466392
 maxwritten_clean      | 1336
 buffers_backend       | 1961531
 buffers_backend_fsync | 0
 buffers_alloc         | 1681324
 stats_reset           | 2016-02-08 21:09:21.925487-08

Most notable there is the lack of bgwriter help in HEAD.

The clusters had initdb-default configuration apart from these additions:

 listen_addresses = ''
 log_line_prefix = '%p %m '
 logging_collector = on
 log_autovacuum_min_duration = 0
 log_checkpoints = on
 log_lock_waits = on
 log_temp_files = 128kB

I should have added fsync=off, too. Notice how the AIX animals failed
left and right today, likely thanks to contention from these runs.

On Tue, Feb 09, 2016 at 02:10:50PM -0500, Tom Lane wrote:
> (1) Slow file system, specifically slow unlink, is the core of the
> problem. (I wonder if the AIX critters are using an NFS filesystem?)

The buildfarm files lie on 600 GiB SAS disks. I suspect metadata
operations, like unlink(), bottleneck on the jfs2 journal.

On Tue, Feb 09, 2016 at 03:05:01PM -0500, Tom Lane wrote:
> I'm now in favor of applying the PGCTLTIMEOUT patch Noah proposed, and
> *removing* the two existing hacks in run_build.pl that try to force -t 120.
>
> The only real argument I can see against that approach is that we'd have
> to back-patch the PGCTLTIMEOUT patch to all active branches if we want
> to stop the buildfarm failures. We don't usually back-patch feature
> additions. On the other hand, this wouldn't be the first time we've
> back-patched something on grounds of helping the buildfarm, so I find
> that argument pretty weak.

If I were a purist against back-patching features, I might name the
variable PGINTERNAL_TEST_PGCTLTIMEOUT and not document it. Meh to that.
I'll plan to commit the original tomorrow.
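For reference, pg_stat_bgwriter snapshots like the ones above can be
bracketed around a run in the obvious way (a sketch):

    # Reset the shared bgwriter counters, run the suite, then dump the counters.
    psql -X -c "SELECT pg_stat_reset_shared('bgwriter');"
    make installcheck
    psql -X -x -c "SELECT * FROM pg_stat_bgwriter;"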
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
On 02/09/2016 10:27 PM, Tom Lane wrote:
> Noah Misch writes:
>> On Tue, Feb 09, 2016 at 10:02:17PM -0500, Tom Lane wrote:
>>> I wonder if it's worth sticking some instrumentation into stats
>>> collector shutdown?
>> I wouldn't be surprised if the collector got backlogged during the main
>> phase of testing and took awhile to chew through its message queue before
>> even starting the write of the final stats.
> But why would the ecpg tests show such an effect when the main regression
> tests don't? AFAIK the ecpg tests don't exactly stress the server ---
> note the trivial amount of data written by the shutdown checkpoint, for
> instance.

The main regression tests run with the stats file on the ramdisk.

> The other weird thing is that it's only sometimes slow. If you look at
> the last buildfarm result from axolotl, for instance, the tail end of the
> ecpg log is
>
> LOG: ShutdownSUBTRANS() complete at 2016-02-09 16:31:14.784 EST
> LOG: database system is shut down at 2016-02-09 16:31:14.784 EST
> LOG: lock files all released at 2016-02-09 16:31:14.817 EST
>
> so we only spent ~50ms on stats write that time.

That part is puzzling.

> The idea I was toying with is that previous filesystem activity (making
> the temp install, the server's never-fsync'd writes, etc) has built up a
> bunch of dirty kernel buffers, and at some point the kernel goes nuts
> writing all that data. So the issues we're seeing would come and go
> depending on the timing of that I/O spike. I'm not sure how to prove
> such a theory from here.

Yeah. It's faintly possible that a kernel upgrade will help.

cheers

andrew
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
Noah Misch writes:
> On Tue, Feb 09, 2016 at 10:02:17PM -0500, Tom Lane wrote:
>> I wonder if it's worth sticking some instrumentation into stats
>> collector shutdown?
> I wouldn't be surprised if the collector got backlogged during the main
> phase of testing and took awhile to chew through its message queue before
> even starting the write of the final stats.

But why would the ecpg tests show such an effect when the main regression
tests don't? AFAIK the ecpg tests don't exactly stress the server ---
note the trivial amount of data written by the shutdown checkpoint, for
instance.

The other weird thing is that it's only sometimes slow. If you look at
the last buildfarm result from axolotl, for instance, the tail end of the
ecpg log is

LOG: ShutdownSUBTRANS() complete at 2016-02-09 16:31:14.784 EST
LOG: database system is shut down at 2016-02-09 16:31:14.784 EST
LOG: lock files all released at 2016-02-09 16:31:14.817 EST

so we only spent ~50ms on stats write that time.

The idea I was toying with is that previous filesystem activity (making
the temp install, the server's never-fsync'd writes, etc) has built up a
bunch of dirty kernel buffers, and at some point the kernel goes nuts
writing all that data. So the issues we're seeing would come and go
depending on the timing of that I/O spike. I'm not sure how to prove
such a theory from here.

regards, tom lane
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
Jim Nasby writes:
> On 2/8/16 2:45 PM, Tom Lane wrote:
>> I had in mind to just "git revert" the patch when we're done with it.
> It's already difficult enough for DBAs to debug some performance issues,
> so getting rid of logging is a step backwards. I realize it's unlikely
> that you could run DEBUG2 in a prod environment, but we still shouldn't
> be reducing visibility.

Meh. It seems clear to me that we should move the "database system is
shut down" message so that it comes out only after the postmaster has
removed its lockfiles (and hence is really gone so far as any outside
vantage point is concerned). But I do not think any of the rest of what
I put in has any lasting value.

regards, tom lane
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
On Tue, Feb 09, 2016 at 10:02:17PM -0500, Tom Lane wrote:
> Still, it seems clear that the bulk of the shutdown time is indeed the
> stats collector taking its time about shutting down, which is doubly
> weird because the ecpg tests shouldn't have created very many tables,
> so why would there be a lot of data to write? Even granting that it's
> not writing to ramdisk, 57 seconds to shut down seems pretty excessive.
>
> I wonder if it's worth sticking some instrumentation into stats
> collector shutdown?

I wouldn't be surprised if the collector got backlogged during the main
phase of testing and took awhile to chew through its message queue before
even starting the write of the final stats.
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
Andrew Dunstan writes:
> anyway, we got a failure pretty quickly:
> pg_ctl: server does not shut down at 2016-02-09 21:10:11.914 EST
> ...
> LOG: received fast shutdown request at 2016-02-09 21:09:11.824 EST
> ...
> LOG: checkpointer dead at 2016-02-09 21:09:14.683 EST
> LOG: all children dead at 2016-02-09 21:10:11.184 EST
> ...
> LOG: lock files all released at 2016-02-09 21:10:11.211 EST

Hmm. Apparently, pg_ctl gave up exactly one second too early. The way
the wait loop in pg_ctl is coded, it waits one second more after the last
get_pgpid() probe before complaining --- so the last time it looked for
the pidfile was approximately 21:10:10.914, just 300ms before the
postmaster removed it. I wonder if that's entirely coincidence.

Still, it seems clear that the bulk of the shutdown time is indeed the
stats collector taking its time about shutting down, which is doubly
weird because the ecpg tests shouldn't have created very many tables,
so why would there be a lot of data to write? Even granting that it's
not writing to ramdisk, 57 seconds to shut down seems pretty excessive.

I wonder if it's worth sticking some instrumentation into stats
collector shutdown?

regards, tom lane
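The shape of that wait loop, rendered as a shell sketch (the real
implementation is C, in src/bin/pg_ctl/pg_ctl.c), shows why the final
pidfile probe lands about a second before the failure message:

    # Simplified sketch of pg_ctl's shutdown wait loop: the probe precedes
    # each one-second sleep, so the last check is ~1s before the timeout.
    wait_seconds=60
    for i in $(seq 1 "$wait_seconds"); do
        if [ ! -f "$PGDATA/postmaster.pid" ]; then
            echo "server stopped"
            exit 0
        fi
        printf '.'
        sleep 1
    done
    echo "pg_ctl: server does not shut down"
    exit 1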
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
On 2/8/16 2:45 PM, Tom Lane wrote:
> Alvaro Herrera writes:
>> Tom Lane wrote:
>>> What I'd like to do to investigate this is put in a temporary HEAD-only
>>> patch that makes ShutdownXLOG() and its subroutines much chattier about
>>> how far they've gotten and what time it is, and also makes pg_ctl print
>>> out the current time if it gives up waiting.
>> Seems like a reasonable place to start. I assume you'll be installing
>> some debug-elog calls, enabled by default, and then once the problem is
>> fixed, we'll change the default to disabled but keep the actual calls?
> I had in mind to just "git revert" the patch when we're done with it.
> There might be some parts we want to keep (for instance, I'm thinking of
> logging "postmaster is done" after we've unlinked the pidfile, which
> might be useful permanently). But I plan to err on the side of noisiness
> for the moment, rather than make something I think has long-term value.

It's already difficult enough for DBAs to debug some performance issues,
so getting rid of logging is a step backwards. I realize it's unlikely
that you could run DEBUG2 in a prod environment, but we still shouldn't
be reducing visibility.

--
Jim Nasby, Data Architect, Blue Treble Consulting, Austin TX
Experts in Analytics, Data Architecture and PostgreSQL
Data in Trouble? Get it in Treble! http://BlueTreble.com
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
On 02/09/2016 08:49 PM, Tom Lane wrote:
> Andrew Dunstan writes:
>> On 02/09/2016 07:49 PM, Tom Lane wrote:
>>> However, I'd already noted from some other digging in the buildfarm
>>> logs that axolotl's speed seems to vary tremendously. I do not
>>> know what else you typically run on that hardware, but putting it
>>> under full load might help prove things.
>> Almost nothing. It's a Raspberry Pi 2B that I got mainly to run a
>> buildfarm animal. About the only other thing of note is a very lightly
>> configured Nagios instance.
> Huh, that's quite strange. There is one fairly recent report of axolotl
> failing "make check" because of taking over a minute to shut down:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-14%2020%3A30%3A52
> but the runs before and after that show shutdown times of only a second
> or two. No idea why.

anyway, we got a failure pretty quickly:

pg_ctl: server does not shut down at 2016-02-09 21:10:11.914 EST
pg_regress: could not stop postmaster: exit code was 256
Makefile:81: recipe for target 'check' failed
make: *** [check] Error 2

The log traces from the shutdown are below.

cheers

andrew

LOG: received fast shutdown request at 2016-02-09 21:09:11.824 EST
LOG: aborting any active transactions
LOG: autovacuum launcher shutting down at 2016-02-09 21:09:11.830 EST
LOG: shutting down at 2016-02-09 21:09:11.839 EST
LOG: checkpoint starting: shutdown immediate
LOG: CheckPointGuts starting at 2016-02-09 21:09:11.840 EST
LOG: CheckPointCLOG() done at 2016-02-09 21:09:11.840 EST
LOG: CheckPointCommitTs() done at 2016-02-09 21:09:11.840 EST
LOG: CheckPointSUBTRANS() done at 2016-02-09 21:09:11.841 EST
LOG: CheckPointMultiXact() done at 2016-02-09 21:09:11.841 EST
LOG: CheckPointPredicate() done at 2016-02-09 21:09:11.841 EST
LOG: CheckPointRelationMap() done at 2016-02-09 21:09:11.841 EST
LOG: CheckPointReplicationSlots() done at 2016-02-09 21:09:11.841 EST
LOG: CheckPointSnapBuild() done at 2016-02-09 21:09:11.841 EST
LOG: CheckPointLogicalRewriteHeap() done at 2016-02-09 21:09:11.842 EST
LOG: BufferSync(5) beginning to write 172 buffers at 2016-02-09 21:09:11.845 EST
LOG: BufferSync(5) done, wrote 172/172 buffers at 2016-02-09 21:09:14.635 EST
LOG: CheckPointBuffers() done at 2016-02-09 21:09:14.636 EST
LOG: CheckPointReplicationOrigin() done at 2016-02-09 21:09:14.636 EST
LOG: CheckPointGuts done at 2016-02-09 21:09:14.636 EST
LOG: checkpoint WAL record flushed at 2016-02-09 21:09:14.636 EST
LOG: pg_control updated at 2016-02-09 21:09:14.637 EST
LOG: smgrpostckpt() done at 2016-02-09 21:09:14.668 EST
LOG: RemoveOldXlogFiles() done at 2016-02-09 21:09:14.668 EST
LOG: TruncateSUBTRANS() done at 2016-02-09 21:09:14.669 EST
LOG: checkpoint complete: wrote 172 buffers (1.0%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=2.794 s, sync=0.000 s, total=2.829 s; sync files=0, longest=0.000 s, average=0.000 s; distance=966 kB, estimate=966 kB
LOG: shutdown checkpoint complete at 2016-02-09 21:09:14.669 EST
LOG: ShutdownCLOG() complete at 2016-02-09 21:09:14.669 EST
LOG: ShutdownCommitTs() complete at 2016-02-09 21:09:14.669 EST
LOG: ShutdownSUBTRANS() complete at 2016-02-09 21:09:14.669 EST
LOG: database system is shut down at 2016-02-09 21:09:14.669 EST
LOG: doing before_shmem_exit 0 at 2016-02-09 21:09:14.670 EST
LOG: doing on_shmem_exit 2 at 2016-02-09 21:09:14.670 EST
LOG: doing on_shmem_exit 1 at 2016-02-09 21:09:14.670 EST
LOG: doing on_shmem_exit 0 at 2016-02-09 21:09:14.670 EST
LOG: doing on_proc_exit 1 at 2016-02-09 21:09:14.670 EST
LOG: doing on_proc_exit 0 at 2016-02-09 21:09:14.671 EST
LOG: calling exit(0) at 2016-02-09 21:09:14.671 EST
LOG: checkpointer dead at 2016-02-09 21:09:14.683 EST
LOG: all children dead at 2016-02-09 21:10:11.184 EST
LOG: doing on_shmem_exit 3 at 2016-02-09 21:10:11.191 EST
LOG: doing on_shmem_exit 2 at 2016-02-09 21:10:11.191 EST
LOG: doing on_shmem_exit 1 at 2016-02-09 21:10:11.192 EST
LOG: doing on_shmem_exit 0 at 2016-02-09 21:10:11.208 EST
LOG: doing on_proc_exit 1 at 2016-02-09 21:10:11.209 EST
LOG: doing on_proc_exit 0 at 2016-02-09 21:10:11.209 EST
LOG: lock files all released at 2016-02-09 21:10:11.211 EST
LOG: calling exit(0) at 2016-02-09 21:10:11.211 EST
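One way to locate where a shutdown like this stalls is to print the delta
between consecutive log lines (a sketch; it assumes each line begins with a
timestamp, as with a log_line_prefix of '%m ', so fields one and two are the
date and time, and it ignores midnight wraparound):

    # Print the seconds elapsed since the previous log line, largest first.
    awk '{ split($2, t, "[:.]")
           now = t[1]*3600 + t[2]*60 + t[3] + t[4]/1000
           if (NR > 1) printf "%8.3f  %s\n", now - prev, $0
           prev = now }' postmaster.log | sort -rn | head

On a log with such prefixes, the 56-second gap before "all children dead"
would float straight to the top.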
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
Andrew Dunstan writes:
> On 02/09/2016 07:49 PM, Tom Lane wrote:
>> However, I'd already noted from some other digging in the buildfarm
>> logs that axolotl's speed seems to vary tremendously. I do not
>> know what else you typically run on that hardware, but putting it
>> under full load might help prove things.
> Almost nothing. It's a Raspberry Pi 2B that I got mainly to run a
> buildfarm animal. About the only other thing of note is a very lightly
> configured Nagios instance.

Huh, that's quite strange. There is one fairly recent report of axolotl
failing "make check" because of taking over a minute to shut down:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-14%2020%3A30%3A52

but the runs before and after that show shutdown times of only a second
or two.

regards, tom lane
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
On 02/09/2016 07:49 PM, Tom Lane wrote:
> Andrew Dunstan writes:
>> So it's not running with fsync off or using the ramdisk for
>> stats_temp_directory. Of course, that doesn't explain why we're not
>> seeing it on branches earlier than 9.5, but it could explain why we're
>> only seeing it on the ecpg tests.
> BTW, the evidence we already have from axolotl's last run is that
> post-checkpoint shutdown in the ecpg test was pretty quick:
>
> LOG: database system is shut down at 2016-02-09 16:31:14.784 EST
> LOG: lock files all released at 2016-02-09 16:31:14.817 EST
>
> However, I'd already noted from some other digging in the buildfarm
> logs that axolotl's speed seems to vary tremendously. I do not
> know what else you typically run on that hardware, but putting it
> under full load might help prove things.

Almost nothing. It's a Raspberry Pi 2B that I got mainly to run a
buildfarm animal. About the only other thing of note is a very lightly
configured Nagios instance. Of course, I could put it under continuous
load running a compilation or something.

cheers

andrew
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
Andrew Dunstan writes:
> So it's not running with fsync off or using the ramdisk for
> stats_temp_directory. Of course, that doesn't explain why we're not
> seeing it on branches earlier than 9.5, but it could explain why we're
> only seeing it on the ecpg tests.

BTW, the evidence we already have from axolotl's last run is that
post-checkpoint shutdown in the ecpg test was pretty quick:

LOG: database system is shut down at 2016-02-09 16:31:14.784 EST
LOG: lock files all released at 2016-02-09 16:31:14.817 EST

However, I'd already noted from some other digging in the buildfarm
logs that axolotl's speed seems to vary tremendously. I do not
know what else you typically run on that hardware, but putting it
under full load might help prove things.

regards, tom lane
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
I wrote:
> Anyway, I think I should push this additional instrumentation so you
> can use it on axolotl.

Done.

regards, tom lane
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
Andrew Dunstan writes:
> Incidentally, as I noted earlier, the ecpg tests don't honour
> TEMP_CONFIG, and in axolotl's case this could well make a difference, as
> it is set up like this:
> ...
> So it's not running with fsync off or using the ramdisk for
> stats_temp_directory.

Oooohh ... I had just been looking into what else the postmaster does
after "database system is shut down" comes out. One of those activities
is telling the stats collector to shut down, and then waiting for it to
do so. So a really slow write of the collected stats might possibly
account for delaying things here.

What I was doing was adding some more logging in the
post-shutdown-checkpoint area, and this is what I see on dromedary, when
shutting down just after a regression test run:

LOG: database system is shut down at 2016-02-09 19:12:29.497 EST
LOG: doing before_shmem_exit 0 at 2016-02-09 19:12:29.498 EST
LOG: doing on_shmem_exit 2 at 2016-02-09 19:12:29.498 EST
LOG: doing on_shmem_exit 1 at 2016-02-09 19:12:29.498 EST
LOG: doing on_shmem_exit 0 at 2016-02-09 19:12:29.498 EST
LOG: doing on_proc_exit 1 at 2016-02-09 19:12:29.498 EST
LOG: doing on_proc_exit 0 at 2016-02-09 19:12:29.498 EST
LOG: calling exit(0) at 2016-02-09 19:12:29.498 EST
LOG: checkpointer dead at 2016-02-09 19:12:32.382 EST
LOG: all children dead at 2016-02-09 19:12:32.387 EST
LOG: doing on_shmem_exit 3 at 2016-02-09 19:12:32.387 EST
LOG: doing on_shmem_exit 2 at 2016-02-09 19:12:32.387 EST
LOG: doing on_shmem_exit 1 at 2016-02-09 19:12:32.387 EST
LOG: doing on_shmem_exit 0 at 2016-02-09 19:12:32.396 EST
LOG: doing on_proc_exit 1 at 2016-02-09 19:12:32.396 EST
LOG: doing on_proc_exit 0 at 2016-02-09 19:12:32.396 EST
LOG: lock files all released at 2016-02-09 19:12:32.397 EST
LOG: calling exit(0) at 2016-02-09 19:12:32.397 EST

The first batch of those "on_shmem/proc_exit" reports are from the
checkpointer process, and the second set are from the postmaster. The
stats collector shutdown would be between "checkpointer dead" and "all
children dead". We can see that on this machine, that's not really an
issue ... but how in the world did it take the postmaster nigh three
seconds to notice the checkpointer process exit? Something very odd
indeed there.

Anyway, I think I should push this additional instrumentation so you
can use it on axolotl.

This also makes it look like we really need to revisit where/when the
"database system is shut down" message is printed. But that's for later.

regards, tom lane
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
On 02/09/2016 06:46 PM, Andrew Dunstan wrote:
> On 02/09/2016 05:53 PM, Tom Lane wrote:
>> Andrew, I wonder if I could prevail on you to make axolotl run "make
>> check" on HEAD in src/interfaces/ecpg/ until it fails, so that we can
>> see if the logging I added tells anything useful about this.
> Will do.

Incidentally, as I noted earlier, the ecpg tests don't honour
TEMP_CONFIG, and in axolotl's case this could well make a difference, as
it is set up like this:

extra_config => {
    DEFAULT => [
        q(log_line_prefix = '%m [%c:%l] '),
        "log_connections = 'true'",
        "log_disconnections = 'true'",
        "log_statement = 'all'",
        "fsync = off",
        "stats_temp_directory='/home/andrew/bf/root/stats_temp/$branch'"
    ],
},

So it's not running with fsync off or using the ramdisk for
stats_temp_directory. Of course, that doesn't explain why we're not
seeing it on branches earlier than 9.5, but it could explain why we're
only seeing it on the ecpg tests.

cheers

andrew
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
On 02/09/2016 05:53 PM, Tom Lane wrote:
> Andrew, I wonder if I could prevail on you to make axolotl run "make
> check" on HEAD in src/interfaces/ecpg/ until it fails, so that we can
> see if the logging I added tells anything useful about this.

Will do.

cheers

andrew
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
I wrote:
> ... However, there is something else happening
> on axolotl. Looking at the HEAD and 9.5 branches, there are three very
> similar failures in the ECPG step within the past 60 days:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-15%2018%3A49%3A31
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-12%2001%3A44%3A39

Oh look, searching with the "failures" BF webpage turns up a fourth such
failure on sungazer:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-01-18%2014%3A45%3A15

The plot thickens. I still have no clue what's going wrong though.

regards, tom lane
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
I wrote:
> I'm not sure whether there's anything to be gained by leaving the tracing
> code in there till we see actual buildfarm fails. There might be another
> slowdown mechanism somewhere, but I rather doubt it. Thoughts?

Hmmm ... I take that back. AFAICT, the failures on Noah's AIX zoo are
sufficiently explained by the "mdpostckpt takes a long time after the
regression tests" theory. However, there is something else happening
on axolotl. Looking at the HEAD and 9.5 branches, there are three very
similar failures in the ECPG step within the past 60 days:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-15%2018%3A49%3A31
http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2015-12-12%2001%3A44%3A39

In all three, we got "pg_ctl: server does not shut down", but the
postmaster log claims that it shut down, and pretty speedily too. For
example, in the 2015-12-12 failure:

LOG: received fast shutdown request
LOG: aborting any active transactions
LOG: autovacuum launcher shutting down
LOG: shutting down
LOG: checkpoint starting: shutdown immediate
LOG: checkpoint complete: wrote 176 buffers (1.1%); 0 transaction log file(s) added, 0 removed, 0 recycled; write=0.039 s, sync=0.000 s, total=0.059 s; sync files=0, longest=0.000 s, average=0.000 s; distance=978 kB, estimate=978 kB
LOG: database system is shut down

We have no theory that would account for postmaster shutdown stalling
after the end of ShutdownXLOG, but that seems to be what happened. How
come? Why does only the ECPG test seem to be affected?

It's also pretty fishy that we have three failures in 60 days on
HEAD+9.5 but none before that, and none in the older branches. That
smells like a recently-introduced bug, though I have no idea what.

Andrew, I wonder if I could prevail on you to make axolotl run "make
check" on HEAD in src/interfaces/ecpg/ until it fails, so that we can
see if the logging I added tells anything useful about this.

regards, tom lane
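For the record, a loop of that sort can be as simple as the following
sketch (how to preserve the failing run's logs is left to the operator):

    # Re-run the ecpg suite until it fails; the loop exits on the first
    # failure, leaving that run's regression logs in place for inspection.
    cd src/interfaces/ecpg || exit 1
    while make check; do :; done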
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
On 02/09/2016 03:05 PM, Tom Lane wrote:
> I wrote:
>> In any case, we should proceed with fixing things so that buildfarm owners
>> can specify a higher shutdown timeout for especially slow critters.
> I looked into doing this as I suggested yesterday, namely modifying the
> buildfarm scripts, and soon decided that it would be a mess; there are
> too many cases where "pg_ctl stop" is not invoked directly by the script.
> I'm now in favor of applying the PGCTLTIMEOUT patch Noah proposed, and
> *removing* the two existing hacks in run_build.pl that try to force -t 120.
>
> The only real argument I can see against that approach is that we'd have
> to back-patch the PGCTLTIMEOUT patch to all active branches if we want
> to stop the buildfarm failures. We don't usually back-patch feature
> additions. On the other hand, this wouldn't be the first time we've
> back-patched something on grounds of helping the buildfarm, so I find
> that argument pretty weak.

OK. I can put out a new release as required.

cheers

andrew
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
I wrote:
> In any case, we should proceed with fixing things so that buildfarm owners
> can specify a higher shutdown timeout for especially slow critters.

I looked into doing this as I suggested yesterday, namely modifying the
buildfarm scripts, and soon decided that it would be a mess; there are
too many cases where "pg_ctl stop" is not invoked directly by the script.
I'm now in favor of applying the PGCTLTIMEOUT patch Noah proposed, and
*removing* the two existing hacks in run_build.pl that try to force -t 120.

The only real argument I can see against that approach is that we'd have
to back-patch the PGCTLTIMEOUT patch to all active branches if we want
to stop the buildfarm failures. We don't usually back-patch feature
additions. On the other hand, this wouldn't be the first time we've
back-patched something on grounds of helping the buildfarm, so I find
that argument pretty weak.

regards, tom lane
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
I wrote:
> Noah Misch writes:
>> On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
>>> We've seen variants
>>> on this theme on half a dozen machines just in the past week --- and it
>>> seems to mostly happen in 9.5 and HEAD, which is fishy.
>> It has been affecting only the four AIX animals, which do share hardware.
>> (Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)
> Certainly your AIX critters have shown this a bunch, but here's another
> current example:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23

While we've not seen an actual test failure since I installed the tracing
code, we do have reports from the four AIX critters, and they are pretty
instructive. See for example

http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-09%2016%3A27%3A05&stg=stopdb-C-2

which is

Snapshot: 2016-02-09 16:27:05

waiting for server to shut down at 2016-02-09 18:04:09.159 UTC done
server stopped
=== db log file ==
2016-02-09 18:04:09.159 UTC [56ba2944.ee01bc:2] LOG: received fast shutdown request at 2016-02-09 18:04:09.159 UTC
2016-02-09 18:04:09.159 UTC [56ba2944.ee01bc:3] LOG: aborting any active transactions
2016-02-09 18:04:09.160 UTC [56ba2944.14501d8:2] LOG: autovacuum launcher shutting down at 2016-02-09 18:04:09.160 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:40] LOG: shutting down at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:41] LOG: CheckPointGuts starting at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:42] LOG: CheckPointCLOG() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:43] LOG: CheckPointCommitTs() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:44] LOG: CheckPointSUBTRANS() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:45] LOG: CheckPointMultiXact() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:46] LOG: CheckPointPredicate() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:47] LOG: CheckPointRelationMap() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.258 UTC [56ba2944.363000e:48] LOG: CheckPointReplicationSlots() done at 2016-02-09 18:04:09.258 UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:49] LOG: CheckPointSnapBuild() done at 2016-02-09 18:04:09.259 UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:50] LOG: CheckPointLogicalRewriteHeap() done at 2016-02-09 18:04:09.259 UTC
2016-02-09 18:04:09.259 UTC [56ba2944.363000e:51] LOG: BufferSync(5) beginning to write 632 buffers at 2016-02-09 18:04:09.259 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:52] LOG: BufferSync(5) done, wrote 632/632 buffers at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:53] LOG: CheckPointBuffers() done at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:54] LOG: CheckPointReplicationOrigin() done at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.966 UTC [56ba2944.363000e:55] LOG: CheckPointGuts done at 2016-02-09 18:04:09.966 UTC
2016-02-09 18:04:09.967 UTC [56ba2944.363000e:56] LOG: checkpoint WAL record flushed at 2016-02-09 18:04:09.967 UTC
2016-02-09 18:04:09.967 UTC [56ba2944.363000e:57] LOG: pg_control updated at 2016-02-09 18:04:09.967 UTC
2016-02-09 18:04:29.375 UTC [56ba2944.363000e:58] LOG: in mdpostckpt, 3072 unlinks remain to do at 2016-02-09 18:04:29.375 UTC
2016-02-09 18:04:48.207 UTC [56ba2944.363000e:59] LOG: in mdpostckpt, 2048 unlinks remain to do at 2016-02-09 18:04:48.207 UTC
2016-02-09 18:05:07.381 UTC [56ba2944.363000e:60] LOG: in mdpostckpt, 1024 unlinks remain to do at 2016-02-09 18:05:07.381 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:61] LOG: smgrpostckpt() done at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:62] LOG: RemoveOldXlogFiles() done at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:63] LOG: TruncateSUBTRANS() done at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:64] LOG: shutdown checkpoint complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:65] LOG: ShutdownCLOG() complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:66] LOG: ShutdownCommitTs() complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:67] LOG: ShutdownSUBTRANS() complete at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:25.764 UTC [56ba2944.363000e:68] LOG: database system is shut down at 2016-02-09 18:05:25.764 UTC
2016-02-09 18:05:26.165 UTC [56ba2944.ee01bc:4] LOG: lock files all released at 2016-02-09 18:05:26.165 UTC

This says that the bulk of the shutdown time
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
On 02/08/2016 10:55 PM, Tom Lane wrote:
> Noah Misch writes:
>> On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
>>> We've seen variants
>>> on this theme on half a dozen machines just in the past week --- and it
>>> seems to mostly happen in 9.5 and HEAD, which is fishy.
>> It has been affecting only the four AIX animals, which do share hardware.
>> (Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)
> Certainly your AIX critters have shown this a bunch, but here's another
> current example:
> http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23
>> That's reasonable. If you would like higher-fidelity data, I can run loops of
>> "pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and I could run
>> that for HEAD and 9.2 simultaneously. A day of logs from that should show
>> clearly if HEAD is systematically worse than 9.2.
> That sounds like a fine plan, please do it.
>> So, I wish to raise the timeout for those animals. Using an environment
>> variable was a good idea; it's one less thing for test authors to remember.
>> Since the variable affects a performance-related fudge factor rather than
>> change behavior per se, I'm less skittish than usual about unintended
>> consequences of dynamic scope. (With said unintended consequences in mind,
>> I made "pg_ctl register" ignore PGCTLTIMEOUT rather than embed its value
>> into the service created.)
> While this isn't necessarily a bad idea in isolation, the current
> buildfarm scripts explicitly specify a -t value to pg_ctl stop, which I
> would not expect an environment variable to override. So we need to fix
> the buildfarm script to allow the timeout to be configurable. I'm not
> sure if there would be any value-add in having pg_ctl answer to an
> environment variable once we've done that.

The failure on axolotl was for the ECPG tests, which don't use the
buildfarm's startup/stop db code. They don't honour TEMP_CONFIG either,
which they probably should - then we might get better log traces.

cheers

andrew
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
Noah Misch writes:
> On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
>> We've seen variants
>> on this theme on half a dozen machines just in the past week --- and it
>> seems to mostly happen in 9.5 and HEAD, which is fishy.
> It has been affecting only the four AIX animals, which do share hardware.
> (Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)

Certainly your AIX critters have shown this a bunch, but here's another
current example:

http://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=axolotl&dt=2016-02-08%2014%3A49%3A23

> That's reasonable. If you would like higher-fidelity data, I can run loops of
> "pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and I could run
> that for HEAD and 9.2 simultaneously. A day of logs from that should show
> clearly if HEAD is systematically worse than 9.2.

That sounds like a fine plan, please do it.

> So, I wish to raise the timeout for those animals. Using an environment
> variable was a good idea; it's one less thing for test authors to remember.
> Since the variable affects a performance-related fudge factor rather than
> change behavior per se, I'm less skittish than usual about unintended
> consequences of dynamic scope. (With said unintended consequences in mind, I
> made "pg_ctl register" ignore PGCTLTIMEOUT rather than embed its value into
> the service created.)

While this isn't necessarily a bad idea in isolation, the current
buildfarm scripts explicitly specify a -t value to pg_ctl stop, which I
would not expect an environment variable to override. So we need to fix
the buildfarm script to allow the timeout to be configurable. I'm not
sure if there would be any value-add in having pg_ctl answer to an
environment variable once we've done that.

regards, tom lane
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
On Mon, Feb 08, 2016 at 02:15:48PM -0500, Tom Lane wrote:
> Of late, by far the majority of the random-noise failures we see in the
> buildfarm have come from failure to shut down the postmaster in a
> reasonable timeframe.
> We've seen variants
> on this theme on half a dozen machines just in the past week --- and it
> seems to mostly happen in 9.5 and HEAD, which is fishy.

It has been affecting only the four AIX animals, which do share hardware.
(Back in 2015 and once in 2016-01, it did affect axolotl and shearwater.)
I agree the concentration on 9.5 and HEAD is suspicious; while those
branches get the most buildfarm runs, that factor by itself doesn't
explain the distribution of failures among versions.

> What I'd like to do to investigate this is put in a temporary HEAD-only
> patch that makes ShutdownXLOG() and its subroutines much chattier about
> how far they've gotten and what time it is, and also makes pg_ctl print
> out the current time if it gives up waiting. A few failed runs with
> that in place will at least allow us to confirm or deny whether it's
> just that the shutdown checkpoint is sometimes really slow, or whether
> there's a bug lurking.
>
> Any objections? Anybody have another idea for data to collect?

That's reasonable. If you would like higher-fidelity data, I can run
loops of "pg_ctl -w start; make installcheck; pg_ctl -t900 -w stop", and
I could run that for HEAD and 9.2 simultaneously. A day of logs from
that should show clearly if HEAD is systematically worse than 9.2.

By the way, you would almost surely qualify for an account on this
machine.

I had drafted the following message and patch last week, and I suppose
it belongs in this thread:

On Mon, Oct 12, 2015 at 06:41:06PM -0400, Tom Lane wrote:
> I'm not sure if this will completely fix our problems with "pg_ctl start"
> related buildfarm failures on very slow critters. It does get rid of the
> hard wired 5-second timeout, but the 60-second timeout could still be an
> issue. I think Noah was considering a patch to allow that number to be
> raised. I'd be in favor of letting pg_ctl accept a default timeout length
> from an environment variable, and then the slower critters could be fixed
> by adjusting their buildfarm configurations.

Your commit 6bcce25 made src/bin test suites stop failing due to pg_ctl
startup timeouts, but other suites have been failing on the AIX buildfarm
zoo due to slow shutdown. Example taking 72s to even reach ShutdownXLOG():

http://www.pgbuildfarm.org/cgi-bin/show_log.pl?nm=sungazer&dt=2016-01-15%2012%3A43%3A00

So, I wish to raise the timeout for those animals. Using an environment
variable was a good idea; it's one less thing for test authors to remember.
Since the variable affects a performance-related fudge factor rather than
change behavior per se, I'm less skittish than usual about unintended
consequences of dynamic scope. (With said unintended consequences in mind,
I made "pg_ctl register" ignore PGCTLTIMEOUT rather than embed its value
into the service created.)

Thanks,
nm

diff --git a/doc/src/sgml/ref/pg_ctl-ref.sgml b/doc/src/sgml/ref/pg_ctl-ref.sgml
index eaa0cc8..6ceb781 100644
--- a/doc/src/sgml/ref/pg_ctl-ref.sgml
+++ b/doc/src/sgml/ref/pg_ctl-ref.sgml
@@ -362,7 +362,9 @@ PostgreSQL documentation
         The maximum number of seconds to wait when waiting for startup or
-        shutdown to complete. The default is 60 seconds.
+        shutdown to complete. Defaults to the value of the
+        PGCTLTIMEOUT environment variable or, if not set, to 60
+        seconds.
@@ -487,6 +489,17 @@ PostgreSQL documentation
+     PGCTLTIMEOUT
+
+
+
+      Default limit on the number of seconds to wait when waiting for startup
+      or shutdown to complete. If not set, the default is 60 seconds.
+
+
+
+
+
      PGDATA
diff --git a/src/bin/pg_ctl/pg_ctl.c b/src/bin/pg_ctl/pg_ctl.c
index 9da38c4..bae6c22 100644
--- a/src/bin/pg_ctl/pg_ctl.c
+++ b/src/bin/pg_ctl/pg_ctl.c
@@ -72,6 +72,7 @@ typedef enum
 static bool do_wait = false;
 static bool wait_set = false;
 static int  wait_seconds = DEFAULT_WAIT;
+static bool wait_seconds_arg = false;
 static bool silent_mode = false;
 static ShutdownMode shutdown_mode = FAST_MODE;
 static int  sig = SIGINT;       /* default */
@@ -1431,7 +1432,8 @@ pgwin32_CommandLine(bool registration)
 	if (registration && do_wait)
 		appendPQExpBuffer(cmdLine, " -w");

-	if (registration && wait_seconds != DEFAULT_WAIT)
+	/* Don't propagate a value from an environment variable. */
+	if (registration && wait_seconds_arg && wait_seconds != DEFAULT_WAIT)
 		appendPQExpBuffer(cmdLine, " -t %d", wait_seconds);

 	if (registration && silent_mode)
@@ -2128,6 +2130,7 @@ main(int argc, char **argv)
 		{NULL, 0, NULL, 0}
 	};

+	char	   *env_wait;
 	int			option_
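Spelled out, the measurement loop proposed above is roughly this (a sketch;
the data directory and log handling are assumptions):

    # Start, run installcheck, stop with a generous timeout; repeat.
    # Checkpoint timings accumulate in the server log via log_checkpoints.
    while :; do
        pg_ctl -D "$PGDATA" -w -l server.log start
        make installcheck
        pg_ctl -D "$PGDATA" -t 900 -w stop
    done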
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
On 02/08/2016 02:15 PM, Tom Lane wrote:
> Of late, by far the majority of the random-noise failures we see in the
> buildfarm have come from failure to shut down the postmaster in a
> reasonable timeframe. An example is this current failure on hornet:
>
> http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55
>
> waiting for server to shut down... failed
> pg_ctl: server does not shut down
> === db log file ==
> 2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:2] LOG: received fast shutdown request
> 2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:3] LOG: aborting any active transactions
> 2016-02-08 15:14:35.783 UTC [56b8b00d.7e0028:2] LOG: autovacuum launcher shutting down
> 2016-02-08 15:14:35.865 UTC [56b8b00d.2e5000a:1] LOG: shutting down
>
> The buildfarm script runs pg_ctl stop with -t 120, and counting the dots
> shows that pg_ctl was honoring that, so apparently the postmaster took
> more than 2 minutes to shut down. Looking at other recent successful
> runs, stopdb-C-1 usually takes 30 to 40 or so seconds on this machine.
> So it's possible that it was just so overloaded that it took three times
> that much time on this run, but I'm starting to think there may be more
> to it than that. We've seen variants on this theme on half a dozen
> machines just in the past week --- and it seems to mostly happen in 9.5
> and HEAD, which is fishy.
>
> The fact that we got "shutting down" but not "database system is shut
> down" indicates that the server was in ShutdownXLOG(). Normally the only
> component of that that takes much time is buffer flushing, but could
> something else be happening? Or perhaps the checkpoint code has somehow
> not gotten the word to do its flushing at full speed?
>
> What I'd like to do to investigate this is put in a temporary HEAD-only
> patch that makes ShutdownXLOG() and its subroutines much chattier about
> how far they've gotten and what time it is, and also makes pg_ctl print
> out the current time if it gives up waiting. A few failed runs with that
> in place will at least allow us to confirm or deny whether it's just that
> the shutdown checkpoint is sometimes really slow, or whether there's a
> bug lurking.
>
> Any objections? Anybody have another idea for data to collect?

I think that's an excellent idea. This has been a minor running sore for
ages on slow machines.

cheers

andrew
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
Alvaro Herrera writes:
> Tom Lane wrote:
>> What I'd like to do to investigate this is put in a temporary HEAD-only
>> patch that makes ShutdownXLOG() and its subroutines much chattier about
>> how far they've gotten and what time it is, and also makes pg_ctl print
>> out the current time if it gives up waiting.
> Seems like a reasonable place to start. I assume you'll be installing
> some debug-elog calls, enabled by default, and then once the problem is
> fixed, we'll change the default to disabled but keep the actual calls?

I had in mind to just "git revert" the patch when we're done with it.
There might be some parts we want to keep (for instance, I'm thinking of
logging "postmaster is done" after we've unlinked the pidfile, which
might be useful permanently). But I plan to err on the side of noisiness
for the moment, rather than make something I think has long-term value.

regards, tom lane
Re: [HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
Tom Lane wrote:
> Of late, by far the majority of the random-noise failures we see in the
> buildfarm have come from failure to shut down the postmaster in a
> reasonable timeframe.

I noticed that.

> An example is this current failure on hornet:
>
> http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55

FWIW you seem to have edited this URL -- it returns a blank page. The
right one is

http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55&stg=stopdb-C-1

> What I'd like to do to investigate this is put in a temporary HEAD-only
> patch that makes ShutdownXLOG() and its subroutines much chattier about
> how far they've gotten and what time it is, and also makes pg_ctl print
> out the current time if it gives up waiting. A few failed runs with
> that in place will at least allow us to confirm or deny whether it's
> just that the shutdown checkpoint is sometimes really slow, or whether
> there's a bug lurking.
>
> Any objections? Anybody have another idea for data to collect?

Seems like a reasonable place to start. I assume you'll be installing
some debug-elog calls, enabled by default, and then once the problem is
fixed, we'll change the default to disabled but keep the actual calls?

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
[HACKERS] Tracing down buildfarm "postmaster does not shut down" failures
Of late, by far the majority of the random-noise failures we see in the
buildfarm have come from failure to shut down the postmaster in a
reasonable timeframe. An example is this current failure on hornet:

http://buildfarm.postgresql.org/cgi-bin/show_stage_log.pl?nm=hornet&dt=2016-02-08%2013%3A41%3A55

waiting for server to shut down... failed
pg_ctl: server does not shut down
=== db log file ==
2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:2] LOG: received fast shutdown request
2016-02-08 15:14:35.783 UTC [56b8b00d.9e00e2:3] LOG: aborting any active transactions
2016-02-08 15:14:35.783 UTC [56b8b00d.7e0028:2] LOG: autovacuum launcher shutting down
2016-02-08 15:14:35.865 UTC [56b8b00d.2e5000a:1] LOG: shutting down

The buildfarm script runs pg_ctl stop with -t 120, and counting the dots
shows that pg_ctl was honoring that, so apparently the postmaster took
more than 2 minutes to shut down. Looking at other recent successful
runs, stopdb-C-1 usually takes 30 to 40 or so seconds on this machine.
So it's possible that it was just so overloaded that it took three times
that much time on this run, but I'm starting to think there may be more
to it than that. We've seen variants on this theme on half a dozen
machines just in the past week --- and it seems to mostly happen in 9.5
and HEAD, which is fishy.

The fact that we got "shutting down" but not "database system is shut
down" indicates that the server was in ShutdownXLOG(). Normally the only
component of that that takes much time is buffer flushing, but could
something else be happening? Or perhaps the checkpoint code has somehow
not gotten the word to do its flushing at full speed?

What I'd like to do to investigate this is put in a temporary HEAD-only
patch that makes ShutdownXLOG() and its subroutines much chattier about
how far they've gotten and what time it is, and also makes pg_ctl print
out the current time if it gives up waiting. A few failed runs with that
in place will at least allow us to confirm or deny whether it's just that
the shutdown checkpoint is sometimes really slow, or whether there's a
bug lurking.

Any objections? Anybody have another idea for data to collect?

regards, tom lane