Yes, the bootstrap shell script is broken. I filed https://issues.apache.org/jira/browse/COUCHDB-1885 but that has a stupid title and doesn't quite capture how broken it is. Basically, some of the -k/-s logic got borked a while back and so IIRC you can't request a graceful restart of CouchDB via the shell script (you have to kill the beam process *yourself* and then the script will reload it).
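In shell terms, the manual dance looks something like this. This is only a sketch of the workaround, not a fix for the bootstrap logic itself; the pidfile path is a made-up example (use whatever you pass via `-p`), and note that with `-heart` active the heart program may race you to the restart:

```shell
#!/bin/sh
# Sketch of the manual workaround described above: since the script's
# -k/-s handling is broken, signal the beam VM directly and let the
# still-running wrapper (started with `-b -r 5`) respawn it.
# The pidfile path below is an example; use whatever -p you gave it.

restart_beam() {
    pidfile=$1
    [ -f "$pidfile" ] || { echo "no pidfile at $pidfile" >&2; return 1; }
    # SIGTERM the Erlang VM; the wrapper sees it exit and relaunches it.
    kill "$(cat "$pidfile")" 2>/dev/null
}

# e.g.: restart_beam production_couch/couch.pid
```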
That aside, I don't think that is related in this case. At least the last time this instance went down, the Erlang process _was still running_, just not accepting network connections. So from the shell script's perspective, it didn't see the need to restart.

hth,
-natevw

On Oct 31, 2013, at 9:30 PM, Jim Klo <[email protected]> wrote:

> I noticed this myself (the bootstrap shell script not working). I vaguely
> recall determining that the watchdog process doesn't correctly monitor
> the pid file. The logic in general was off - basically there's an edge
> condition not accounted for. I don't remember if I fixed the script or not -
> I'd have to hunt through my notes when I get back to a real computer.
> Something tells me I wrapped it in a cron to clean up and restart, as I was
> under a timeline before the world nearly came to an end earlier this month.
>
> Jim Klo
> Senior Software Engineer
> SRI International
> t: @nsomnac
>
> On Oct 31, 2013, at 5:19 PM, "Nathan Vander Wilt"
> <[email protected]> wrote:
>
> Okay, I may have figured out why the shell script isn't restarting Couch. It
> seems it may not actually die all the way. I can't connect to it, but there
> is a process matching the pidfile:
>
> 6417 ? Sl 14:34 /home/ubuntu/bc2/build/lib/erlang/erts-5.10.2/bin/beam
> -Bd -K true -A 4 -- -root /home/ubuntu/bc2/build/lib/erlang -progname erl
> -- -home /home/ubuntu -- -noshell -noinput -os_mon start_memsup false
> start_cpu_sup false disk_space_check_interval 1 disk_almost_full_threshold 1
> -sasl errlog_type error
> -couch_ini bc2/build/etc/couchdb/default.ini production_couch/local.ini
> -s couch -pidfile production_couch/couch.pid -heart
>
> hth,
> -nvw
>
> On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt
> <[email protected]> wrote:
>
> Aaaand my Couch committed suicide again today. Unless this is something
> different, I may have finally gotten lucky and had CouchDB leave a note
> [eerily unfinished!]
> in the logs this time:
> https://gist.github.com/natevw/fd509978516499ba128b
>
> ```
> ** Reason == {badarg,
>     [{io,put_chars,
>         [<0.93.0>,unicode,
>          <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>]
> 66.249.66.216 - - GET
> /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor
> 200\n">>],
>      []},
> ```
>
> So… now what? I have a rebuilt version of CouchDB I'm going to try [once I
> figure out why *it* isn't starting], but this is still really upsetting. I'm
> aware I could add my own cronjob or something to check and restart if needed
> every minute, but a) the shell script is SUPPOSED to be keeping CouchDB up,
> b) it's NOT, and c) this is embarrassing and aggravating.
>
> thanks,
> -natevw
>
> On Oct 29, 2013, at 9:42 AM, Nathan Vander Wilt
> <[email protected]> wrote:
>
> I am starting CouchDB 1.4.0 using `bc2/build/bin/couchdb -b -r 5 […output and
> configuration options…]` and keep pulling up my sites finding them dead too.
> Seems to be about the same thing as others are reporting in this old
> thread… was there any resolution?
>
> This is not an OOM thing: in dmesg I do see some killed processes (node) but
> never couchdb/beam, and NOTHING killed after I added swap several days
> ago. CouchDB was dead again this morning.
>
> The only trace of trouble in the logs is in couch.stderr:
>
> ```
> heart_beat_kill_pid = 32575
> heart_beat_timeout = 11
> heart: Sat Oct  5 02:59:16 2013: heart-beat time-out, no activity for 12 seconds
> Killed
> heart: Sat Oct  5 02:59:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>
> heart_beat_kill_pid = 13781
> heart_beat_timeout = 11
> heart: Tue Oct 22 19:50:40 2013: heart-beat time-out, no activity for 15 seconds
> Killed
> heart: Tue Oct 22 19:51:11 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>
> heart_beat_kill_pid = 15292
> heart_beat_timeout = 11
> heart: Tue Oct 29 12:33:17 2013: heart-beat time-out, no activity for 14 seconds
> Killed
> heart: Tue Oct 29 12:33:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>
> heart_beat_kill_pid = 29158
> heart_beat_timeout = 11
> ```
>
> 1. What are these "heart-beat time-out" logs about? Is that a clue to the
>    trouble?
> 2. Regardless, why isn't the shell script restarting CouchDB after 5 seconds
>    like I told it to?
>
> `erlang:display(erlang:system_info(otp_release)).` says R15B
>
> thanks,
> -natevw
>
> On Sep 13, 2013, at 3:20 PM, James Marca
> <[email protected]> wrote:
>
> I am seeing a lot of random, silent crashes on just *one* of my
> CouchDB servers.
>
> couchdb version 1.4.0 (gentoo ebuild)
>
> erlang also from gentoo ebuild:
> Erlang (BEAM) emulator version 5.10.2
> Compiled on Fri Sep 13 08:39:20 2013
> Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8]
> [async-threads:10] [kernel-poll:false]
>
> I've got 3 servers running couchdb, A, B, C, and only B is crashing.
> All of them are replicating a single db between them, with B acting as
> the "hub": A pushes to B, B pushes to both A and C, and C pushes to B.
>
> All three servers have data-crunching jobs running that are reading
> and writing to the database that is being replicated around.
>
> The B server, the one in the middle that is push-replicating to both A
> and C, is the one that is crashing.
>
> The log looks like this:
>
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET
> /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET
> /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started
> on http://0.0.0.0:5984/
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start
> replication `84213867ea04ca187d64dbf447660e52+continuous+create_target`
> (document `carb_grid_state4k_push_emma64`).
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start
> replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document
> `carb_grid_state5k_hpms_push_kitty`).
>
> There is no warning that the server died or why, and nothing in
> /var/log/messages about anything untoward happening (no OOM killer
> invoked or anything like that).
>
> The restart only happened because I manually did a
> /etc/init.d/couchdb restart
> Usually couchdb restarts itself, but not with this crash.
>
> I flipped the log to debug level, and still had no warning about the crash:
>
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST'
> /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
> Headers: [{'Accept',"application/json"},
>     {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
>     {'Content-Length',"346"},
>     {'Content-Type',"application/json"},
>     {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
>     {'User-Agent',"CouchDB/1.4.0"},
>     {"X-Couch-Full-Commit","false"}]
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc batch
> of size 128531 bytes
>
> And that was it. CouchDB was down and out.
>
> I even tried shutting off the data processing (so as to reduce the db
> load) on box B, but that didn't help (all the crashing has put it far
> behind in replicating to boxes A and C).
>
> My guess is that the replication load is too big (too many
> connections, too much data being pushed in), but I would expect some
> sort of warning before the server dies.
>
> Any clues or suggestions would be appreciated. I am currently going
> to try compiling from source directly, but I don't have much faith that
> it will make a difference.
>
> Thanks,
> James Marca
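P.S. In the meantime, here is the kind of cron-driven check I'm tempted to add myself. It is a sketch only, not the bundled watchdog: the pidfile path and URL are assumptions for your install, and it needs `curl`. The point is that it treats CouchDB as down if EITHER the pid is gone OR the HTTP port stops answering, which covers the "beam still running but not accepting connections" state the pid-only check misses:

```shell
#!/bin/sh
# Health check sketch: healthy only if the recorded pid is alive AND
# the HTTP endpoint actually responds. Pidfile path and URL are
# examples, not the stock script's defaults.

couch_healthy() {
    pidfile=$1
    url=${2:-http://127.0.0.1:5984/}
    pid=$(cat "$pidfile" 2>/dev/null) || return 1
    kill -0 "$pid" 2>/dev/null || return 1     # process still alive?
    curl -sf --max-time 5 "$url" >/dev/null    # actually serving HTTP?
}

# Crontab sketch (hypothetical restart script):
# * * * * * couch_healthy /var/run/couchdb/couch.pid || /path/to/restart-couch.sh
```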
