Okay, I may have figured out why the shell script isn't restarting Couch. It seems it may not actually be dying all the way: I can't connect to it, but there is a process matching the pidfile:
6417 ?  Sl  14:34 /home/ubuntu/bc2/build/lib/erlang/erts-5.10.2/bin/beam -Bd -K true -A 4 -- -root /home/ubuntu/bc2/build/lib/erlang -progname erl -- -home /home/ubuntu -- -noshell -noinput -os_mon start_memsup false start_cpu_sup false disk_space_check_interval 1 disk_almost_full_threshold 1 -sasl errlog_type error -couch_ini bc2/build/etc/couchdb/default.ini production_couch/local.ini -s couch -pidfile production_couch/couch.pid -heart

hth,
-nvw

On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt <[email protected]> wrote:

> Aaaand my Couch committed suicide again today. Unless this is something different, I may have finally gotten lucky and had CouchDB leave a note [eerily unfinished!] in the logs this time: https://gist.github.com/natevw/fd509978516499ba128b
>
> ```
> ** Reason == {badarg,
>                  [{io,put_chars,
>                       [<0.93.0>,unicode,
>                        <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>] 66.249.66.216 - - GET /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor 200\n">>],
>                       []},
> ```
>
> So…now what? I have a rebuilt version of CouchDB I'm going to try [once I figure out why *it* isn't starting] but this is still really upsetting — I'm aware I could add my own cronjob or something to check and restart if needed every minute, but a) the shell script is SUPPOSED to be keeping CouchDB running and b) it's NOT and c) this is embarrassing and aggravating.
>
> thanks,
> -natevw
>
>
> On Oct 29, 2013, at 9:42 AM, Nathan Vander Wilt <[email protected]> wrote:
>
>> I am starting CouchDB 1.4.0 using `bc2/build/bin/couchdb -b -r 5 […output and configuration options…]` and keep pulling up my sites finding them dead too. Seems to be about the same thing as others are reporting in this old thread…was there any resolution?
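A quick way to tell this half-dead state apart from a clean death is to check both the pidfile *and* an actual HTTP round-trip, since the pidfile alone clearly isn't enough. A rough sketch; the pidfile path and port here are from my setup above, so adjust as needed:

```shell
#!/bin/sh
# Two independent liveness checks for CouchDB:
#   1. is there a live process with the pid recorded in the pidfile?
#   2. does the server actually answer HTTP?
# A wedged beam passes check 1 but fails check 2.
PIDFILE=production_couch/couch.pid
URL=http://127.0.0.1:5984/

PID=$(cat "$PIDFILE" 2>/dev/null)
if [ -n "$PID" ] && kill -0 "$PID" 2>/dev/null; then
    echo "pidfile check: process $PID is alive"
else
    echo "pidfile check: no live process"
fi

if curl -s -m 5 "$URL" >/dev/null 2>&1; then
    echo "http check: responding"
else
    echo "http check: NOT responding"
fi
```

(`kill -0` sends no signal at all; it only reports whether the process exists and is signalable, which is exactly the check the pidfile machinery seems to be relying on.)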
>>
>> This is not an OOM thing: in dmesg I do see some killed processes (node), but never couchdb/beam, and NOTHING killed since I added swap several days ago. CouchDB was dead again this morning.
>>
>> The only trace of trouble in the logs is in couch.stderr:
>>
>> ```
>> heart_beat_kill_pid = 32575
>> heart_beat_timeout = 11
>> heart: Sat Oct  5 02:59:16 2013: heart-beat time-out, no activity for 12 seconds
>> Killed
>> heart: Sat Oct  5 02:59:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>>
>> heart_beat_kill_pid = 13781
>> heart_beat_timeout = 11
>> heart: Tue Oct 22 19:50:40 2013: heart-beat time-out, no activity for 15 seconds
>> Killed
>> heart: Tue Oct 22 19:51:11 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>>
>> heart_beat_kill_pid = 15292
>> heart_beat_timeout = 11
>> heart: Tue Oct 29 12:33:17 2013: heart-beat time-out, no activity for 14 seconds
>> Killed
>> heart: Tue Oct 29 12:33:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>>
>> heart_beat_kill_pid = 29158
>> heart_beat_timeout = 11
>> ```
>>
>> 1. What are these "heart-beat time-out" logs about? Is that a clue to the trouble?
>> 2. Regardless, why isn't the shell script restarting CouchDB after 5 seconds like I told it to?
>>
>> `erlang:display(erlang:system_info(otp_release)).` says R15B
>>
>> thanks,
>> -natevw
>>
>>
>>
>> On Sep 13, 2013, at 3:20 PM, James Marca <[email protected]> wrote:
>>
>>> I am seeing a lot of random, silent crashes on just *one* of my CouchDB servers.
>>>
>>> couchdb version 1.4.0 (gentoo ebuild)
>>>
>>> erlang also from gentoo ebuild:
>>> Erlang (BEAM) emulator version 5.10.2
>>> Compiled on Fri Sep 13 08:39:20 2013
>>> Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8] [async-threads:10] [kernel-poll:false]
>>>
>>> I've got 3 servers running couchdb, A, B, C, and only B is crashing.
>>> All of them are replicating a single db between them, with B acting as
>>> the "hub"...A pushes to B, B pushes to both A and C, and C pushes to B.
>>>
>>> All three servers have data crunching jobs running that are reading
>>> and writing to the database that is being replicated around.
>>>
>>> The B server, the one in the middle that is push replicating to both A
>>> and C, is the one that is crashing.
>>>
>>> The log looks like this:
>>>
>>> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
>>> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
>>> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started on http://0.0.0.0:5984/
>>> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `84213867ea04ca187d64dbf447660e52+continuous+create_target` (document `carb_grid_state4k_push_emma64`).
>>> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document `carb_grid_state5k_hpms_push_kitty`).
>>>
>>> No warning that the server died or why, and nothing in /var/log/messages
>>> about anything untoward happening (no OOM killer invoked or anything like that).
>>>
>>> The restart only happened because I manually did a
>>> /etc/init.d/couchdb restart
>>> Usually couchdb restarts itself, but not with this crash.
>>>
>>>
>>> I flipped the log to debug level, and still had no warning about the crash:
>>>
>>> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST' /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
>>> Headers: [{'Accept',"application/json"},
>>>           {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
>>>           {'Content-Length',"346"},
>>>           {'Content-Type',"application/json"},
>>>           {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
>>>           {'User-Agent',"CouchDB/1.4.0"},
>>>           {"X-Couch-Full-Commit","false"}]
>>> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
>>> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc batch of size 128531 bytes
>>>
>>> And that was it. CouchDB was down and out.
>>>
>>> I even tried shutting off the data processing (so as to reduce the db
>>> load) on box B, but that didn't help (all the crashing has put it far
>>> behind in replicating to boxes A and C).
>>>
>>> My guess is that the replication load is too big (too many
>>> connections, too much data being pushed in), but I would expect some
>>> sort of warning before the server dies.
>>>
>>> Any clues or suggestions would be appreciated. I am currently going
>>> to try compiling from source directly, but I don't have much faith that
>>> it will make a difference.
>>>
>>> Thanks,
>>> James Marca
>>>
>>> --
>>> This message has been scanned for viruses and
>>> dangerous content by MailScanner, and is
>>> believed to be clean.
>>>
>
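To close the loop for anyone finding this thread later: the "heart" lines in couch.stderr above come from Erlang's heart monitor, which kills the VM when it gets no heartbeat within heart_beat_timeout seconds and then runs its configured restart command (here apparently `couchdb -k`, which is evidently exiting with 256 instead of bringing the server back). Until the real cause is found, the cronjob stopgap mentioned earlier in the thread would look roughly like this. A sketch only: the binary path, pidfile path, and `-b` flag are taken from the commands quoted above and will differ on other setups.

```shell
#!/bin/sh
# Cron stopgap (e.g. "* * * * * /path/to/this"): if CouchDB stops answering
# HTTP, kill whatever is left behind per the pidfile and start it again.
# COUCHDB/PIDFILE/URL match the setup in this thread; adjust for yours.
COUCHDB=/home/ubuntu/bc2/build/bin/couchdb
PIDFILE=production_couch/couch.pid
URL=http://127.0.0.1:5984/

# Healthy? Then there is nothing to do.
if curl -s -m 10 "$URL" >/dev/null 2>&1; then
    exit 0
fi

# Not responding: clear out any half-dead beam still holding the pidfile...
PID=$(cat "$PIDFILE" 2>/dev/null)
[ -n "$PID" ] && kill -9 "$PID" 2>/dev/null

# ...then start a fresh background instance (-b, as in the original command).
if [ -x "$COUCHDB" ]; then
    exec "$COUCHDB" -b
fi
```

Note this only papers over the symptom; it does nothing about whatever is starving the emulator badly enough that heart stops hearing from it in the first place.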
