I am starting CouchDB 1.4.0 using `bc2/build/bin/couchdb -b -r 5 […output and configuration options…]` and I also keep finding my sites dead when I pull them up. This seems to be the same thing others are reporting in this old thread. Was there any resolution?

This is not an OOM thing: in dmesg I do see some killed processes (node), but never couchdb/beam, and NOTHING has been killed since I added swap several days ago. CouchDB was dead again this morning. The only trace of trouble in the logs is in couch.stderr:

```
heart_beat_kill_pid = 32575
heart_beat_timeout = 11
heart: Sat Oct 5 02:59:16 2013: heart-beat time-out, no activity for 12 seconds
Killed
heart: Sat Oct 5 02:59:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
heart_beat_kill_pid = 13781
heart_beat_timeout = 11
heart: Tue Oct 22 19:50:40 2013: heart-beat time-out, no activity for 15 seconds
Killed
heart: Tue Oct 22 19:51:11 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
heart_beat_kill_pid = 15292
heart_beat_timeout = 11
heart: Tue Oct 29 12:33:17 2013: heart-beat time-out, no activity for 14 seconds
Killed
heart: Tue Oct 29 12:33:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
heart_beat_kill_pid = 29158
heart_beat_timeout = 11
```

1. What are these "heart-beat time-out" logs about? Is that a clue to the trouble?
2. Regardless, why isn't the shell script restarting CouchDB after 5 seconds like I told it to?

`erlang:display(erlang:system_info(otp_release)).` says R15B

thanks,
-natevw


On Sep 13, 2013, at 3:20 PM, James Marca <[email protected]> wrote:

> I am seeing a lot of random, silent crashes on just *one* of my
> CouchDB servers.
>
> couchdb version 1.4.0 (gentoo ebuild)
>
> erlang also from gentoo ebuild:
> Erlang (BEAM) emulator version 5.10.2
> Compiled on Fri Sep 13 08:39:20 2013
> Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8]
> [async-threads:10] [kernel-poll:false]
>
> I've got 3 servers running couchdb, A, B, C, and only B is crashing.
> All of them are replicating a single db between them, with B acting as
> the "hub"...A pushes to B, B pushes to both A and C, and C pushes to
> B.
>
> All three servers have data crunching jobs running that are reading
> and writing to the database that is being replicated around.
>
> The B server, the one in the middle that is push replicating to both A
> and C, is the one that is crashing.
>
> The log looks like this:
>
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET
> /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET
> /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started
> on http://0.0.0.0:5984/
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start
> replication `84213867ea04ca187d64dbf447660e52+continuous+create_target`
> (document `carb_grid_state4k_push_emma64`).
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start
> replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document
> `carb_grid_state5k_hpms_push_kitty`).
>
> no warning that the server died or why, and nothing in the
> /var/log/messages about anything untoward happening (no OOM killer
> invoked or anything like that)
>
> The restart only happened because I manually did a
> /etc/init.d/couchdb restart
> Usually couchdb restarts itself, but not with this crash.
>
>
>
> I flipped the log to debug level, and still had no warning about the crash:
>
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST'
> /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
> Headers: [{'Accept',"application/json"},
> {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
> {'Content-Length',"346"},
> {'Content-Type',"application/json"},
> {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
> {'User-Agent',"CouchDB/1.4.0"},
> {"X-Couch-Full-Commit","false"}]
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc batch
> of size 128531 bytes
>
> And that was it. CouchDB was down and out.
>
> I even tried shutting off the data processing (so as to reduce the db
> load) on box B, but that didn't help (all the crashing has put it far
> behind in replicating box A and C).
>
> My guess is that the replication load is too big (too many
> connections, too much data being pushed in), but I would expect some
> sort of warning before the server dies.
>
> Any clues or suggestions would be appreciated. I am currently going
> to try compiling from source directly, but I don't have much faith that
> it will make a difference.
>
> Thanks,
> James Marca
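
In case it helps anyone line up these crashes against other logs (dmesg, couch.log), here is a minimal Python sketch that pulls the "heart-beat time-out" entries out of couch.stderr text like the excerpt at the top of this message. The names `heart_timeouts`, `HEART_TIMEOUT` and `SAMPLE` are made up for illustration, and the pattern only assumes the line format shown in that excerpt; it is not based on any documented log format.

```python
import re
from datetime import datetime

# Pattern inferred from the couch.stderr excerpt in this thread (an assumption,
# not a documented format): "heart: <ctime-style date>: heart-beat time-out,
# no activity for N seconds".
HEART_TIMEOUT = re.compile(
    r"heart: (\w{3} \w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2} \d{4}): "
    r"heart-beat time-out, no activity for (\d+) seconds"
)


def heart_timeouts(text):
    """Return a list of (timestamp, idle_seconds) for each time-out entry."""
    events = []
    for match in HEART_TIMEOUT.finditer(text):
        # Timestamps carry no time zone, so these are naive datetimes.
        when = datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")
        events.append((when, int(match.group(2))))
    return events


# The three time-outs quoted in this thread, used as sample input.
SAMPLE = """\
heart_beat_kill_pid = 32575
heart_beat_timeout = 11
heart: Sat Oct 5 02:59:16 2013: heart-beat time-out, no activity for 12 seconds
Killed
heart: Sat Oct 5 02:59:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
heart_beat_kill_pid = 13781
heart_beat_timeout = 11
heart: Tue Oct 22 19:50:40 2013: heart-beat time-out, no activity for 15 seconds
Killed
heart: Tue Oct 22 19:51:11 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
heart_beat_kill_pid = 15292
heart_beat_timeout = 11
heart: Tue Oct 29 12:33:17 2013: heart-beat time-out, no activity for 14 seconds
Killed
heart: Tue Oct 29 12:33:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
"""

if __name__ == "__main__":
    for when, idle in heart_timeouts(SAMPLE):
        print(when.isoformat(), "idle for", idle, "seconds")
```

To run it against a real file, pass the file's contents instead of `SAMPLE`, e.g. `heart_timeouts(open("couch.stderr").read())` (the path here is a placeholder). It uses `finditer` over the whole text, so it works whether or not the line breaks in the log survived.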
