Yes, the bootstrap shell script is broken. I filed 
https://issues.apache.org/jira/browse/COUCHDB-1885 but that has a stupid title 
and doesn't quite capture how broken it is. Basically, some of the -k/-s logic 
got borked a while back, so IIRC you can't request a graceful restart of 
CouchDB via the shell script (you have to kill the beam process *yourself*, and 
then the script will reload it).

That aside, I don't think that is related in this case. At least the last time 
this instance went down, the Erlang process _was still running_, just not 
accepting network connections. So from the shell script's perspective, it 
didn't see any need to restart.
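
If you need a stopgap in the meantime, that wedged state (beam alive but not 
answering) is easy to catch from cron. Below is a minimal sketch of the 
decision logic only; the health-check details, port 5984, and the pidfile 
handling are assumptions on my part, not what the bootstrap script actually 
does:

```shell
#!/bin/sh
# Hypothetical watchdog sketch (NOT the actual bootstrap script): decide
# what a once-a-minute cron job should do, given an HTTP health check
# (e.g. curl http://127.0.0.1:5984/) and a pidfile liveness check
# (kill -0 "$(cat couch.pid)"). The "kill-beam" branch is the wedged case
# described above: beam is alive but not answering, so killing it lets a
# `couchdb -b -r 5` wrapper respawn it.
watchdog_action() {
    http_ok=$1      # "yes" if the HTTP check succeeded
    beam_alive=$2   # "yes" if the pidfile points at a live process
    if [ "$http_ok" = yes ]; then
        echo noop          # healthy, nothing to do
    elif [ "$beam_alive" = yes ]; then
        echo kill-beam     # wedged: kill it so the wrapper respawns it
    else
        echo restart       # fully dead: start it again ourselves
    fi
}

watchdog_action no yes   # the wedged case: prints "kill-beam"
```

In the kill-beam branch a real job would do something like 
`kill "$(cat couch.pid)"` and let the wrapper's `-r 5` respawn take over; I 
haven't run this against the broken script, so treat it as a sketch.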

hth,
-natevw


On Oct 31, 2013, at 9:30 PM, Jim Klo <[email protected]> wrote:

> I noticed this myself (the bootstrap shell script not working). I vaguely 
> recall determining that the watchdog process doesn't correctly monitor 
> the pid file. The logic in general was off; basically there's an edge 
> condition not accounted for. I don't remember if I fixed the script or not. 
> I'd have to hunt through my notes when I get back to a real computer. 
> Something tells me I wrapped it in a cron job to clean up and restart, as I 
> was on a deadline before the world nearly came to an end earlier this month.
> 
> Jim Klo
> Senior Software Engineer
> SRI International
> t: @nsomnac
> 
> On Oct 31, 2013, at 5:19 PM, "Nathan Vander Wilt" 
> <[email protected]> wrote:
> 
> Okay, I may have figured out why the shell script isn't restarting Couch. It 
> seems CouchDB may not actually die all the way. I can't connect to it, but 
> there is a process matching the pidfile:
> 
> 6417 ?        Sl    14:34 
> /home/ubuntu/bc2/build/lib/erlang/erts-5.10.2/bin/beam -Bd -K true -A 4 -- 
> -root /home/ubuntu/bc2/build/lib/erlang -progname erl -- -home /home/ubuntu 
> -- -noshell -noinput -os_mon start_memsup false start_cpu_sup false 
> disk_space_check_interval 1 disk_almost_full_threshold 1 -sasl errlog_type 
> error -couch_ini bc2/build/etc/couchdb/default.ini production_couch/local.ini 
> -s couch -pidfile production_couch/couch.pid -heart
> 
> hth,
> -nvw
> 
> 
> 
> On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt 
> <[email protected]> wrote:
> 
> Aaaand my Couch committed suicide again today. Unless this is something 
> different, I may have finally gotten lucky and had CouchDB leave a note 
> [eerily unfinished!] in the logs this time:
> https://gist.github.com/natevw/fd509978516499ba128b
> 
> ```
> ** Reason == {badarg,
>                [{io,put_chars,
>                     [<0.93.0>,unicode,
>                      <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>] 
> 66.249.66.216 - - GET 
> /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor
>  200\n">>],
>                     []},
> ```
> 
> So…now what? I have a rebuilt version of CouchDB I'm going to try [once I 
> figure out why *it* isn't starting], but this is still really upsetting. I'm 
> aware I could add my own cronjob or something to check and restart if needed 
> every minute, but a) the shell script is SUPPOSED to be keeping CouchDB 
> alive, b) it's NOT, and c) this is embarrassing and aggravating.
> 
> thanks,
> -natevw
> 
> 
> On Oct 29, 2013, at 9:42 AM, Nathan Vander Wilt 
> <[email protected]> wrote:
> 
> I am starting CouchDB 1.4.0 using `bc2/build/bin/couchdb -b -r 5 […output and 
> configuration options…]` and keep pulling up my sites only to find them dead. 
> Seems to be about the same thing as others are reporting in this old 
> thread…was there any resolution?
> 
> This is not an OOM thing: in dmesg I do see some killed processes (node), but 
> never couchdb/beam, and NOTHING has been killed since I added swap several 
> days ago. CouchDB was dead again this morning.
> 
> The only trace of trouble in the logs is in couch.stderr:
> 
> ```
> heart_beat_kill_pid = 32575
> heart_beat_timeout = 11
> heart: Sat Oct  5 02:59:16 2013: heart-beat time-out, no activity for 12 
> seconds
> Killed
> heart: Sat Oct  5 02:59:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb 
> -k" -> 256. Terminating.
> 
> heart_beat_kill_pid = 13781
> heart_beat_timeout = 11
> heart: Tue Oct 22 19:50:40 2013: heart-beat time-out, no activity for 15 
> seconds
> Killed
> heart: Tue Oct 22 19:51:11 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb 
> -k" -> 256. Terminating.
> 
> heart_beat_kill_pid = 15292
> heart_beat_timeout = 11
> heart: Tue Oct 29 12:33:17 2013: heart-beat time-out, no activity for 14 
> seconds
> Killed
> heart: Tue Oct 29 12:33:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb 
> -k" -> 256. Terminating.
> 
> heart_beat_kill_pid = 29158
> heart_beat_timeout = 11
> ```
> 
> 1. What are these "heart-beat time-out" logs about? Is that a clue to the 
> trouble?
> 2. Regardless, why isn't the shell script restarting CouchDB after 5 seconds 
> like I told it to?
> 
> `erlang:display(erlang:system_info(otp_release)).`  says R15B
> 
> thanks,
> -natevw
> 
> 
> 
> On Sep 13, 2013, at 3:20 PM, James Marca 
> <[email protected]> wrote:
> 
> I am seeing a lot of random, silent crashes on just *one* of my
> CouchDB servers.
> 
> couchdb version 1.4.0 (gentoo ebuild)
> 
> erlang also from gentoo ebuild:
> Erlang (BEAM) emulator version 5.10.2
> Compiled on Fri Sep 13 08:39:20 2013
> Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8]
> [async-threads:10] [kernel-poll:false]
> 
> I've got 3 servers running couchdb, A, B, C, and only B is crashing.
> All of them are replicating a single db between them, with B acting as 
> the "hub": A pushes to B, B pushes to both A and C, and C pushes to 
> B.
> 
> All three servers have data crunching jobs running that are reading
> and writing to the database that is being replicated around.
> 
> The B server, the one in the middle that is push replicating to both A
> and C, is the one that is crashing.
> 
> The log looks like this:
> 
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET 
> /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET 
> /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started 
> on http://0.0.0.0:5984/
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start 
> replication `84213867ea04ca187d64dbf447660e52+continuous+create_target` 
> (document `carb_grid_state4k_push_emma64`).
> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start 
> replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document 
> `carb_grid_state5k_hpms_push_kitty`).
> 
> There was no warning that the server died or why, and nothing in 
> /var/log/messages about anything untoward happening (no OOM killer 
> invoked or anything like that).
> 
> The restart only happened because I manually did an 
> `/etc/init.d/couchdb restart`. Usually couchdb restarts itself, but not 
> with this crash.
> 
> 
> 
> I flipped the log to debug level, and still had no warning about the crash:
> 
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST' 
> /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
> Headers: [{'Accept',"application/json"},
>       {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
>       {'Content-Length',"346"},
>       {'Content-Type',"application/json"},
>       {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
>       {'User-Agent',"CouchDB/1.4.0"},
>       {"X-Couch-Full-Commit","false"}]
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc batch 
> of size 128531 bytes
> 
> And that was it.  CouchDB was down and out.
> 
> I even tried shutting off the data processing (so as to reduce the db 
> load) on box B, but that didn't help (all the crashing has put it far 
> behind in replicating to boxes A and C).
> 
> My guess is that the replication load is too big (too many
> connections, too much data being pushed in), but I would expect some
> sort of warning before the server dies.
> 
> Any clues or suggestions would be appreciated. I am currently going 
> to try compiling from source directly, but I don't have much faith that 
> it will make a difference.
> 
> Thanks,
> James Marca
> 