Okay, I may have figured out why the shell script isn't restarting Couch. It 
seems CouchDB may not actually be dying all the way. I can't connect to it, 
but there is still a process matching the pidfile:

 6417 ?        Sl    14:34 
/home/ubuntu/bc2/build/lib/erlang/erts-5.10.2/bin/beam -Bd -K true -A 4 -- 
-root /home/ubuntu/bc2/build/lib/erlang -progname erl -- -home /home/ubuntu -- 
-noshell -noinput -os_mon start_memsup false start_cpu_sup false 
disk_space_check_interval 1 disk_almost_full_threshold 1 -sasl errlog_type 
error -couch_ini bc2/build/etc/couchdb/default.ini production_couch/local.ini 
-s couch -pidfile production_couch/couch.pid -heart
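
In case it's useful, this is roughly the check I'm using to tell that 
half-dead state apart from a clean crash (just a sketch; the paths and port 
are from my setup above):

```
# Rough check for the half-dead state: the pid from the pidfile is
# still alive, but the HTTP port no longer answers.
PIDFILE=production_couch/couch.pid
URL=http://127.0.0.1:5984/

if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    if curl -fsS --max-time 5 "$URL" >/dev/null 2>&1; then
        echo "beam alive, HTTP ok"
    else
        echo "beam alive, HTTP dead"
    fi
else
    echo "no live process behind pidfile"
fi
```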

hth,
-nvw



On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt <[email protected]> 
wrote:

> Aaaand my Couch committed suicide again today. Unless this is something 
> different, I may have finally gotten lucky and had CouchDB leave a note 
> [eerily unfinished!] in the logs this time:
> https://gist.github.com/natevw/fd509978516499ba128b
> 
> ```
> ** Reason == {badarg,
>                  [{io,put_chars,
>                       [<0.93.0>,unicode,
>                        <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] 
> [<0.31789.2>] 66.249.66.216 - - GET 
> /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor
>  200\n">>],
>                       []},
> ```
> 
> So…now what? I have a rebuilt version of CouchDB I'm going to try [once I 
> figure out why *it* isn't starting] but this is still really upsetting — I'm 
> aware I could add my own cronjob or something to check and restart if needed 
> every minute, but a) the shell script is SUPPOSED to be keeping CouchDB 
> alive, b) it's NOT, and c) this is embarrassing and aggravating.
> 
> thanks,
> -natevw
> 
> 
> On Oct 29, 2013, at 9:42 AM, Nathan Vander Wilt <[email protected]> 
> wrote:
> 
>> I am starting CouchDB 1.4.0 using `bc2/build/bin/couchdb -b -r 5 […output 
>> and configuration options…]` and keep pulling up my sites only to find them 
>> dead. This seems to be about the same thing as others are reporting in this 
>> old thread…was there any resolution?
>> 
>> This is not an OOM thing: in dmesg I do see some killed processes (node), 
>> but never couchdb/beam, and NOTHING has been killed since I added swap 
>> several days ago. CouchDB was dead again this morning.
>> 
>> The only trace of trouble in the logs is in couch.stderr:
>> 
>> ```
>> heart_beat_kill_pid = 32575
>> heart_beat_timeout = 11
>> heart: Sat Oct  5 02:59:16 2013: heart-beat time-out, no activity for 12 
>> seconds
>> Killed
>> heart: Sat Oct  5 02:59:18 2013: Executed 
>> "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>> 
>> heart_beat_kill_pid = 13781
>> heart_beat_timeout = 11
>> heart: Tue Oct 22 19:50:40 2013: heart-beat time-out, no activity for 15 
>> seconds
>> Killed
>> heart: Tue Oct 22 19:51:11 2013: Executed 
>> "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>> 
>> heart_beat_kill_pid = 15292
>> heart_beat_timeout = 11
>> heart: Tue Oct 29 12:33:17 2013: heart-beat time-out, no activity for 14 
>> seconds
>> Killed
>> heart: Tue Oct 29 12:33:18 2013: Executed 
>> "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
>> 
>> heart_beat_kill_pid = 29158
>> heart_beat_timeout = 11
>> ```
>> 
>> 1. What are these "heart-beat time-out" logs about? Is that a clue to the 
>> trouble?
>> 2. Regardless, why isn't the shell script restarting CouchDB after 5 seconds 
>> like I told it to?
>> 
>> `erlang:display(erlang:system_info(otp_release)).` says R15B
>> 
>> thanks,
>> -natevw
>> 
>> 
>> 
>> On Sep 13, 2013, at 3:20 PM, James Marca <[email protected]> wrote:
>> 
>>> I am seeing a lot of random, silent crashes on just *one* of my
>>> CouchDB servers.
>>> 
>>> couchdb version 1.4.0 (gentoo ebuild)
>>> 
>>> erlang also from gentoo ebuild: 
>>> Erlang (BEAM) emulator version 5.10.2
>>> Compiled on Fri Sep 13 08:39:20 2013
>>> Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8]
>>> [async-threads:10] [kernel-poll:false]
>>> 
>>> I've got 3 servers running couchdb, A, B, C, and only B is crashing.
>>> All of them are replicating a single db between them, with B acting as
>>> the "hub"...A pushes to B, B pushes to both A and C, and C pushes to
>>> B.
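>>> 
>>> For concreteness, the pushes are driven by documents in B's _replicator 
>>> database that look roughly like this (host name anonymized; this is a 
>>> from-memory sketch):
>>> 
>>> ```
>>> {
>>>   "_id": "carb_grid_state4k_push_emma64",
>>>   "source": "carb/grid/state4k/hpms",
>>>   "target": "http://boxA.example.org:5984/carb%2Fgrid%2Fstate4k%2Fhpms",
>>>   "continuous": true,
>>>   "create_target": true
>>> }
>>> ```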
>>> 
>>> All three servers have data crunching jobs running that are reading
>>> and writing to the database that is being replicated around.
>>> 
>>> The B server, the one in the middle that is push replicating to both A
>>> and C, is the one that is crashing.
>>> 
>>> The log looks like this:
>>> 
>>> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET 
>>> /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
>>> [Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET 
>>> /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
>>> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has 
>>> started on http://0.0.0.0:5984/
>>> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start 
>>> replication `84213867ea04ca187d64dbf447660e52+continuous+create_target` 
>>> (document `carb_grid_state4k_push_emma64`).
>>> [Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start 
>>> replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document 
>>> `carb_grid_state5k_hpms_push_kitty`).
>>> 
>>> No warning that the server died or why, and nothing in /var/log/messages 
>>> about anything untoward happening (no OOM killer invoked or anything like 
>>> that).
>>> 
>>> The restart only happened because I manually did 
>>> `/etc/init.d/couchdb restart`. 
>>> Usually couchdb restarts itself, but not with this crash.
>>> 
>>> 
>>> 
>>> I flipped the log to debug level, and still had no warning about the crash:
>>> 
>>> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST' 
>>> /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
>>> Headers: [{'Accept',"application/json"},
>>>         {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
>>>         {'Content-Length',"346"},
>>>         {'Content-Type',"application/json"},
>>>         {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
>>>         {'User-Agent',"CouchDB/1.4.0"},
>>>         {"X-Couch-Full-Commit","false"}]
>>> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
>>> [Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc 
>>> batch of size 128531 bytes
>>> 
>>> And that was it.  CouchDB was down and out.
>>> 
>>> I even tried shutting off the data processing on box B (so as to reduce 
>>> the db load), but that didn't help (all the crashing has put it far behind 
>>> in replicating to boxes A and C).
>>> 
>>> My guess is that the replication load is too big (too many
>>> connections, too much data being pushed in), but I would expect some
>>> sort of warning before the server dies.  
>>> 
>>> Any clues or suggestions would be appreciated. I am currently going to try 
>>> compiling from source directly, but I don't have much faith that it will 
>>> make a difference.
>>> 
>>> Thanks,
>>> James Marca
>>> 
>>> -- 
>>> This message has been scanned for viruses and
>>> dangerous content by MailScanner, and is
>>> believed to be clean.
>>> 
>> 
> 
