I noticed this myself (the bootstrap shell script not working). I vaguely recall determining that the watchdog process doesn't correctly monitor the pid file. The logic in general was off; basically there's an edge condition it doesn't account for. I don't remember whether I fixed the script; I'd have to hunt through my notes when I get back to a real computer. Something tells me I wrapped it in a cron job to clean up and restart, as I was under a deadline before the world nearly came to an end earlier this month.
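From memory, the cron cleanup was something along these lines. This is a sketch reconstructed from memory, not the actual script; the paths, the `alive` helper, and the `couchdb -b -p` flags are my guesses:

```shell
#!/bin/sh
# Watchdog sketch (hypothetical reconstruction; paths are guesses).
# Cron would run it every minute:  * * * * * /home/ubuntu/couch_watchdog.sh
PIDFILE=${PIDFILE:-/home/ubuntu/production_couch/couch.pid}
COUCHDB=${COUCHDB:-/home/ubuntu/bc2/build/bin/couchdb}

# A pid can still exist while CouchDB is wedged, so check HTTP too.
alive() {
    kill -0 "$1" 2>/dev/null \
        && curl -fs --max-time 5 http://127.0.0.1:5984/ >/dev/null
}

watchdog() {
    if [ ! -f "$PIDFILE" ] || ! alive "$(cat "$PIDFILE")"; then
        # Stale pidfile or wedged server: kill leftovers, clean up, restart.
        [ -f "$PIDFILE" ] && kill "$(cat "$PIDFILE")" 2>/dev/null
        rm -f "$PIDFILE"
        "$COUCHDB" -b -p "$PIDFILE"
    fi
}
```

The point of the HTTP probe is exactly the failure mode below: a beam process that matches the pidfile but no longer answers requests.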
Jim Klo
Senior Software Engineer
SRI International
t: @nsomnac

On Oct 31, 2013, at 5:19 PM, "Nathan Vander Wilt" <[email protected]> wrote:

Okay, I may have figured out why the shell script isn't restarting Couch. It seems it may not actually die all the way. I can't connect to it, but there is a process matching the pidfile:

```
6417 ?  Sl  14:34 /home/ubuntu/bc2/build/lib/erlang/erts-5.10.2/bin/beam -Bd -K true -A 4 -- -root /home/ubuntu/bc2/build/lib/erlang -progname erl -- -home /home/ubuntu -- -noshell -noinput -os_mon start_memsup false start_cpu_sup false disk_space_check_interval 1 disk_almost_full_threshold 1 -sasl errlog_type error -couch_ini bc2/build/etc/couchdb/default.ini production_couch/local.ini -s couch -pidfile production_couch/couch.pid -heart
```

hth,
-nvw

On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt <[email protected]> wrote:

Aaaand my Couch committed suicide again today. Unless this is something different, I may have finally gotten lucky and had CouchDB leave a note [eerily unfinished!] in the logs this time: https://gist.github.com/natevw/fd509978516499ba128b

```
** Reason == {badarg, [{io,put_chars,
    [<0.93.0>,unicode,
     <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>] 66.249.66.216 - - GET /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor 200\n">>],
    []},
```

So…now what? I have a rebuilt version of CouchDB I'm going to try [once I figure out why *it* isn't starting], but this is still really upsetting. I'm aware I could add my own cronjob or something to check and restart if needed every minute, but a) the shell script is SUPPOSED to be keeping CouchDB alive, b) it's NOT, and c) this is embarrassing and aggravating.
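If I do end up writing that cronjob, the check will have to probe HTTP rather than trust the pid, since the wedged process above still matches the pidfile. A hypothetical probe (pidfile path taken from my setup above, everything else assumed):

```shell
# Hypothetical liveness check: the beam pid can exist while HTTP is wedged.
PIDFILE=production_couch/couch.pid

if [ -f "$PIDFILE" ] && kill -0 "$(cat "$PIDFILE")" 2>/dev/null; then
    echo "beam process exists"
else
    echo "no beam process"
fi

if curl -fs --max-time 5 http://127.0.0.1:5984/ >/dev/null; then
    echo "HTTP responding"
else
    echo "HTTP not responding"
fi
```

In the broken state above this would print "beam process exists" followed by "HTTP not responding", which is the combination a naive pid check misses.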
thanks,
-natevw

On Oct 29, 2013, at 9:42 AM, Nathan Vander Wilt <[email protected]> wrote:

I am starting CouchDB 1.4.0 using `bc2/build/bin/couchdb -b -r 5 […output and configuration options…]` and keep pulling up my sites to find them dead too. Seems to be about the same thing as others are reporting in this old thread…was there any resolution?

This is not an OOM thing: in dmesg I do see some killed processes (node), but never couchdb/beam, and NOTHING has been killed since I added swap several days ago.

CouchDB was dead again this morning. The only trace of trouble in the logs is in couch.stderr:

```
heart_beat_kill_pid = 32575
heart_beat_timeout = 11
heart: Sat Oct 5 02:59:16 2013: heart-beat time-out, no activity for 12 seconds
Killed
heart: Sat Oct 5 02:59:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
heart_beat_kill_pid = 13781
heart_beat_timeout = 11
heart: Tue Oct 22 19:50:40 2013: heart-beat time-out, no activity for 15 seconds
Killed
heart: Tue Oct 22 19:51:11 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
heart_beat_kill_pid = 15292
heart_beat_timeout = 11
heart: Tue Oct 29 12:33:17 2013: heart-beat time-out, no activity for 14 seconds
Killed
heart: Tue Oct 29 12:33:18 2013: Executed "/home/ubuntu/bc2/build/bin/couchdb -k" -> 256. Terminating.
heart_beat_kill_pid = 29158
heart_beat_timeout = 11
```

1. What are these "heart-beat time-out" logs about? Is that a clue to the trouble?
2. Regardless, why isn't the shell script restarting CouchDB after 5 seconds like I told it to?

`erlang:display(erlang:system_info(otp_release)).` says R15B

thanks,
-natevw

On Sep 13, 2013, at 3:20 PM, James Marca <[email protected]> wrote:

I am seeing a lot of random, silent crashes on just *one* of my CouchDB servers.
couchdb version 1.4.0 (gentoo ebuild)
erlang also from gentoo ebuild:

```
Erlang (BEAM) emulator version 5.10.2
Compiled on Fri Sep 13 08:39:20 2013
Erlang R16B01 (erts-5.10.2) [source] [64-bit] [smp:8:8] [async-threads:10] [kernel-poll:false]
```

I've got 3 servers running couchdb (A, B, and C), and only B is crashing. All of them are replicating a single db between them, with B acting as the "hub": A pushes to B, B pushes to both A and C, and C pushes to B. All three servers have data-crunching jobs running that are reading and writing to the database that is being replicated around. The B server, the one in the middle that is push-replicating to both A and C, is the one that is crashing. The log looks like this:

```
[Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9164.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/95_232_2007-01-07%2000%3A00 404
[Fri, 13 Sep 2013 15:43:28 GMT] [info] [<0.9165.2>] 128.xxx.xx.xx - - GET /carb%2Fgrid%2Fstate4k%2fhpms/115_202_2007-01-07%2000%3A00 404
[Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.32.0>] Apache CouchDB has started on http://0.0.0.0:5984/
[Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `84213867ea04ca187d64dbf447660e52+continuous+create_target` (document `carb_grid_state4k_push_emma64`).
[Fri, 13 Sep 2013 15:48:23 GMT] [info] [<0.138.0>] Attempting to start replication `e663b72fa13b3f250a9b7214012c3dee+continuous` (document `carb_grid_state5k_hpms_push_kitty`).
```

There is no warning that the server died or why, and nothing in /var/log/messages about anything untoward happening (no OOM killer invoked or anything like that). The restart only happened because I manually did a /etc/init.d/couchdb restart. Usually couchdb restarts itself, but not with this crash.
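For reference, each push replication is driven by a `_replicator` document roughly like the following. The target URL here is a guess on my part; the `continuous` and `create_target` options match the replication IDs shown in the log:

```json
{
  "_id": "carb_grid_state4k_push_emma64",
  "source": "carb/grid/state4k/hpms",
  "target": "http://emma64.example.org:5984/carb%2Fgrid%2Fstate4k%2Fhpms",
  "continuous": true,
  "create_target": true
}
```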
I flipped the log to debug level, and still had no warning about the crash:

```
[Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] 'POST' /carb%2Fgrid%2Fstate4k%2Fhpms/_bulk_docs {1,1} from "128.xxx.xx.yy"
Headers: [{'Accept',"application/json"},
          {'Authorization',"Basic amFtZXM6eW9ndXJ0IHRvb3RocGFzdGUgc2hvZXM="},
          {'Content-Length',"346"},
          {'Content-Type',"application/json"},
          {'Host',"xxxxxxxx.xxx.xxx.xxx:5984"},
          {'User-Agent',"CouchDB/1.4.0"},
          {"X-Couch-Full-Commit","false"}]
[Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.28750.2>] OAuth Params: []
[Fri, 13 Sep 2013 21:57:15 GMT] [debug] [<0.175.0>] Worker flushing doc batch of size 128531 bytes
```

And that was it. CouchDB was down and out. I even tried shutting off the data processing on box B (so as to reduce the db load), but that didn't help (all the crashing has put it far behind in replicating to boxes A and C). My guess is that the replication load is too big (too many connections, too much data being pushed in), but I would expect some sort of warning before the server dies. Any clues or suggestions would be appreciated. I am currently going to try compiling from source directly, but I don't have much faith that it will make a difference.

Thanks,
James Marca
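P.S. If the load theory pans out, I may try throttling the replicator via local.ini. As I understand it these knobs exist in the 1.x replicator config; the values below are just a guess at a starting point, not tested settings:

```ini
[replicator]
; fewer parallel worker streams per replication
worker_processes = 2
; fewer HTTP connections per replication
http_connections = 10
```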
