On 11. November 2013 at 23:10:38, Nathan Vander Wilt ([email protected]) 
wrote:
>  
> Aaaaand this happened *again* over the weekend. This time I had  
> started CouchDB in a screen session, which was still running.  
> Again, it looked like both the shell script processes and the  
> beam one were both still running, just…no Couch.
>  
> I had debug logs going; the stdout record shows the logger dying  
> again, but not with any unicode-error-type event, just the last  
> log:
> https://gist.github.com/natevw/dcd4a9a973da01270735  
>  
> There is some "heart: Sat Nov 9 08:35:30 2013: heart-beat time-out,  
> no activity for 26 seconds" in the stderr log, but I'm not sure  
> whether it's related…there seem to be a few more heart-beat  
> time-outs than actual CouchDB server failures.

when the heartbeat times out, heart kills the unresponsive BEAM and the wrapper 
script restarts it - this is standard Erlang VM supervision, not anything 
CouchDB-specific.
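
For reference, a minimal sketch of the knobs involved - this is plain erl, not 
the actual couchdb wrapper, so check your bin/couchdb for the real invocation 
and values (the restart command below is hypothetical):

    # HEART_COMMAND is what heart runs once BEAM stops answering;
    # HEART_BEAT_TIMEOUT is how long (in seconds) it waits before acting.
    export HEART_COMMAND="/usr/local/bin/couchdb -b"   # hypothetical restart command
    export HEART_BEAT_TIMEOUT=30
    erl -heart ...

So those "heart-beat time-out" lines in your stderr log are heart deciding the 
VM was wedged and recycling it.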

> Any concrete suggestions…? This sucks. I'm burnt out poking  
> through debug logs on this, I'm embarrassed and angry every time  
> I discover my sites have been down for another day or two because  
> of this, and adding another layer of twine and baling wire in the  
> form of a _second_ shell watchdog script is not at all exciting  
> >:-(
>  
> regards,
> -natevw

It’s hard to offer useful help remotely (so many possibilities), but:

heart timeouts:
- long-running NIFs can do this by blocking the scheduler; in particular, do 
you have any large JSON docs moving in or out? Reverting to Erlang R14B01 or 
R14B04 may well resolve this, though you’d need to rebuild CouchDB.
- possibly HiPE. Assuming it’s Ubuntu (IIRC), you should be able to uninstall 
erlang-base-hipe and install erlang-base to rule that out - restarting CouchDB 
required, of course (see the sketch after this list).
- there has been some mention of a possible resource leak in continuous 
replication (which you have), but I’d not expect it to hang the BEAM, "just" 
crash it.
- are there any Erlang turds (erl_crash.dump) lying around? They contain some 
useful debug info.
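
If you want to try the HiPE swap and go dump-hunting, something like this - 
package names assume Ubuntu/Debian, so verify locally first:

    sudo apt-get remove erlang-base-hipe
    sudo apt-get install erlang-base
    # then restart couchdb so it's running on the non-HiPE emulator

    # BEAM writes erl_crash.dump into its current working directory:
    find / -name erl_crash.dump 2>/dev/null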

resources:
You’ve not mentioned what the OS is doing at the time - anything in 
/var/log/messages or dmesg or whatever? Is this a lean VPS or a “real box” 
getting timeouts? Personally I’d install collectd or ganglia etc., pump general 
OS metrics out to graphite for comparison, and also start collecting some 
Erlang VM metrics, w.r.t. the possible resource leak.
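
Even before a metrics pipeline is up, it’s worth grepping for the usual 
suspects - e.g. the OOM killer quietly shooting processes:

    dmesg | egrep -i 'oom|out of memory|killed process'
    egrep -i 'oom|out of memory' /var/log/messages /var/log/syslog 2>/dev/null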

I’ve not looked at recon for this (http://ferd.github.io/recon/) but it could 
be useful; right now I’d pick https://github.com/jsonified/estatsd and send 
Erlang VM stats to graphite, same as the OS stuff, and see what comes up when 
these issues occur.
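
As a taste of the numbers worth graphing, here’s the sort of thing you can poke 
at by hand from a remote shell - this assumes the node was started distributed 
(with a name and cookie), which I don’t believe the stock couchdb script does 
by default, so you may need to add -name/-setcookie to the erl args first:

    $ erl -sname debug -remsh couchdb@yourhost -setcookie YOURCOOKIE
    > erlang:memory().                    % total/processes/binary/ets breakdown
    > erlang:system_info(process_count).  % creeping upward => possible process leak
    > erlang:statistics(run_queue).       % persistently high => schedulers backed up

Those are exactly the kind of VM stats I’d have estatsd pump into graphite.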

A+
Dave
