Re: couchdb crashes silently

Nathan Vander Wilt Mon, 11 Nov 2013 14:11:45 -0800

Aaaaand this happened *again* over the weekend. This time I had started CouchDB
in a screen session, which was still running. Again, it looked like both the
shell script processes and the beam one were still running, just…no Couch.
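The symptom here (wrapper script and beam process both still alive, CouchDB not answering) is exactly what a check based only on process IDs cannot see. Below is a minimal Python sketch for telling the two conditions apart. It assumes a POSIX system and CouchDB on its default 127.0.0.1:5984; adjust this for your `bind_address` and `port` settings. `pid_alive`, `port_accepts`, and `classify` are hypothetical helper names, and the PID would come from wherever your setup records it.

```python
import os
import socket


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)  # signal 0 only checks for existence; nothing is delivered
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but is owned by another user.
        return True
    return True


def port_accepts(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify(pid: int, host: str = "127.0.0.1", port: int = 5984) -> str:
    """Combine both checks into a coarse state label (assumed default host/port)."""
    alive = pid_alive(pid)
    listening = port_accepts(host, port)
    if alive and listening:
        return "running"
    if alive:
        return "hung"  # process present, but nothing accepting connections
    if listening:
        return "stale-pid"  # something answers, but not the recorded process
    return "dead"
```

A bare TCP connect is still a weak test. The kernel can complete a connection on a listening socket even if the VM never services it, so an HTTP request that inspects the response body is stricter.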


I had debug logs going. The stdout record shows the logger dying again, but not
with any unicode-error-type event, just the last log entry:
https://gist.github.com/natevw/dcd4a9a973da01270735

There are some "heart: Sat Nov  9 08:35:30 2013: heart-beat time-out, no
activity for 26 seconds" messages in the stderr log, but I'm not sure whether
they're related: there seem to be a few more heart-beat time-outs than actual
CouchDB server failures.
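One way to check whether those time-outs line up with the outages is to pull their timestamps out of the stderr log and compare them with the times the sites were found down. Below is a minimal Python sketch. It assumes every time-out line has exactly the shape quoted above and that it runs in an English/C locale, since `strptime`'s `%a` and `%b` are locale-dependent. `heart_timeouts` is a hypothetical helper name.

```python
import re
from datetime import datetime

# Matches the heart message quoted above, e.g.
# "heart: Sat Nov  9 08:35:30 2013: heart-beat time-out, no activity for 26 seconds"
HEART_RE = re.compile(
    r"heart: (?P<when>\w{3} \w{3} [ \d]\d \d{2}:\d{2}:\d{2} \d{4}): "
    r"heart-beat time-out, no activity for (?P<secs>\d+) seconds"
)


def heart_timeouts(lines):
    """Yield (timestamp, seconds_without_activity) for each heart time-out line."""
    for line in lines:
        m = HEART_RE.search(line)
        if m is None:
            continue
        # Collapse the double space before a single-digit day ("Nov  9").
        stamp = " ".join(m.group("when").split())
        when = datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")
        yield when, int(m.group("secs"))
```

Feeding it the open stderr log file (at whatever path your setup writes it) gives a list of time-outs. You can then set that list against the known outage times.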

Any concrete suggestions? This sucks. I'm burnt out on poking through debug logs
for this, I'm embarrassed and angry every time I discover my sites have been
down for another day or two because of it, and adding another layer of twine
and baling wire in the form of a _second_ shell watchdog script is not at all
exciting >:-(

regards,
-natevw



On Nov 1, 2013, at 9:17 AM, Nathan Vander Wilt <[email protected]> wrote:

> 
> On Nov 1, 2013, at 12:10 AM, Dave Cottlehuber <[email protected]> wrote:
> 
>>> On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt wrote:
>>> 
>>> Aaaand my Couch committed suicide again today. Unless this is
>>> something different, I may have finally gotten lucky and had
>>> CouchDB leave a note [eerily unfinished!] in the logs this time:
>>> https://gist.github.com/natevw/fd509978516499ba128b
>>> 
>>> ```
>>> ** Reason == {badarg,
>>> [{io,put_chars,
>>> [<0.93.0>,unicode,
>>> <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>] 66.249.66.216 - - GET /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor 200\n">>],
>>> []},
>>> ```
>>> 
>>> So…now what? I have a rebuilt version of CouchDB I'm going to try
>>> [once I figure out why *it* isn't starting], but this is still really
>>> upsetting. I'm aware I could add my own cron job or something to
>>> check every minute and restart if needed, but a) the shell script
>>> is SUPPOSED to be keeping CouchDB running, b) it's NOT, and c) this is
>>> embarrassing and aggravating.
>>> 
>>> thanks,
>>> -natevw
>> 
>> So there are two things here:
>> 
>> - Why doesn't the couch get restarted?
>> 
>> This sounds very much like the aforementioned pid race condition. Wendall, do
>> you know any more about this? I thought you had some ideas about it, IIRC.
>> 
> 
> 
> I think I figured out the answer to this one, at least for the latest crash.
> The Erlang process the shell script watches was still running, just not
> accepting connections. I didn't notice this the previous times, though; I only
> realized it this time because, when I went to restart it, the shell script
> acted like it was already running. So maybe there are actually two crashes: a
> silent heartbeat one and this unicode one?
> 
> 
> 
>> - Why does io:put_chars/2 have trouble writing to a boring log file, which
>> obviously works most of the time?
>> 
>> <0.93.0>,unicode, <<"[Thu, 31 Oct 2013 19:48:48 GMT...">>
>> 
>> io:put_chars(Fd, unicode, <<Binary>>) doesn’t look right — there’s no 
>> io:put_chars/3. 
>> 
>> This unicode looks weird and from a quick look I can’t see where it should 
>> come from.
>> 
>> Can you get more of the log file (say, hundreds of lines) and put it
>> somewhere? Email is fine.
>> 
>> I’d like to see what happens to <0.93.0> (the process wrapping the log fd), 
>> and also if the unicode atom turns up anywhere else prior.
> 
> 
> You want more of the log *up to* the crash? Because I have nothing *beyond*
> what is in that gist, that's the thing! The end of the log was cut off; I did
> not snip it. The log as it sits now has these exact lines in it:
> 
> ```
>                             {line,173}]},
>                           {gen_event,ser
> Apache CouchDB 1.4.0 (LogLevel=info) is starting.
> ```
> 
> (The subsequent "starting" is due to my intervention.)
> 
> -nvw
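Dave asked above to trace what happened to `<0.93.0>` and whether the `unicode` atom appears earlier in the log. A plain substring scan over the log is enough for that. Below is a minimal Python sketch. `find_mentions` is a hypothetical helper. Searching for `unicode` will also match ordinary log text that happens to contain the word, so the hits still need reading.

```python
from typing import Iterable, List, Sequence, Tuple


def find_mentions(lines: Iterable[str], needles: Sequence[str]) -> List[Tuple[int, str]]:
    """Return (1-based line number, line) for every line containing any of the needles."""
    hits = []
    for lineno, line in enumerate(lines, start=1):
        if any(needle in line for needle in needles):
            hits.append((lineno, line.rstrip("\n")))
    return hits


# Example use (the path is a placeholder for wherever the CouchDB log lives):
# with open("/path/to/couch.log", encoding="utf-8", errors="replace") as log:
#     for lineno, text in find_mentions(log, ["<0.93.0>", "unicode"]):
#         print(lineno, text)
```

Opening the file with `errors="replace"` keeps the scan going even if the truncated tail of the log ends partway through a multi-byte character.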
