Aaaaand this happened *again* over the weekend. This time I had started CouchDB in a screen session, which was still running. Again, it looked like the shell script processes and the beam one were all still running, just…no Couch.
I had debug logs going. The stdout record shows the logger dying again, but not with any unicode-error type of event, just the last log:
https://gist.github.com/natevw/dcd4a9a973da01270735

There are some "heart: Sat Nov 9 08:35:30 2013: heart-beat time-out, no activity for 26 seconds" lines in the stderr log, but I'm not sure whether they're related…there seem to be a few more heart-beat time-outs than actual CouchDB server failures.

Any concrete suggestions…? This sucks. I'm burnt out poking through debug logs on this, I'm embarrassed and angry every time I discover my sites have been down for another day or two because of this, and adding another layer of twine and baling wire in the form of a _second_ shell watchdog script is not at all exciting >:-(

regards,
-natevw

On Nov 1, 2013, at 9:17 AM, Nathan Vander Wilt <[email protected]> wrote:
>
> On Nov 1, 2013, at 12:10 AM, Dave Cottlehuber <[email protected]> wrote:
>
>>> On Oct 31, 2013, at 5:13 PM, Nathan Vander Wilt wrote:
>>>
>>> Aaaand my Couch committed suicide again today. Unless this is
>>> something different, I may have finally gotten lucky and had
>>> CouchDB leave a note [eerily unfinished!] in the logs this time:
>>> https://gist.github.com/natevw/fd509978516499ba128b
>>>
>>> ```
>>> ** Reason == {badarg,
>>> [{io,put_chars,
>>> [<0.93.0>,unicode,
>>> <<"[Thu, 31 Oct 2013 19:48:48 GMT] [info] [<0.31789.2>] 66.249.66.216
>>> - - GET
>>> /public/_design/glob/_list/posts/by_path?key=%5B%222012%22%2C%2203%22%2C%22metakaolin_geojson_editor%22%5D&include_docs=true&path1=2012&path2=03&path3=metakaolin_geojson_editor
>>>
>>> 200\n">>],
>>> []},
>>> ```
>>>
>>> So…now what? I have a rebuilt version of CouchDB I'm going to try
>>> [once I figure out why *it* isn't starting] but this is still really
>>> upsetting. I'm aware I could add my own cronjob or something to
>>> check and restart if needed every minute, but a) the shell script
>>> is SUPPOSED to be keeping CouchDB running and b) it's NOT and c) this is
>>> embarrassing and aggravating.
>>>
>>> thanks,
>>> -natevw
>>
>> So there are 2 things here
>>
>> - why the couch doesn’t get restarted?
>>
>> Sounds very much like the aforementioned pid race condition. Wendall, do you
>> know any more about this? I thought you had some ideas about it IIRC.
>>
>
> I think I figured out the answer to this one, at least in the latest crash.
> The Erlang process the shell script watches was still running, just not
> accepting connections. I didn't notice this the previous times, though…I only
> realized it this time because when I went to restart, the shell script acted
> like it was already running. So maybe there are actually two crashes, a
> silent heartbeat one and this unicode one?
>
>> - why io:put_chars/2 has trouble writing to a boring log file, which
>> obviously works most of the time.
>>
>> <0.93.0>,unicode, <<"[Thu, 31 Oct 2013 19:48:48 GMT...">>
>>
>> io:put_chars(Fd, unicode, <<Binary>>) doesn’t look right; there’s no
>> io:put_chars/3.
>>
>> This unicode looks weird and from a quick look I can’t see where it should
>> come from.
>>
>> Can you get more of the logfile (like hundreds of lines) and stick it
>> somewhere? email is fine.
>>
>> I’d like to see what happens to <0.93.0> (the process wrapping the log fd),
>> and also if the unicode atom turns up anywhere else prior.
>
> You want more of the log *up to* the crash? Because I have nothing *beyond*
> what is in that gist, that's the thing! The end of the log was cut off, I did
> not snip it. The log as it sits now has these exact lines in it:
>
> ```
> {line,173}]},
> {gen_event,ser
> Apache CouchDB 1.4.0 (LogLevel=info) is starting.
> ```
>
> (The subsequent "starting" is due to my intervention.)
>
> -nvw
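For reference, the failure mode described in this thread (the beam process still alive but not accepting connections) is invisible to any watchdog that only checks whether a PID exists. A minimal sketch of an HTTP-level liveness probe follows. It is not part of CouchDB or its startup script. The function name `couch_is_alive` is hypothetical, and the URL assumes CouchDB is listening on its default port, 5984, on localhost.

```python
import http.client
import json
import urllib.request

# Assumption: CouchDB listens on its default port (5984) on localhost.
COUCH_URL = "http://127.0.0.1:5984/"

# Ignore any proxy settings from the environment; this is a local probe.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def couch_is_alive(url=COUCH_URL, timeout=5.0):
    """Return True only if GET on `url` yields a CouchDB welcome document.

    A live process that does not answer HTTP counts as dead, which is the
    case a PID-based check misses.
    """
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            body = json.loads(resp.read().decode("utf-8"))
    except (OSError, ValueError, http.client.HTTPException):
        # OSError: refused connection, timeout, urllib.error.URLError.
        # ValueError: the body was not valid UTF-8 or not valid JSON.
        # HTTPException: malformed or truncated HTTP response.
        return False
    # The CouchDB root resource returns a JSON object with a "couchdb" key.
    return isinstance(body, dict) and "couchdb" in body
```

A cron job or supervisor could call `couch_is_alive()` and restart the server when it returns False. How to restart it depends on the installation and is left out on purpose.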
