We were running some tests against a set of CouchDB servers
replicating continuously to each other when the server (Erlang)
process crashed. The CouchDB startup script restarted the server
process, per its respawn timeout option.
These tests were mainly single-shot requests, not load tests.
In the couch stdout log there is a long series of crash reports,
stack traces, etc. leading up to the restart.
I can't say I understand exactly what I am reading or what is
significant, but it looks like CouchDB failed somewhere while reading
the btree file for one of the databases that was under
replication. Some views in this database were also being
updated at the same time.
This is CouchDB 1.0.1 on Erlang R14B, on Solaris 10.
Here are some excerpts from the log. I can provide more data if needed.
=CRASH REPORT==== 2-Nov-2010::20:43:52 ===
crasher:
initial call: couch_file:init/1
pid: <0.1111.10>
registered_name: []
exception exit: {{badmatch,{ok
<snip...>
=CRASH REPORT==== 2-Nov-2010::20:43:57 ===
crasher:
initial call: couch_rep_reader:init/1
pid: <0.1971.10>
registered_name: []
exception exit: {function_clause,
[{couch_rep_reader,handle_info,
[{'EXIT',<0.1972.10>,
{noproc,
{gen_server,call,[<0.1113.10>,{drop,<0.1972.10>}]}}},
{state,<0.1952.10>,
{db,<0.13069.7>,<0.13070.7>,nil,
<<"1288298701472154">>,<0.1111.10>,<0.1113.10>,
{db_header,5,32,0,
{84585,{13,0}},
{84709,13},
{12377,[]},
0,nil,nil,1000},
32,
{btree,<0.1111.10>,
{84585,{13,0}},
#Fun<couch_db_updater.7.69395062>,
#Fun<couch_db_updater.8.86519079>,
#Fun<couch_btree.5.124754102>,
#Fun<couch_db_updater.9.24674233>},
{btree,<0.1111.10>,
{84709,13},
#Fun<couch_db_updater.10.90337910>,
#Fun<couch_db_updater.11.13595824>,
#Fun<couch_btree.5.124754102>,
#Fun<couch_db_updater.12.34906778>},
{btree,<0.1111.10>,
{12377,[]},
#Fun<couch_btree.0.83553141>,
#Fun<couch_btree.1.30790806>,
#Fun<couch_btree.2.124754102>,nil},
32,<<"mps_rm">>,
"/opt/couchdb/databases/mps_rm.couch",[],[],nil,
{user_ctx,null,
[<<"_admin">>],
<<"{couch_httpd_auth,
default_authentication_handler}">>},
nil,1000,
[before_header,after_header,on_file_open],
false},
<0.1969.10>,<0.1972.10>,[],0,
{[],[]},
{<0.1973.10>,#Ref<0.0.6.57758>},
false,0,nil,",","+"}]},
{gen_server,handle_msg,5},
{proc_lib,init_p_do_apply,3}]}
in function gen_server:terminate/6
ancestors: [<0.1952.10>,couch_rep_sup,couch_primary_services,
couch_server_sup,<0.31.0>]
messages: []
links: [<0.1952.10>]
dictionary: []
trap_exit: true
status: running
heap_size: 6765
stack_size: 24
reductions: 1396
neighbours:
=CRASH REPORT==== 2-Nov-2010::20:43:57 ===
crasher:
initial call: couch_rep_missing_revs:init/1
pid: <0.1969.10>
registered_name: []
exception exit: {noproc,
{gen_server,call,[<0.1113.10>,{drop,<0.1952.10>}]}}
in function gen_server:terminate/6
ancestors: [<0.1952.10>,couch_rep_sup,couch_primary_services,
couch_server_sup,<0.31.0>]
messages: []
links: [<0.1970.10>]
dictionary: []
trap_exit: true
status: running
heap_size: 987
stack_size: 24
reductions: 464
neighbours: