Hi Carlos,

I wrote a post on monitoring CouchDB using Prometheus:
https://hackernoon.com/monitoring-couchdb-with-prometheus-grafana-and-docker-4693bc8408f0

I'm not sure if it will provide all the metrics you need, but I hope this helps.

Geoff
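A minimal sketch of that kind of monitoring, in Python, assuming the `requests` and `prometheus_client` packages are available. The `_system` URL (here on the node-local port, 5986, as on CouchDB 2.x), the exporter port, and the credentials are placeholders, not anything taken from the post:

```python
# Hedged sketch: poll CouchDB's _system endpoint and re-expose process_count
# to Prometheus. Assumes the node-local port (5986) serves _system as on
# CouchDB 2.x; URL, exporter port, and credentials are placeholders.
import time

import requests
from prometheus_client import Gauge, start_http_server

SYSTEM_URL = "http://127.0.0.1:5986/_system"  # placeholder node-local endpoint
AUTH = ("admin", "secret")                    # placeholder credentials

PROCESS_COUNT = Gauge(
    "couchdb_erlang_process_count",
    "Number of Erlang processes on the CouchDB node",
)

def scrape() -> None:
    stats = requests.get(SYSTEM_URL, auth=AUTH, timeout=5).json()
    PROCESS_COUNT.set(stats["process_count"])  # field present in _system output

if __name__ == "__main__":
    start_http_server(9984)  # Prometheus scrapes this port for /metrics
    while True:
        scrape()
        time.sleep(15)
```

A Prometheus scrape job pointed at port 9984 would then track process_count per node, which is exactly the metric that turns out to matter below.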
On Mon, Oct 9, 2017 at 3:53 AM Carlos Alonso <[email protected]> wrote:

> I'd like to connect a diagnostic tool such as etop, observer, ... to see
> which processes are running there, but I cannot seem to get it working.
>
> Could anyone please share how to run any of those tools on a remote
> server?
>
> Regards
>
> On Sat, Oct 7, 2017 at 6:13 PM Carlos Alonso <[email protected]> wrote:
>
>> So I've found another relevant symptom. After adding _system endpoint
>> monitoring, I discovered that this particular node behaves differently
>> from the other ones in terms of Erlang process count.
>>
>> The process_count metric of the healthy nodes is stable at around 1k to
>> 1.3k, while the affected node's process_count grows slowly but
>> continuously until a little above 5k processes, which is when it gets
>> 'frozen'. After a restart the value comes back to the normal 1k to 1.3k
>> (only to immediately start growing slowly again, of course :)).
>>
>> Any idea? Thanks!
>>
>> On Tue, Oct 3, 2017 at 11:18 PM Carlos Alonso <[email protected]> wrote:
>>
>>> This is one of the complete error sequences I can see:
>>>
>>> [error] 2017-10-03T21:13:16.716692Z couchdb@couchdb-node-1 emulator
>>> -------- Error in process <0.24558.209> on node 'couchdb@couchdb-node-1'
>>> with exit value:
>>> {{nocatch,{mp_parser_died,noproc}},
>>>  [{couch_att,'-foldl/4-fun-0-',3,[{file,"src/couch_att.erl"},{line,591}]},
>>>   {couch_att,fold_streamed_data,4,[{file,"src/couch_att.erl"},{line,642}]},
>>>   {couch_att,foldl,4,[{file,"src/couch_att.erl"},{line,595}]},
>>>   {couch_httpd_multipart,atts_to_mp,4,[{file,"src/couch_httpd_multipart.erl"},{line,208}]}]}
>>>
>>> [error] 2017-10-03T21:13:16.717606Z couchdb@couchdb-node-1 <0.5208.204>
>>> aab326c0bb req_err(2515771787) badmatch : ok
>>> [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1 L295">>,
>>>  <<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6 L91">>,
>>>  <<"proc_lib:init_p_do_apply/3 L240">>]
>>>
>>> [error] 2017-10-03T21:13:16.717859Z couchdb@couchdb-node-1 <0.20718.207>
>>> -------- Replicator, request PUT to
>>> "http://127.0.0.1:5984/my_db/de45a832a1fac563c89da73dc7dc4d3e?new_edits=false"
>>> failed due to error {error,
>>>     {'EXIT',
>>>      {{{nocatch,{mp_parser_died,noproc}},
>>> ...
>>>
>>> Regards
>>>
>>> On Tue, Oct 3, 2017 at 11:05 PM Carlos Alonso <[email protected]> wrote:
>>>
>>>> The 'weird' thing about the mp_parser_died error is that, according to
>>>> the description of issue 745, the replication never finishes, as an
>>>> item that fails once seems to fail forever. In my case, however, items
>>>> fail but then seem to work (possibly because the replication is
>>>> retried), as I can find the documents that generated the errors (in the
>>>> logs) in the target db...
>>>>
>>>> Regards
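One way to confirm that impression is to compare the current revision of one of the failing documents on both sides. A minimal sketch, reusing the doc id and URLs from the traces above; credentials are omitted, and `verify=False` mirrors the replicator's `{verify,verify_none}` option visible in the state dump further down:

```python
# Hedged sketch: compare a document's current revision on source and target.
# Doc id and URLs are taken from the replicator PUT in the traces above;
# add auth=(user, password) if the databases require it.
import requests

SOURCE = "https://source_ip/my_db"
TARGET = "http://127.0.0.1:5984/my_db"
DOC_ID = "de45a832a1fac563c89da73dc7dc4d3e"

def current_rev(base_url: str, doc_id: str) -> str:
    # A HEAD on a document returns its current revision in the ETag header.
    # verify=False mirrors the replicator's {verify,verify_none} ssl option.
    resp = requests.head(f"{base_url}/{doc_id}", verify=False)
    resp.raise_for_status()
    return resp.headers["ETag"].strip('"')

print("source rev:", current_rev(SOURCE, DOC_ID))
print("target rev:", current_rev(TARGET, DOC_ID))  # equal revs => retry worked
```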
>>>> On Tue, Oct 3, 2017 at 10:52 PM Carlos Alonso <[email protected]> wrote:
>>>>
>>>>> To give some more context: this node is responsible for replicating a
>>>>> database that has quite a lot of attachments, and it raises the
>>>>> 'famous' mp_parser_died,noproc error, which I think is this one:
>>>>> https://github.com/apache/couchdb/issues/745
>>>>>
>>>>> What I've identified so far from the logs is that, along with the
>>>>> error described above, this error also appears:
>>>>>
>>>>> [error] 2017-10-03T19:54:32.380379Z couchdb@couchdb-node-1
>>>>> <0.30012.3408> 520e44b7ae req_err(2515771787) badmatch : ok
>>>>> [<<"chttpd_db:db_doc_req/3 L780">>,<<"chttpd:process_request/1 L295">>,
>>>>>  <<"chttpd:handle_request_int/1 L231">>,<<"mochiweb_http:headers/6 L91">>,
>>>>>  <<"proc_lib:init_p_do_apply/3 L240">>]
>>>>>
>>>>> Sometimes it appears just after the mp_parser_died error; sometimes
>>>>> the parser error happens without 'triggering' one of these badmatch
>>>>> ones.
>>>>>
>>>>> Then, after a while of this sequence, the initially described
>>>>> sel_conn_closed error starts being raised for all requests and the
>>>>> node gets frozen. It is not responsive, but it is still not removed
>>>>> from the cluster, so it holds its replications and, obviously,
>>>>> replicates nothing until it is restarted.
>>>>>
>>>>> I can also see interleaved unauthorized errors, which don't make much
>>>>> sense as I'm the only one accessing this cluster:
>>>>>
>>>>> [error] 2017-10-03T19:33:47.022572Z couchdb@couchdb-node-1
>>>>> <0.32501.3323> c683120c97 rexi_server
>>>>> throw:{unauthorized,<<"You are not authorized to access this db.">>}
>>>>> [{couch_db,open,2,[{file,"src/couch_db.erl"},{line,99}]},
>>>>>  {fabric_rpc,open_shard,2,[{file,"src/fabric_rpc.erl"},{line,261}]},
>>>>>  {rexi_server,init_p,3,[{file,"src/rexi_server.erl"},{line,139}]}]
>>>>>
>>>>> To me it feels like the mp_parser_died error slowly breaks something
>>>>> that in the end leaves the node unresponsive, as those errors happen
>>>>> a lot in that particular replication.
>>>>>
>>>>> Regards, and thanks a lot for your help!
>>>>>
>>>>> On Tue, Oct 3, 2017 at 7:59 PM Joan Touzet <[email protected]> wrote:
>>>>>
>>>>>> Is there more to the error? All this shows us is that the replicator
>>>>>> itself attempted a POST and had the connection closed on it.
>>>>>> (Remember that the replicator is basically just a custom client that
>>>>>> sits alongside CouchDB on the same machine.) There should be more to
>>>>>> the error log that shows why CouchDB hung up the phone.
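One way to dig for that evidence is to bucket the three error signatures by hour and check whether mp_parser_died and badmatch ramp up before the sel_conn_closed storm. A rough sketch, assuming a plain-text log at a placeholder path:

```python
# Hedged sketch: count the three error signatures per hour to see whether
# mp_parser_died/badmatch ramp up before the sel_conn_closed storm.
# The log path is a placeholder for wherever couch.log actually lives.
from collections import Counter

PATTERNS = ("mp_parser_died", "badmatch", "sel_conn_closed")
LOG_PATH = "/var/log/couchdb/couch.log"  # placeholder path

counts = Counter()
with open(LOG_PATH, errors="replace") as log:
    for line in log:
        parts = line.split()
        if not line.startswith("[error]") or len(parts) < 2:
            continue
        hour = parts[1][:13]  # "2017-10-03T21" from the ISO-8601 timestamp
        for pattern in PATTERNS:
            if pattern in line:
                counts[(hour, pattern)] += 1

for (hour, pattern), n in sorted(counts.items()):
    print(f"{hour}  {pattern:>15}  {n}")
```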
>>>>>> ----- Original Message -----
>>>>>> From: "Carlos Alonso" <[email protected]>
>>>>>> To: "user" <[email protected]>
>>>>>> Sent: Tuesday, 3 October, 2017 4:18:18 AM
>>>>>> Subject: Re: Trying to understand why a node gets 'frozen'
>>>>>>
>>>>>> Hello, this is happening every day, always on the same node. Any
>>>>>> ideas?
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>> On Sun, Oct 1, 2017 at 11:42 AM Carlos Alonso <[email protected]> wrote:
>>>>>>
>>>>>>> Hello everyone!
>>>>>>>
>>>>>>> I'm trying to understand an issue we're experiencing on CouchDB
>>>>>>> 2.1.0 running on Ubuntu 14.04. The cluster is currently replicating
>>>>>>> from another source cluster, and we have seen that one node gets
>>>>>>> frozen from time to time, and we have to restart it to get it to
>>>>>>> respond again.
>>>>>>>
>>>>>>> Before getting unresponsive, the node throws a lot of {error,
>>>>>>> sel_conn_closed}. See an example trace below.
>>>>>>>
>>>>>>> [error] 2017-10-01T05:25:23.921126Z couchdb@couchdb-1 <0.13489.0>
>>>>>>> -------- gen_server <0.13489.0> terminated with reason:
>>>>>>> {checkpoint_commit_failure,<<"Failure on target commit:
>>>>>>>     {'EXIT',{http_request_failed,\"POST\",
>>>>>>>         \"http://127.0.0.1:5984/mydb/_ensure_full_commit\",
>>>>>>>         {error,sel_conn_closed}}}">>}
>>>>>>> last msg: {'EXIT',<0.10626.0>,{checkpoint_commit_failure,<<"Failure
>>>>>>> on target commit:
>>>>>>>     {'EXIT',{http_request_failed,\"POST\",
>>>>>>>         \"http://127.0.0.1:5984/mydb/_ensure_full_commit\",
>>>>>>>         {error,sel_conn_closed}}}">>}}
>>>>>>> state: {state,<0.10626.0>,<0.13490.0>,20,
>>>>>>>     {httpdb,"https://source_ip/mydb/",nil,
>>>>>>>      [{"Accept","application/json"},{"Authorization","Basic ..."},
>>>>>>>       {"User-Agent","CouchDB-Replicator/2.1.0"}],30000,
>>>>>>>      [{is_ssl,true},{socket_options,[{keepalive,true},{nodelay,false}]},
>>>>>>>       {ssl_options,[{depth,3},{verify,verify_none}]}],
>>>>>>>      10,250,<0.11931.0>,20,nil,undefined},
>>>>>>>     {httpdb,"http://127.0.0.1:5984/mydb/",nil,
>>>>>>>      [{"Accept","application/json"},{"Authorization","Basic ..."},
>>>>>>>       {"User-Agent","CouchDB-Replicator/2.1.0"}],30000,
>>>>>>>      [{socket_options,[{keepalive,true},{nodelay,false}]}],
>>>>>>>      10,250,<0.11995.0>,20,nil,undefined},
>>>>>>>     [],<0.25756.4748>,nil,{<0.13490.0>,#Ref<0.0.724041731.98305>},
>>>>>>>     [{docs_read,1},{missing_checked,1},{missing_found,1}],nil,nil,
>>>>>>>     {batch,[<<"{\"_id\":\"df84bfda818ea150b249da89e8d79a38\",
>>>>>>>       \"_rev\":\"1-ebb0119fbdcad604ad372fa6e05d06a2\",
>>>>>>>       ...\":{\"start\":1,\"ids\":[\"ebb0119fbdcad604ad372fa6e05d06a2\"]}}">>],
>>>>>>>      605}}
>>>>>>>
>>>>>>> This particular node is 'responsible' for a replication that
>>>>>>> produces quite a lot of {mp_parser_died,noproc} errors, which AFAIK
>>>>>>> is a known bug (https://github.com/apache/couchdb/issues/745), but I
>>>>>>> don't know whether the two are related.
>>>>>>>
>>>>>>> When that happens, simply restarting the node brings it back up and
>>>>>>> running properly.
>>>>>>>
>>>>>>> Any help would be really appreciated.
>>>>>>>
>>>>>>> Regards
>>>>>>>
>>>>>>> --
>>>>>>> Carlos Alonso
>>>>>>> Data Engineer
>>>>>>> Madrid, Spain
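Since the failing call in that trace is the replicator's checkpoint POST to _ensure_full_commit, the suspect node can be probed with the same request to see whether it responds or hangs up. A minimal sketch with placeholder host, database name, and credentials:

```python
# Hedged sketch: make the same checkpoint request the replicator makes,
# to see whether the suspect node answers or hangs up. Host, database
# name, and credentials are placeholders.
import requests

URL = "http://127.0.0.1:5984/mydb/_ensure_full_commit"
AUTH = ("admin", "secret")  # placeholder credentials

try:
    resp = requests.post(
        URL,
        auth=AUTH,
        timeout=30,
        headers={"Content-Type": "application/json"},  # CouchDB expects JSON here
    )
    print(resp.status_code, resp.text)  # healthy node: 201 {"ok":true,...}
except requests.exceptions.ConnectionError as exc:
    # A node that drops the socket, as behind {error,sel_conn_closed},
    # should show up here as a closed/refused connection.
    print("connection failed:", exc)
```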
