Hey guys, just curious: are you monitoring RAM/swap and disk I/O on
these machines? I have seen RAM peaks that make Couch unresponsive, so
that may be what is going on here. I suggest that next time you see an
unresponsive machine you try to isolate it, look at the OS metrics, and
check whether the Erlang shell still connects. Maybe give debug access
to someone who can run full diagnostics.
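
For reference, a minimal sketch of the kind of check I mean; the node name
and cookie below are assumptions, take the real ones from your vm.args:

  # OS-level view of memory, swap and disk I/O on the stuck machine
  free -m
  vmstat 5 5
  iostat -x 5 5

  # try to attach a remote Erlang shell to the running node
  erl -name diag@127.0.0.1 -setcookie <cookie-from-vm.args> -remsh couchdb@127.0.0.1
  # if it connects, erlang:memory(). and length(erlang:processes()). give a
  # quick idea of whether the VM itself is under memory pressure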

On Mon, 20 May 2019 at 14:29, f z <fuzzyillo...@gmail.com> wrote:

> Maybe connected to the random crash problem, maybe not, but... since we
> did not do any replication while testing for the crash (relying on file
> copying for backups), we did not notice this before now. Now that I've
> tried a one-time, one-way replication of the production db, I'm noticing
> that it is taking more than 2 hours (we just started the third hour and
> we are at about 2/3) to replicate a simple db of slightly more than 3000
> records and about 3 MB in size.
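>
> (For reference, the replicator's progress can be watched over HTTP while
> the job runs; a rough sketch, assuming CouchDB on localhost:5984 and
> admin credentials, which will differ per setup:)
>
>   # placeholder credentials/host -- adjust to your installation
>   curl -s http://admin:password@localhost:5984/_active_tasks
>   curl -s http://admin:password@localhost:5984/_scheduler/jobs
>   # docs_read / docs_written and changes_pending in the output show whether
>   # the one-off replication is actually moving or stalled on a batch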
>
> Only a couple of simple views and indexes, and I compacted the db before
> this. And neither the network nor the disk is the bottleneck. In fact,
> the CPUs on both machines, while showing some signs of activity, are
> mostly idle, as are the RAM and disk usage stats and the other
> parameters. It's simply... taking forever.
>
> Oh, and while doing this operation the source db has already hung a
> couple of times... always without messages or any other visible symptom
> of a problem.
>
> I'm assuming I'm doing something wrong, because if this were standard
> behavior I don't see how this product could be used for any kind of
> serious work, but... how can I find what's wrong? I combed the
> documentation, but as a "standard stock-release installer,
> non-Erlang-expert user", I don't know where to go from here. It would be
> really helpful to have an idea of the things that CouchDB does not
> "like"...
>
>
> On Sat, 18 May 2019 at 02:07, f z <fuzzyillo...@gmail.com> wrote:
>
> > Yeah, the last couple of hangs happened without any proxy at all in
> > front of them. We are currently running the latest 2.3.1.
> > We were using Docker containers, but we have now tried moving to a
> > straight installation on a physical server.
> >
> > Unfortunately, since it can go for many days before hanging, we will
> > only know in a few weeks whether that has helped... I'll post updates
> > here, for posterity if nothing else.
> >
> > The usage is still quite light, with a few hundred records, a few
> > indexes, one view, and mostly Mango queries, and in fact outside of
> > when it hangs it works blazingly fast.
> >
> > On Fri, 17 May 2019, 19:33 Ben Field, <b...@geocaching.com> wrote:
> >
> >> We experience something similar with our 6 node cluster. Every few
> >> weeks each node will get in a hung state with nothing getting written
> >> to the logs. It seems to happen to all of them, one at a time. Our
> >> workaround is a cron job that restarts the service when it stops
> >> responding.
> >> Interestingly it can happen multiple times for the same servers all in
> >> a short period of time, and then it will be fine again for another few
> >> weeks.
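> >>
> >> (For anyone reusing this workaround, a minimal sketch of such a watchdog;
> >> the URL, timeout and service name are assumptions to adjust per setup:)
> >>
> >>   #!/bin/sh
> >>   # restart CouchDB if it stops answering within 10 seconds
> >>   if ! curl -sf --max-time 10 http://127.0.0.1:5984/_up > /dev/null; then
> >>       systemctl restart couchdb
> >>   fi
> >>   # run from cron, e.g.: */5 * * * * /usr/local/bin/couchdb-watchdog.sh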
> >>
> >> We're still running v2.2.0, though the problem seemed to be much more
> >> frequent under v2.1.x. Not sure if it's related to your issue, but we
> >> also haven't found a fix yet. Ours are proxied behind nginx, though.
> >>
> >>
> >>
> >>
> >> On Fri, May 17, 2019 at 2:48 AM f z <fuzzyillo...@gmail.com> wrote:
> >>
> >> > After a couple more weeks of testing, as a helpful follow-up for
> >> > those who come after: shutting down the replication has only
> >> > increased the time between random CouchDB hangs, it has not
> >> > eliminated them.
> >> > Same characteristics as before: everything works perfectly without
> >> > any sign of errors, until the moment it starts freezing, again
> >> > without any sign of errors.
> >> > If somebody has suggestions about what I can try, they are welcome...
> >> >
> >> > On Fri, 3 May 2019 at 13:58, f z <fuzzyillo...@gmail.com> wrote:
> >> >
> >> > > As a possibly helpful follow-up to the previous problem, for people
> >> > > who may find this post in the future:
> >> > > after many tests, it seems that the problem I mentioned is related
> >> > > to the replication we set up, as I wondered in that post.
> >> > > If the replication is turned off, the crash does not occur, at
> >> > > least in weeks of testing, while if it's turned on it does
> >> > > (replication works perfectly until the moment the database freezes,
> >> > > and after a restart it starts working fine again).
> >> > > While reading the documentation we discovered a piece we had missed
> >> > > earlier: CouchDB has issues with continuous replication if it's
> >> > > behind an Apache reverse proxy, and that was the case here, so
> >> > > maybe that's the culprit, although it's not clear how a
> >> > > proxying/caching problem could lead to the database freezing
> >> > > without any error messages.
> >> > >
> >> > > We will now test with replication and a different reverse proxy in
> >> > > front of the db, to be sure that's the case.
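> >> > >
> >> > > (A quick way to check whether the proxy itself interferes, as a
> >> > > rough sketch with placeholder URLs and db name: request a continuous
> >> > > _changes feed with a heartbeat both directly and through the proxy.
> >> > > If the heartbeat newlines arrive steadily when going direct but
> >> > > stall or come in bursts through the proxy, the proxy is buffering or
> >> > > timing out the feed.)
> >> > >
> >> > >   # direct to CouchDB
> >> > >   curl -sN 'http://127.0.0.1:5984/mydb/_changes?feed=continuous&heartbeat=10000&since=now'
> >> > >   # through the reverse proxy
> >> > >   curl -sN 'https://proxy.example.com/couchdb/mydb/_changes?feed=continuous&heartbeat=10000&since=now'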
> >> > >
> >> > > On Tue, 23 Apr 2019 at 12:43, f z <fuzzyillo...@gmail.com> wrote:
> >> > >
> >> > >> Hello, we are developing a new application based on a CouchDB
> >> > >> database, and we have run into some strange problems.
> >> > >> We are using CouchDB 2.3.0 in a Docker installation based on the
> >> > >> official images.
> >> > >> During the development cycle everything worked fine with simulated
> >> > >> workloads comparable to the final usage.
> >> > >> When we installed the application on two early production machines,
> >> > >> with a workload much lighter than the development one and more
> >> > >> powerful dedicated servers, both CouchDB instances started behaving
> >> > >> erratically during the night, while not being used by anyone, and
> >> > >> one of them froze completely: nothing has been logged since the
> >> > >> moment it froze, the process seems to be running, but any call made
> >> > >> by curl to Couch throws back, after a lengthy timeout, an error
> >> > >> like the following:
> >> > >>
> >> > >>
> >> > >> {"error":"case_clause","reason":"{timeout,{[{{shard,<<\"shards/00000000-1fffffff/mastercom.1555403400\">>,\n
> >> > >> nonode@nohost,<<\"mastercom\">>,\n
> >> > >> [0,536870911],\n
> >> > >> #Ref<0.0.2097153.224211>,[]},\n            nil},\n
> >> > >> {{shard,<<\"shards/20000000-3fffffff/mastercom.1555403400\">>,\n
> >> > >> nonode@nohost,<<\"mastercom\">>,\n
> >> > >> [536870912,1073741823],\n
> >> > >> #Ref<0.0.2097153.224212>,[]},\n            27988},\n
> >> > >> {{shard,<<\"shards/40000000-5fffffff/mastercom.1555403400\">>,\n
> >> > >> nonode@nohost,<<\"mastercom\">>,\n
> >> > >> [1073741824,1610612735],\n
> >> > >> #Ref<0.0.2097153.224213>,[]},\n            nil},\n
> >> > >> {{shard,<<\"shards/60000000-7fffffff/mastercom.1555403400\">>,\n
> >> > >> nonode@nohost,<<\"mastercom\">>,\n
> >> > >> [1610612736,2147483647],\n
> >> > >> #Ref<0.0.2097153.224214>,[]},\n            28553},\n
> >> > >> {{shard,<<\"shards/80000000-9fffffff/mastercom.1555403400\">>,\n
> >> > >> nonode@nohost,<<\"mastercom\">>,\n
> >> > >> [2147483648,2684354559],\n
> >> > >> #Ref<0.0.2097153.224215>,[]},\n            27969},\n
> >> > >> {{shard,<<\"shards/a0000000-bfffffff/mastercom.1555403400\">>,\n
> >> > >> nonode@nohost,<<\"mastercom\">>,\n
> >> > >> [2684354560,3221225471],\n
> >> > >> #Ref<0.0.2097153.224216>,[]},\n            28502},\n
> >> > >> {{shard,<<\"shards/c0000000-dfffffff/mastercom.1555403400\">>,\n
> >> > >> nonode@nohost,<<\"mastercom\">>,\n
> >> > >> [3221225472,3758096383],\n
> >> > >> #Ref<0.0.2097153.224217>,[]},\n            28633},\n
> >> > >> {{shard,<<\"shards/e0000000-ffffffff/mastercom.1555403400\">>,\n
> >> > >> nonode@nohost,<<\"mastercom\">>,\n
> >> > >> [3758096384,4294967295],\n
> >> > >> #Ref<0.0.2097153.224218>,[]},\n            28010}],\n
> >> > >> [{{shard,<<\"shards/e0000000-ffffffff/mastercom.1555403400\">>,\n
> >> > >> nonode@nohost,<<\"mastercom\">>,\n
> >> > >> [3758096384,4294967295],\n
> >> > >> #Ref<0.0.2097153.224218>,[]},\n            0},\n
> >> > >> {{shard,<<\"shards/c0000000-dfffffff/mastercom.1555403400\">>,\n
> >> > >> nonode@nohost,<<\"mastercom\">>,\n
> >> > >> [3221225472,3758096383],\n
> >> > >> #Ref<0.0.2097153.224217>,[]},\n            0},\n
> >> > >> {{shard,<<\"shards/a0000000-bfffffff/mastercom.1555403400\">>,\n
> >> > >> nonode@nohost,<<\"mastercom\">>,\n
> >> > >> [2684354560,3221225471],\n
> >> > >> #Ref<0.0.2097153.224216>,[]},\n            0},\n
> >> > >> {{shard,<<\"shards/80000000-9fffffff/mastercom.1555403400\">>,\n
> >> > >> nonode@nohost,<<\"mastercom\">>,\n
> >> > >> [2147483648,2684354559],\n
> >> > >> #Ref<0.0.2097153.224215>,[]},\n            0},\n
> >> > >> {{shard,<<\"shards/60000000-7fffffff/mastercom.1555403400\">>,\n
> >> > >> nonode@nohost,<<\"mastercom\">>,\n
> >> > >> [1610612736,2147483647],\n
> >> > >> #Ref<0.0.2097153.224214>,[]},\n            0},\n
> >> > >> {{shard,<<\"shards/20000000-3fffffff/mastercom.1555403400\">>,\n
> >> > >> nonode@nohost,<<\"mastercom\">>,\n
> >> > >> [536870912,1073741823],\n
> >> > >> #Ref<0.0.2097153.224212>,[]},\n            0}],\n
> >> > >> [[{db_name,<<\"shards/e0000000-ffffffff/mastercom.1555403400\">>},\n
> >> > >> {engine,couch_bt_engine},\n            {doc_count,371},\n
> >> > >> {doc_del_count,0},\n            {update_seq,28010},\n
> >> > >> {purge_seq,0},\n            {compact_running,false},\n
> >> > >> {sizes,{[{active,1282636},{external,109712},{file,1298647}]}},\n
> >> > >> {disk_size,1298647},\n            {data_size,1282636},\n
> >> > >> {other,{[{data_size,109712}]}},\n
> >> > >> {instance_start_time,<<\"1555521000894744\">>},\n
> >> > >> {disk_format_version,7},\n
> >> > >> {committed_update_seq,28010},\n
> >> > >> {compacted_seq,28010},\n
> >> > >> {uuid,<<\"c3f60d2791bf5c3de01063af95f255b1\">>}],\n
> >> > >> [{db_name,<<\"shards/c0000000-dfffffff/mastercom.1555403400\">>},\n
> >> > >> {engine,couch_bt_engine},\n            {doc_count,379},\n
> >> > >> {doc_del_count,1},\n            {update_seq,28633},\n
> >> > >> {purge_seq,0},\n            {compact_running,false},\n
> >> > >> {sizes,{[{active,1314436},{external,117751},{file,1331415}]}},\n
> >> > >> {disk_size,1331415},\n            {data_size,1314436},\n
> >> > >> {other,{[{data_size,117751}]}},\n
> >> > >> {instance_start_time,<<\"1555521000894769\">>},\n
> >> > >> {disk_format_version,7},\n
> >> > >> {committed_update_seq,28633},\n
> >> > >> {compacted_seq,28633},\n
> >> > >> {uuid,<<\"740de6d1e5535ba64e43bc3ead349e84\">>}],\n
> >> > >> [{db_name,<<\"shards/a0000000-bfffffff/mastercom.1555403400\">>},\n
> >> > >> {engine,couch_bt_engine},\n            {doc_count,378},\n
> >> > >> {doc_del_count,0},\n            {update_seq,28502},\n
> >> > >> {purge_seq,0},\n            {compact_running,false},\n
> >> > >> {sizes,{[{active,1308584},{external,116630},{file,1413335}]}},\n
> >> > >> {disk_size,1413335},\n            {data_size,1308584},\n
> >> > >> {other,{[{data_size,116630}]}},\n
> >> > >> {instance_start_time,<<\"1555521000894704\">>},\n
> >> > >> {disk_format_version,7},\n
> >> > >> {committed_update_seq,28502},\n
> >> > >> {compacted_seq,28502},\n
> >> > >> {uuid,<<\"d6acfc8cdf0a172df3d8db22df1a4b68\">>}],\n
> >> > >> [{db_name,<<\"shards/80000000-9fffffff/mastercom.1555403400\">>},\n
> >> > >> {engine,couch_bt_engine},\n            {doc_count,373},\n
> >> > >> {doc_del_count,0},\n            {update_seq,27969},\n
> >> > >> {purge_seq,0},\n            {compact_running,false},\n
> >> > >> {sizes,{[{active,1290174},{external,125829},{file,1335511}]}},\n
> >> > >> {disk_size,1335511},\n            {data_size,1290174},\n
> >> > >> {other,{[{data_size,125829}]}},\n
> >> > >> {instance_start_time,<<\"1555521000894811\">>},\n
> >> > >> {disk_format_version,7},\n
> >> > >> {committed_update_seq,27969},\n
> >> > >> {compacted_seq,27969},\n
> >> > >> {uuid,<<\"6735a0b95e5ebdc3c5dfe08b4c81a108\">>}],\n
> >> > >> [{db_name,<<\"shards/60000000-7fffffff/mastercom.1555403400\">>},\n
> >> > >> {engine,couch_bt_engine},\n            {doc_count,379},\n
> >> > >> {doc_del_count,0},\n            {update_seq,28553},\n
> >> > >> {purge_seq,0},\n            {compact_running,false},\n
> >> > >> {sizes,{[{active,1312728},{external,123640},{file,1327319}]}},\n
> >> > >> {disk_size,1327319},\n            {data_size,1312728},\n
> >> > >> {other,{[{data_size,123640}]}},\n
> >> > >> {instance_start_time,<<\"1555521000894765\">>},\n
> >> > >> {disk_format_version,7},\n
> >> > >> {committed_update_seq,28553},\n
> >> > >> {compacted_seq,28553},\n
> >> > >> {uuid,<<\"4086cc00f08dea80fd962808cf9e37bf\">>}],\n
> >> > >> [{db_name,<<\"shards/20000000-3fffffff/mastercom.1555403400\">>},\n
> >> > >> {engine,couch_bt_engine},\n            {doc_count,372},\n
> >> > >> {doc_del_count,0},\n            {update_seq,27988},\n
> >> > >> {purge_seq,0},\n            {compact_running,false},\n
> >> > >> {sizes,{[{active,1287520},{external,121646},{file,1302743}]}},\n
> >> > >> {disk_size,1302743},\n            {data_size,1287520},\n
> >> > >> {other,{[{data_size,121646}]}},\n
> >> > >> {instance_start_time,<<\"1555521000894678\">>},\n
> >> > >> {disk_format_version,7},\n
> >> > >> {committed_update_seq,27988},\n
> >> > >> {compacted_seq,27988},\n
> >> > >> {uuid,<<\"c1327baaacb6ac82b71d2c540648e82d\">>}],\n
> >> > >> {cluster,[{q,8},{n,1},{w,1},{r,1}]}]}}","ref":3845389673}
> >> > >>
> >> > >> On the other server, one of the databases kept answering requests
> >> > >> regularly, while the other hung with messages like the one above.
> >> > >> Logs were normal for the answering database and non-existent for
> >> > >> the hung one.
> >> > >> While trying to access some general-level stats, the still-working
> >> > >> one froze too.
> >> > >> Only a hard restart of the Docker container got them working again.
> >> > >> No data was apparently corrupted or problematic.
> >> > >> The day after, it happened again, although this time they were both
> >> > >> completely frozen when we found them.
> >> > >> In the entire db there were only about 2k fairly small records, no
> >> > >> views, and only a couple of indexes for the Mango queries.
> >> > >> The db that hung on its own on the second server the first time was
> >> > >> completely empty.
> >> > >> The main difference between the development environment and the
> >> > >> crashing one is that the latter also has an unfiltered bidirectional
> >> > >> replication set up, in case that is related to the problem, but
> >> > >> before the crash the replication works perfectly, without errors or
> >> > >> any strange messages.
> >> > >>
> >> > >>
> >> > >> I realize this may not be enough information to pin down the cause,
> >> > >> but the problem is that it happens unpredictably, it never happened
> >> > >> in months in the development environment, the parameters we checked
> >> > >> are all the same, and I have no idea where to look for signs of
> >> > >> what the trouble may be.
> >> > >>
> >> > >> What can I check? Which parameters might be responsible for such a
> >> > >> problem?
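> >> > >>
> >> > >> (For reference, a node that still answers exposes its runtime stats
> >> > >> over HTTP; a rough sketch, assuming localhost and admin
> >> > >> credentials:)
> >> > >>
> >> > >>   curl -s http://admin:password@localhost:5984/_up
> >> > >>   curl -s http://admin:password@localhost:5984/_node/_local/_system
> >> > >>   # _system reports Erlang VM memory, run queue and per-process
> >> > >>   # message queue sizes, useful to snapshot before and during a freeze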
> >> > >>
> >> > >> Thanks in advance to everybody...
> >> > >>
> >> > >
> >> >
> >>
> >
>
