I don't have any immediate thoughts, no. I'm not "recite details from memory" familiar with this part of the code base; AFAIK it could be anything from a networking blip to a pathological log formatting issue.
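That said, if the pain really is the replicator formatting a half-million-line error report into couch.log, two knobs in local.ini might be worth a try. This is a guess on my part rather than something I've verified against this code path: drop the log level so routine replicator chatter isn't formatted at all, and lower the retry count that the 1.1 docs quoted below describe (the value 3 is just an example):

    [log]
    ; only error-level messages get written, which cuts down on the
    ; per-document "failed to replicate" chatter from the replicator
    level = error

    [replicator]
    ; per the docs Scott quotes below: how many times CouchDB retries a
    ; replication started via the _replicator database (default is 10)
    max_replication_retry_count = 3

I don't know offhand whether the final termination report gets emitted regardless of level, so treat this as a shot in the dark. I've also put a rough sketch of the "watcher" approach kowsik describes at the bottom of this mail, below the quoted thread.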
On Thu, Sep 1, 2011 at 9:11 PM, kowsik <[email protected]> wrote:
> Wow, I'm shocked by the eerie silence on this. So I take it there are
> no clues in my prior emails to figure out why the replicator is
> backing up and then dumping a 500,000 line stack trace?
>
> Dunno if it helps, but here's what we see. The number of documents
> between the two clusters will start to differ (meaning things are not
> replicating fast enough) and then we'll see 100% CPU utilization on one
> of them while at the same time watching the memory utilization grow.
> Could it be the geo-latency that's causing the problem?
>
> Just to see if it makes a difference, we are moving our CouchDB
> cluster to an m2.2xlarge instance (big honking instance with fast IO)
> as well as using instance storage instead of EBS. Will report back on
> what we see. But we definitely could use some help here.
>
> Thanks,
>
> K.
> ---
> http://blitz.io
> @pcapr
>
> On Thu, Sep 1, 2011 at 7:29 AM, kowsik <[email protected]> wrote:
>> One more observation. It seems the memory goes up dramatically while
>> the replicator task is writing all the failed-to-replicate docs to the
>> log (it ends with this):
>>
>> ** Reason for termination ==
>> ** {http_request_failed,<<"failed to replicate http://host/db">>}
>>
>> Is there a way to disable logging for the replicator? Interestingly
>> enough, as soon as we restart, the replicator simply catches up and
>> pretends there were no problems.
>>
>> K.
>> ---
>> http://blog.mudynamics.com
>> http://blitz.io
>> @pcapr
>>
>> On Thu, Sep 1, 2011 at 7:18 AM, kowsik <[email protected]> wrote:
>>> Right before I sent this email we restarted CouchDB and now it's at
>>> 14% memory usage and climbing. Is there anything we can look at
>>> stats-wise to see where the pressure in the system is? I realize task
>>> stats are being added to trunk, but on 1.1, anything?
>>>
>>> Thanks,
>>>
>>> K.
>>> ---
>>> http://blog.mudynamics.com
>>> http://blitz.io
>>> @pcapr
>>>
>>> On Thu, Sep 1, 2011 at 6:35 AM, Scott Feinberg <[email protected]>
>>> wrote:
>>>> I haven't had that issue, though I'm not using 1.1 in a
>>>> production environment, just using it to replicate like crazy (millions of
>>>> docs in each of my 20+ databases). I was running a server with 1 GB of
>>>> memory and didn't have an issue; it handled it fine.
>>>>
>>>> However... from http://docs.couchbase.org/couchdb-release-1.1/index.html
>>>>
>>>> When you PUT/POST a document to the _replicator database, CouchDB will
>>>> attempt to start the replication up to 10 times (configurable under
>>>> [replicator], parameter max_replication_retry_count).
>>>>
>>>> Not sure if that helps.
>>>>
>>>> --Scott
>>>>
>>>> On Thu, Sep 1, 2011 at 9:28 AM, kowsik <[email protected]> wrote:
>>>>
>>>>> Ran into this twice so far in production CouchDB in the last two days.
>>>>> We are running CouchDB 1.1 on an EC2 AMI with multi-master replication
>>>>> across two regions. I notice that every now and then CouchDB will
>>>>> simply suck up 100% CPU and 50% of the total memory and not respond at
>>>>> all. So far the logs only show sporadic replication errors. One of the
>>>>> stack traces (failed to replicate after 10 times) is about 500,000
>>>>> lines long. We are using the _replicator database.
>>>>>
>>>>> Anyone else running into this? Since 1.1 doesn't have the
>>>>> try-until-infinity-and-beyond mode, we have a worker task that watches
>>>>> the _replication_state and kicks the replicator as soon as it errors
>>>>> out.
>>>>> Are there any settings in terms of replicator memory usage, etc.
>>>>> that could help us?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> K.
>>>>> ---
>>>>> http://blog.mudynamics.com
>>>>> http://blitz.io
>>>>> @pcapr
>>>>>
>>>>
>>>
>>
>
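As promised above, here's a rough sketch of the kind of watcher task kowsik describes: poll the _replicator database and re-kick any replication doc whose _replication_state has gone to "error". This is purely my own illustration, not code from anyone's setup; the server URL, poll interval, and the delete-and-recreate approach are assumptions, and it leaves out authentication entirely.

    import json
    import time
    import urllib.request

    # Placeholders; adjust to your own setup.
    COUCH = "http://localhost:5984"
    REPL_DB = "_replicator"
    POLL_SECS = 30


    def call(method, path, body=None):
        # Minimal JSON-over-HTTP helper against the CouchDB REST API.
        data = json.dumps(body).encode() if body is not None else None
        req = urllib.request.Request(COUCH + path, data=data, method=method)
        req.add_header("Content-Type", "application/json")
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())


    def kick_errored_replications():
        rows = call("GET", "/%s/_all_docs?include_docs=true" % REPL_DB)["rows"]
        for row in rows:
            doc = row["doc"]
            if doc["_id"].startswith("_design/"):
                continue
            if doc.get("_replication_state") != "error":
                continue
            # Delete the errored doc, then re-create it without _rev and
            # without the server-managed _replication_* fields, so the
            # replicator picks it up as a fresh job.
            call("DELETE", "/%s/%s?rev=%s" % (REPL_DB, doc["_id"], doc["_rev"]))
            fresh = {k: v for k, v in doc.items()
                     if k != "_rev" and not k.startswith("_replication_")}
            call("PUT", "/%s/%s" % (REPL_DB, doc["_id"]), fresh)
            print("re-kicked", doc["_id"])


    if __name__ == "__main__":
        while True:
            kick_errored_replications()
            time.sleep(POLL_SECS)

Deleting and re-creating the doc (minus the server-managed _replication_* fields) is the simplest way I know of to make the replicator treat it as a brand-new job, but again, this is a sketch, not something I've run against 1.1 in production.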
