One trick: you can delete the doc and replicate with a filter like "return !doc['_deleted'];" that blocks all deletes. The target db will then not receive any trace of these highly conflicted docs.
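Spelled out, that filter might live in a design doc on the source like this (the design doc and filter names here are only placeholders):

    {
      "_id": "_design/repl",
      "filters": {
        "no_deletes": "function(doc, req) { return !doc['_deleted']; }"
      }
    }

and the replication kicked off with something like:

    POST /_replicate
    {"source": "sourcedb", "target": "targetdb", "filter": "repl/no_deletes"}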
On 14 March 2013 14:10, Stephen Bartell <[email protected]> wrote:
>
> On Mar 14, 2013, at 11:44 AM, Robert Newson <[email protected]> wrote:
>
>> Conflicts are *not* removed during compaction, CouchDB has no way of
>> knowing which ones it would be ok to delete.
>
> Yep, they need to be deleted in the context of the person/process
> manipulating the docs.
>
>> CouchDB does struggle to process documents with lots of conflicts;
>> we've encountered this at Cloudant a fair bunch. We resolve the
>> conflicts via HTTP if possible or, if that consistently fails, with a
>> direct Erlang manipulation. It's certainly something we need to
>> improve.
>
> But even deleting them yields the same problem. When replicating, the
> _deleted_conflicts is carried over. Users could be diligent in deleting
> conflicts, but still end up unable to replicate their docs because of
> the volume of _deleted_conflicts.
>
> Robert, thanks for chiming in. I feel better knowing I'm in good company
> with this problem. When this mine eventually goes off, CouchDB is
> rendered useless because beam.smp takes all the CPU. Is there any way to
> ration the resources CouchDB consumes?
>
>> B.
>>
>> On 14 March 2013 13:09, Riyad Kalla <[email protected]> wrote:
>>> Stephen,
>>> I am probably wrong here (someone hop in and correct me), but I
>>> thought compaction would remove the old revisions (and conflicts) of
>>> docs.
>>>
>>> Alternatively, a question for the CouchDB devs: if Stephen set
>>> _revs_limit to something artificially low, say 1, and restarted Couch
>>> and did a compaction, would that force the DB to smash down the
>>> datastore to 1 rev per doc and remove the long tail off these docs?
>>>
>>> REF: http://wiki.apache.org/couchdb/Compaction
>>>
>>> On Thu, Mar 14, 2013 at 2:02 AM, Stephen Bartell <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> tldr; I've got a database with just a couple of docs. Conflict
>>>> management went unchecked and these docs have thousands of conflicts
>>>> each. Replication fails. Couch consumes all the server's CPU.
>>>>
>>>> First the story, then the questions. Please bear with me!
>>>>
>>>> I wanted to replicate this database to another, new database. So I
>>>> started the replication. beam.smp took 100% of my CPU and the
>>>> replicator status held steady at a constant percentage for quite a
>>>> while. It eventually finished.
>>>>
>>>> I thought maybe I should handle the conflicts and then replicate.
>>>> Hopefully it would go faster next time. So I cleared all the
>>>> conflicts. I replicated again, but this time I could not get
>>>> anything to replicate. Again, CPU held steady, topped out. I
>>>> eventually restarted Couch.
>>>>
>>>> I dug through the logs and saw that the POSTs were failing. I figure
>>>> the replicator was timing out when trying to POST to Couch.
>>>>
>>>> I have a replicator that I've been working on that's written in
>>>> Node.js, so I started that one up to do the same thing. I drew
>>>> inspiration from PouchDB's replicator and from Jens Alfke's amazing
>>>> replication algorithm documentation, so my replicator follows more
>>>> or less the same story: 1) consume _changes with style=all_docs;
>>>> 2) revs_diff on the target database; 3) get each revision from the
>>>> source with revs=true; 4) bulk post with new_edits=false.
>>>>
>>>> Same thing. Except now I can kind of make sense of what's going on.
>>>> Sucking the data out of the source is no problem. Diffing the revs
>>>> against the target is no problem. Posting the docs is THE problem.
>>>> Since the target database is clean, thousands of docs are being
>>>> thrown at Couch at once to build up the revision trees, and Couch
>>>> just takes forever to finish the job. It doesn't matter whether I
>>>> bulk post the docs or post them individually; Couch sucks up 100%
>>>> of my CPU every time and takes forever to finish. (I actually never
>>>> let it finish.)
>>>>
>>>> So that is the story. Here are my questions.
>>>>
>>>> 1) Has anyone else stepped on this mine? If so, could I get pointed
>>>> towards some workarounds? I don't think it is right to assume that
>>>> users of CouchDB will never have databases with huge conflict
>>>> sausages like this, so simply saying "manage your conflicts" won't
>>>> help.
>>>>
>>>> 2) Let's say I did manage my conflicts. I still have the
>>>> _deleted_conflicts sausage. I know that _deleted and
>>>> _deleted_conflicts must be replicated to maintain consistency
>>>> across the cluster. If the replicator throws up when these huge
>>>> sausages come through, how is the data ever going to replicate? Is
>>>> there a trade secret I don't know about?
>>>>
>>>> 3) Is there any limit on the resources that CouchDB is allowed to
>>>> consume? I get that we run into cases where there's tons of data to
>>>> move and it's just going to take a hell of a long time. But I don't
>>>> get why it's permissible for CouchDB to eat all my CPU. The whole
>>>> server should never grind to a halt because it's moving lots of
>>>> data. I feel like it should be like the little engine that could:
>>>> just chug along slow and steady until it crests the hill.
>>>>
>>>> I would really like to rely on the Erlang replicator, but I can't.
>>>> At least with the replicator I wrote, I have a chance of throttling
>>>> the posts so CouchDB doesn't render my server useless.
>>>>
>>>> Sorry for wrapping more questions into those questions. I'm pretty
>>>> tired, stumped, and have machines in production crumbling.
>>>>
>>>> Best,
>>>> Stephen
>
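For reference, Riyad's _revs_limit suggestion translates to a plain PUT against the database (URL and database name are placeholders):

    curl -X PUT http://localhost:5984/sourcedb/_revs_limit -d '1'

And the four-step loop Stephen describes sketches out roughly like this. This is a minimal sketch, written against a modern Node (18+, for the global fetch) rather than 2013-era Node; names are placeholders, and it elides error handling, batching, and the throttling he mentions:

    // Minimal pull-replication sketch: changes -> revs_diff -> get -> bulk post.
    const COUCH = 'http://localhost:5984'; // placeholder

    async function replicate(source, target) {
      // 1) Consume _changes with style=all_docs to see every leaf
      //    revision, not just the winning one.
      const changes = await (await fetch(
        `${COUCH}/${source}/_changes?style=all_docs`)).json();
      const revs = {};
      for (const row of changes.results) {
        revs[row.id] = row.changes.map(c => c.rev);
      }

      // 2) Ask the target which of those revisions it is missing.
      const diff = await (await fetch(`${COUCH}/${target}/_revs_diff`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(revs),
      })).json();

      // 3) Fetch each missing revision from the source with its rev
      //    history attached (revs=true).
      const docs = [];
      for (const [id, { missing }] of Object.entries(diff)) {
        for (const rev of missing) {
          docs.push(await (await fetch(
            `${COUCH}/${source}/${encodeURIComponent(id)}?rev=${rev}&revs=true`
          )).json());
        }
      }

      // 4) Post them to the target verbatim; new_edits=false preserves
      //    the revision trees instead of creating new edits. This is the
      //    step that pins beam.smp when thousands of revisions arrive at
      //    once, so real code would batch and throttle here.
      await fetch(`${COUCH}/${target}/_bulk_docs`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ docs, new_edits: false }),
      });
    }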
