One trick: you can delete the doc and replicate with a filter like "return !doc['_deleted'];" that blocks all deletes. The target db will then not receive any trace of these highly conflicted docs.
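Spelled out, that filter might live in a design doc on the source like this (the design doc and filter names here are only placeholders):

    {
      "_id": "_design/repl",
      "filters": {
        "no_deletes": "function(doc, req) { return !doc['_deleted']; }"
      }
    }

and the replication kicked off with something like:

    POST /_replicate
    {"source": "sourcedb", "target": "targetdb", "filter": "repl/no_deletes"}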
On 14 March 2013 14:10, Stephen Bartell <[email protected]> wrote:
>
> On Mar 14, 2013, at 11:44 AM, Robert Newson <[email protected]> wrote:
>
>> Conflicts are *not* removed during compaction, CouchDB has no way of
>> knowing which ones it would be ok to delete.
>
> Yep, they need to be deleted in the context of the person/process
> manipulating the docs.
>
>> CouchDB does struggle to process documents with lots of conflicts;
>> we've encountered this at Cloudant a fair bunch. We resolve the
>> conflicts via HTTP if possible or, if that consistently fails, with a
>> direct Erlang manipulation. It's certainly something we need to
>> improve.
>
> But even deleting them yields the same problem. When replicating, the
> _deleted_conflicts is carried over. Users could be diligent in deleting
> conflicts, but still end up unable to replicate their docs because of
> the volume of _deleted_conflicts.
>
> Robert, thanks for chiming in. I feel better knowing I'm in good company
> with this problem. When this mine eventually goes off, CouchDB is
> rendered useless because beam.smp takes all the CPU. Is there any way to
> ration the resources CouchDB consumes?
>
>> B.
>>
>> On 14 March 2013 13:09, Riyad Kalla <[email protected]> wrote:
>>> Stephen,
>>> I am probably wrong here (someone hop in and correct me), but I
>>> thought compaction would remove the old revisions (and conflicts) of
>>> docs.
>>>
>>> Alternatively, a question for the CouchDB devs: if Stephen set
>>> _revs_limit to something artificially low, say 1, and restarted Couch
>>> and did a compaction, would that force the DB to smash down the
>>> datastore to 1 rev per doc and remove the long tail off these docs?
>>>
>>> REF: http://wiki.apache.org/couchdb/Compaction
>>>
>>> On Thu, Mar 14, 2013 at 2:02 AM, Stephen Bartell <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> tldr; I've got a database with just a couple of docs. Conflict
>>>> management went unchecked and these docs have thousands of conflicts
>>>> each. Replication fails. Couch consumes all the server's CPU.
>>>>
>>>> First the story, then the questions. Please bear with me!
>>>>
>>>> I wanted to replicate this database to another, new database. So I
>>>> started the replication. beam.smp took 100% of my CPU and the
>>>> replicator status held steady at a constant percentage for quite a
>>>> while. It eventually finished.
>>>>
>>>> I thought maybe I should handle the conflicts and then replicate.
>>>> Hopefully it would go faster next time. So I cleared all the
>>>> conflicts. I replicated again, but this time I could not get
>>>> anything to replicate. Again, CPU held steady, topped out. I
>>>> eventually restarted Couch.
>>>>
>>>> I dug through the logs and saw that the POSTs were failing. I figure
>>>> the replicator was timing out when trying to POST to Couch.
>>>>
>>>> I have a replicator that I've been working on that's written in
>>>> Node.js, so I started that one up to do the same thing. I drew
>>>> inspiration from PouchDB's replicator and from Jens Alfke's amazing
>>>> replication algorithm documentation, so my replicator follows more
>>>> or less the same story: 1) consume _changes with style=all_docs;
>>>> 2) revs_diff on the target database; 3) get each revision from the
>>>> source with revs=true; 4) bulk post with new_edits=false.
>>>>
>>>> Same thing. Except now I can kind of make sense of what's going on.
>>>> Sucking the data out of the source is no problem. Diffing the revs
>>>> against the target is no problem. Posting the docs is THE problem.
>>>> Since the target database is clean, thousands of docs are being
>>>> thrown at Couch at once to build up the revision trees, and Couch
>>>> just takes forever to finish the job. It doesn't matter whether I
>>>> bulk post the docs or post them individually; Couch sucks up 100%
>>>> of my CPU every time and takes forever to finish. (I actually never
>>>> let it finish.)
>>>>
>>>> So that is the story. Here are my questions.
>>>>
>>>> 1) Has anyone else stepped on this mine? If so, could I get pointed
>>>> towards some workarounds? I don't think it is right to assume that
>>>> users of CouchDB will never have databases with huge conflict
>>>> sausages like this, so simply saying "manage your conflicts" won't
>>>> help.
>>>>
>>>> 2) Let's say I did manage my conflicts. I still have the
>>>> _deleted_conflicts sausage. I know that _deleted and
>>>> _deleted_conflicts must be replicated to maintain consistency
>>>> across the cluster. If the replicator throws up when these huge
>>>> sausages come through, how is the data ever going to replicate? Is
>>>> there a trade secret I don't know about?
>>>>
>>>> 3) Is there any limit on the resources that CouchDB is allowed to
>>>> consume? I get that we run into cases where there's tons of data to
>>>> move and it's just going to take a hell of a long time. But I don't
>>>> get why it's permissible for CouchDB to eat all my CPU. The whole
>>>> server should never grind to a halt because it's moving lots of
>>>> data. I feel like it should be like the little engine that could:
>>>> just chug along slow and steady until it crests the hill.
>>>>
>>>> I would really like to rely on the Erlang replicator, but I can't.
>>>> At least with the replicator I wrote, I have a chance of throttling
>>>> the posts so CouchDB doesn't render my server useless.
>>>>
>>>> Sorry for wrapping more questions into those questions. I'm pretty
>>>> tired, stumped, and have machines in production crumbling.
>>>>
>>>> Best,
>>>> Stephen
>
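For reference, Riyad's _revs_limit suggestion translates to a plain PUT against the database (URL and database name are placeholders):

    curl -X PUT http://localhost:5984/sourcedb/_revs_limit -d '1'

And the four-step loop Stephen describes sketches out roughly like this. This is a minimal sketch, written against a modern Node (18+, for the global fetch) rather than 2013-era Node; names are placeholders, and it elides error handling, batching, and the throttling he mentions:

    // Minimal pull-replication sketch: changes -> revs_diff -> get -> bulk post.
    const COUCH = 'http://localhost:5984'; // placeholder

    async function replicate(source, target) {
      // 1) Consume _changes with style=all_docs to see every leaf
      //    revision, not just the winning one.
      const changes = await (await fetch(
        `${COUCH}/${source}/_changes?style=all_docs`)).json();
      const revs = {};
      for (const row of changes.results) {
        revs[row.id] = row.changes.map(c => c.rev);
      }

      // 2) Ask the target which of those revisions it is missing.
      const diff = await (await fetch(`${COUCH}/${target}/_revs_diff`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify(revs),
      })).json();

      // 3) Fetch each missing revision from the source with its rev
      //    history attached (revs=true).
      const docs = [];
      for (const [id, { missing }] of Object.entries(diff)) {
        for (const rev of missing) {
          docs.push(await (await fetch(
            `${COUCH}/${source}/${encodeURIComponent(id)}?rev=${rev}&revs=true`
          )).json());
        }
      }

      // 4) Post them to the target verbatim; new_edits=false preserves
      //    the revision trees instead of creating new edits. This is the
      //    step that pins beam.smp when thousands of revisions arrive at
      //    once, so real code would batch and throttle here.
      await fetch(`${COUCH}/${target}/_bulk_docs`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ docs, new_edits: false }),
      });
    }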
