Runaway processes are the very devil, but the problem is not specific
to CouchDB; there is no CouchDB mechanism for this, just as there's no
bash/python/ruby/perl method to limit a while(true){} loop.

Highly conflicted documents are painful to update and read. I can't do
anything about that today.

B.

On 14 March 2013 17:23, Stephen Bartell <[email protected]> wrote:
> Robert, this only works if I don't need to keep those docs around anymore.
> In my case, I want to keep the docs; I don't want to keep the conflicts of
> the docs. Most importantly, though, even if I delete all the conflicts on
> all my docs, I still have the problem of _deleted_conflicts. What I've seen
> is that only a few docs with a few thousand _deleted_conflicts each will
> plug up CouchDB and render it unusable. You can't get rid of it through
> natural means.
>
> This is what Riyad was bringing up and what I've implemented. I have a
> program which replicates from the troubled database's _changes feed with
> the query param style=main_only. This lets me keep the revision tree of
> the troubled database, but without the _deleted_conflicts. I can then wipe
> out the troubled db, recreate it, and replicate the shiny clean data back
> into it.
>
> This is unnatural and requires custom code to make happen. I can live with
> it until a better solution comes around.
>
> What I'm really concerned about is how CouchDB eats all my CPU.
>
> Is there any way to ration the resources that CouchDB uses? Like telling
> it not to use more than 50% or something. I think that CouchDB eating all
> the resources on a machine just because it's reading loads of data is a
> bug. Is this a reasonable conclusion?
>
> On Mar 14, 2013, at 2:18 PM, Robert Newson <[email protected]> wrote:
>
>> One trick: you can delete the doc and replicate with a filter like
>> 'return !doc['_deleted'];' that blocks all deletes. The target db will
>> then not receive any trace of these highly conflicted docs.
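The filter trick described above can be sketched roughly as follows. The design-doc name `cleanup` and the database URLs are illustrative assumptions, not anything from the thread:

```javascript
// A sketch of the "block all deletes" replication filter. CouchDB filter
// functions receive (doc, req) and return a boolean; returning false
// excludes the doc (including its tombstone) from the replication.
const designDoc = {
  _id: '_design/cleanup',          // illustrative name
  filters: {
    no_deletes: 'function(doc, req) { return !doc._deleted; }'
  }
};

// Body for POST /_replicate that applies the filter (URLs are assumptions):
const replicationRequest = {
  source: 'http://localhost:5984/troubled_db',
  target: 'http://localhost:5984/clean_db',
  filter: 'cleanup/no_deletes'
};

console.log(JSON.stringify(replicationRequest));
```

With this in place, deleted (highly conflicted) docs never reach the target, at the cost that legitimate deletions stop propagating too.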
>>
>> On 14 March 2013 14:10, Stephen Bartell <[email protected]> wrote:
>>>
>>> On Mar 14, 2013, at 11:44 AM, Robert Newson <[email protected]> wrote:
>>>
>>>> Conflicts are *not* removed during compaction; CouchDB has no way of
>>>> knowing which ones it would be OK to delete.
>>>
>>> Yep, they need to be deleted in the context of the person/process
>>> manipulating the docs.
>>>
>>>> CouchDB does struggle to process documents with lots of conflicts;
>>>> we've encountered this at Cloudant a fair bunch. We resolve the
>>>> conflicts via HTTP if possible or, if that consistently fails, with a
>>>> direct Erlang manipulation. It's certainly something we need to
>>>> improve.
>>>
>>> But even deleting them yields the same problem. When replicating, the
>>> _deleted_conflicts is carried over. Users could be diligent in deleting
>>> conflicts, but still end up unable to replicate their docs because of
>>> the volume of _deleted_conflicts.
>>>
>>> Robert, thanks for chiming in. I feel better knowing I'm in good company
>>> with this problem. When this mine eventually goes off, CouchDB is
>>> rendered useless because beam.smp takes all the CPU. Is there any way to
>>> ration the resources CouchDB consumes?
>>>
>>>> B.
>>>>
>>>> On 14 March 2013 13:09, Riyad Kalla <[email protected]> wrote:
>>>>> Stephen,
>>>>> I am probably wrong here (someone hop in and correct me), but I
>>>>> thought compaction would remove the old revisions (and conflicts) of
>>>>> docs.
>>>>>
>>>>> Alternatively, a question for the CouchDB devs: if Stephen set
>>>>> _revs_limit to something artificially low, say 1, restarted CouchDB,
>>>>> and did a compaction, would that force the DB to smash down the
>>>>> datastore to 1 rev per doc and remove the long tail off these docs?
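On the wire, the `_revs_limit`-plus-compaction idea above would look roughly like this. This is a sketch of the requests only (no network I/O); the host and database names are assumptions, and note that, per the thread, lowering `_revs_limit` trims the stored length of each revision path on compaction rather than removing conflicting leaves:

```javascript
// Build the two requests for lowering _revs_limit and compacting a db.

function revsLimitRequest(dbUrl, limit) {
  // PUT /{db}/_revs_limit takes the bare number as the request body.
  return { method: 'PUT', url: dbUrl + '/_revs_limit', body: String(limit) };
}

function compactRequest(dbUrl) {
  // POST /{db}/_compact requires a JSON Content-Type header.
  return {
    method: 'POST',
    url: dbUrl + '/_compact',
    headers: { 'Content-Type': 'application/json' }
  };
}

const db = 'http://localhost:5984/troubled_db'; // illustrative
console.log(revsLimitRequest(db, 1));
console.log(compactRequest(db));
```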
>>>>>
>>>>> REF: http://wiki.apache.org/couchdb/Compaction
>>>>>
>>>>> On Thu, Mar 14, 2013 at 2:02 AM, Stephen Bartell
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> tl;dr: I've got a database with just a couple of docs. Conflict
>>>>>> management went unchecked and these docs have thousands of conflicts
>>>>>> each. Replication fails. CouchDB consumes all the server's CPU.
>>>>>>
>>>>>> First the story, then the questions. Please bear with me!
>>>>>>
>>>>>> I wanted to replicate this database to another, new database. So I
>>>>>> started the replication. beam.smp took 100% of my CPU and the
>>>>>> replicator status held steady at a constant percent for quite a
>>>>>> while. It eventually finished.
>>>>>>
>>>>>> I thought maybe I should handle the conflicts and then replicate.
>>>>>> Hopefully it would go faster next time. So I cleared all the
>>>>>> conflicts. I replicated again, but this time I could not get
>>>>>> anything to replicate. Again, CPU held steady, topped out. I
>>>>>> eventually restarted CouchDB.
>>>>>>
>>>>>> I dug through the logs and saw that the POSTs were failing. I figure
>>>>>> that the replicator was timing out when trying to post to CouchDB.
>>>>>>
>>>>>> I have a replicator that I've been working on that's written in
>>>>>> node.js. So I started that one up to do the same thing. I drew
>>>>>> inspiration from PouchDB's replicator and from Jens Alfke's amazing
>>>>>> replication algorithm documentation, so my replicator follows more
>>>>>> or less the same story: 1) consume _changes with style=all_docs;
>>>>>> 2) _revs_diff on the target database; 3) get each revision from the
>>>>>> source with revs=true; 4) bulk post with new_edits=false.
>>>>>>
>>>>>> Same thing. Except now I can kind of make sense of what's going on.
>>>>>> Sucking the data out of the source is no problem. Diffing the revs
>>>>>> against the target is no problem. Posting the docs is THE problem.
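The four steps above can be sketched as pure helpers that build the CouchDB requests (no network I/O here; URLs and field names follow the standard replication endpoints, but the concrete values are illustrative):

```javascript
// 1) Read the changes feed; style is 'all_docs' (every leaf) or 'main_only'.
function changesUrl(db, style) {
  return db + '/_changes?style=' + style;
}

// 2) Ask the target which revisions it is missing: POST {target}/_revs_diff
//    with a body mapping doc id -> [revs seen in the changes feed].
function revsDiffBody(changeRows) {
  const body = {};
  for (const row of changeRows) {
    body[row.id] = row.changes.map(c => c.rev);
  }
  return body;
}

// 3) Fetch the missing revisions with their rev history (?revs=true),
//    using open_revs to pull several leaves in one GET.
function openRevsUrl(db, id, revs) {
  return db + '/' + encodeURIComponent(id) +
    '?open_revs=' + encodeURIComponent(JSON.stringify(revs)) + '&revs=true';
}

// 4) Write the docs verbatim, preserving their rev trees:
//    POST {target}/_bulk_docs with new_edits=false.
function bulkDocsBody(docs) {
  return { docs: docs, new_edits: false };
}
```

Step 4 is where the thread's pain lives: with `new_edits=false`, every posted revision is stitched into the target's revision tree, which is exactly the work that pins beam.smp at 100% when a doc carries thousands of conflicts.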
>>>>>> Since the target database is clean, thousands of docs are being
>>>>>> thrown at CouchDB at once to build up the revision trees. CouchDB is
>>>>>> just taking forever to finish the job. It doesn't matter if I bulk
>>>>>> post the docs or post them individually; CouchDB sucks up 100% of my
>>>>>> CPU every time and takes forever to finish. (I actually never let it
>>>>>> finish.)
>>>>>>
>>>>>> So that is the story. Here are my questions.
>>>>>>
>>>>>> 1) Has anyone else stepped on this mine? If so, could I get pointed
>>>>>> towards some workarounds? I don't think it is right to assume that
>>>>>> users of CouchDB will never have databases with huge conflict
>>>>>> sausages like this. So simply saying "manage your conflicts" won't
>>>>>> help.
>>>>>>
>>>>>> 2) Let's say I did manage my conflicts. I still have the
>>>>>> _deleted_conflicts sausage. I know that _deleted and
>>>>>> _deleted_conflicts must be replicated to maintain consistency across
>>>>>> the cluster. If the replicator throws up when these huge sausages
>>>>>> come through, how is the data ever going to replicate? Is there a
>>>>>> trade secret I don't know about?
>>>>>>
>>>>>> 3) Is there any limit on the resources that CouchDB is allowed to
>>>>>> consume? I get that we run into cases where there's tons of data to
>>>>>> move and it's just going to take a hell of a long time. But I don't
>>>>>> get why it's permissible for CouchDB to eat all my CPU. The whole
>>>>>> server should never grind to a halt because it's moving lots of
>>>>>> data. I feel like it should be like the little train that could:
>>>>>> just chug along slow and steady until it crests the hill.
>>>>>>
>>>>>> I would really like to rely on the Erlang replicator, but I can't.
>>>>>> At least with the replicator I wrote, I have a chance of throttling
>>>>>> the posts so CouchDB doesn't render my server useless.
>>>>>>
>>>>>> Sorry for wrapping more questions into those questions. I'm pretty
>>>>>> tired, stumped, and have machines in production crumbling.
>>>>>>
>>>>>> Best,
>>>>>> Stephen
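The throttling mentioned in the thread, posting revisions in small batches with a pause between them so CouchDB is never handed thousands of rev-tree insertions at once, could be sketched like this. The batch size and delay are illustrative knobs, not recommended values, and `postBatch` stands in for whatever function actually POSTs to `_bulk_docs`:

```javascript
// Post docs to the target in throttled batches. postBatch is a caller-
// supplied async function (e.g. one that POSTs {docs, new_edits: false}
// to the target's _bulk_docs endpoint).
async function postInBatches(docs, postBatch, batchSize = 100, pauseMs = 500) {
  const results = [];
  for (let i = 0; i < docs.length; i += batchSize) {
    const batch = docs.slice(i, i + batchSize);
    results.push(await postBatch(batch));
    // Give CouchDB breathing room before the next batch.
    if (i + batchSize < docs.length) {
      await new Promise(resolve => setTimeout(resolve, pauseMs));
    }
  }
  return results;
}
```

This trades total replication time for a bounded request rate, which is the only lever available client-side until CouchDB itself offers resource limits.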
