On Oct 19, 2009, at 10:14 AM, Simon Eisenmann wrote:
On Monday, 19.10.2009, at 10:04 -0400, Adam Kocoloski wrote:
On Oct 19, 2009, at 10:00 AM, Simon Eisenmann wrote:
Paul,
On Monday, 19.10.2009, at 09:53 -0400, Paul Davis wrote:
Hmmm, that sounds most odd. Are there any consistencies in when it hangs? Specifically, does it look like it's a poison doc that causes things to go wonky, or some such? Do nodes fail in a specific order?
The only pattern I see is that somehow the slowest node never seems to fail. The other two nodes have roughly the same performance.
Also, you might try setting up continuous replication instead of the update notifications, as that might be a bit more ironed out.
I have already considered that, but as long as there is no way to figure out whether a continuous replication is still up and running, I cannot use it, because I have to restart it when a node fails and comes back up again later.
Another thing to check is whether it's just the task status that's wonky vs. actual replication. You can check the _local doc that's created by replication to see if its update seq is changing while task statuses aren't.
If only the status were hanging, I should be able to start the replication again, correct? But that hangs as well.
Hi Simon, is this hang related to the accept_failed bug report you
just filed[1], or is it separate? Best,
Adam
[1]: https://issues.apache.org/jira/browse/COUCHDB-536
Hi Adam,
I would consider it separate. The accept_failed issue happens only when there are lots and lots of changes (essentially while True { put a couple of docs, query views, delete docs }).
Simon
So, until JIRA comes back online I'll follow up with that here. I
think I could see how repeated pull replications in rapid succession
could end up blowing through sockets. Each pull replication sets up
one new connection for the _changes feed, and tears it down at the end
(everything else replication-related goes through a connection pool).
Do enough of those very short requests and you could end up with lots
of connections in TIME_WAIT and eventually run out of sockets. FWIW,
the default Erlang limit is slightly less than 1024. If your
update_notification process uses a new connection for every POST to
_replicate you'll hit the system limit (also 1024 in Ubuntu IIRC)
twice as fast.
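To put rough numbers on the above (a back-of-the-envelope sketch; the 60-second TIME_WAIT duration is a typical Linux default and is an assumption here, while the 1024 descriptor figures come from the limits mentioned above):

```python
# Rough estimate of how quickly short-lived pull replications can
# exhaust sockets when each one opens and tears down a _changes connection.

TIME_WAIT_SECONDS = 60   # typical Linux TCP TIME_WAIT duration (assumption)
SOCKET_LIMIT = 1024      # default Erlang / Ubuntu per-process limit

# Each completed replication leaves one socket in TIME_WAIT for ~60s,
# so the steady-state ceiling on replications per second is roughly:
max_replications_per_sec = SOCKET_LIMIT / TIME_WAIT_SECONDS
print(f"~{max_replications_per_sec:.0f} replications/sec before TIME_WAIT fills the table")

# If every POST to _replicate also uses a fresh connection, each cycle
# burns two sockets, halving the sustainable rate:
print(f"~{max_replications_per_sec / 2:.0f} replications/sec with one-shot _replicate POSTs")
```

So a tight update-notification loop firing a fresh pull replication on every change can plausibly hit that ceiling within a minute or two.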
Continuous replication is really our preferred solution for your
scenario. If you can live with interpreting the records in the _local
document to verify that it's still running you'll end up with a more
efficient replication system all around.
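For reference, kicking off a continuous replication is a single POST to _replicate with "continuous" set to true. A minimal sketch, constructing just the request body (the host, port, and database names here are placeholders; nothing is actually sent):

```python
import json

# Hypothetical source and target; substitute your own nodes and db names.
body = {
    "source": "http://node-a:5984/mydb",
    "target": "mydb",
    "continuous": True,
}
payload = json.dumps(body)
print(payload)

# POST this payload to http://localhost:5984/_replicate with
# Content-Type: application/json, for example via curl:
#   curl -X POST http://localhost:5984/_replicate \
#        -H 'Content-Type: application/json' -d "$payload"
```

The checkpoint document Paul mentioned lives under _local/&lt;replication-id&gt; on both databases; watching its recorded source sequence advance is the check for whether the replication is still making progress.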
Regarding the hangs, if you do write a test script I'll be more than
happy to try it and figure out what's going wrong. Best,
Adam