What max_dbs_open value do I need to avoid the checkpoint_commit_failure errors?

Andreas Kemkes Mon, 16 Jul 2012 17:16:30 -0700

The current max_dbs_open value is set at 600.

The server is running 112 continuous replications with the following topology:


                 +-->  F001
S(*)  --->  T  --|     ...
                 +-->  F111

(*) S is on a different host
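
To make the topology concrete, here is a minimal sketch of the 112 replication 
request bodies as they could be POSTed to /_replicate.  The host name, database 
names, and the filter name "app/by_target" are placeholders assumed for 
illustration, not the real ones from this setup.

```python
import json

# Hypothetical names: the real host, databases, and filter are not shown here.
SOURCE_S = "http://source-host:5984/s"   # S lives on a different host
TARGET_T = "t"                           # T is local to this server


def build_replications(n_filtered=111):
    """Return one S -> T body plus n_filtered filtered T -> Fnnn bodies."""
    reps = [{"source": SOURCE_S, "target": TARGET_T, "continuous": True}]
    for i in range(1, n_filtered + 1):
        name = "f%03d" % i
        reps.append({
            "source": TARGET_T,
            "target": "http://localhost:5984/" + name,
            "filter": "app/by_target",          # placeholder filter function
            "query_params": {"target": name},   # placeholder filter argument
            "continuous": True,
        })
    return reps


replications = build_replications()
print(len(replications))
print(json.dumps(replications[1], sort_keys=True))
```

Nothing is sent to a server here; the sketch only builds the request bodies so 
the number and shape of the replications is easy to see.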

On the first data change at the source database, the following issue was logged 
and the replication between S and T died:

{checkpoint_commit_failure,<<"Target database out of sync. Try to increase 
max_dbs_open at the target's server.">>}


One of the filtered replications between T and Fn died as well 2 seconds later 
with the same checkpoint_commit_failure issue.  I suspect that it was the one 
that let the new document through its filter, but cannot verify.


Upon restart of the replication between S and T, it ran to completion, but 
several of the filtered replications died with the same issue from above.  I 
suspect that all filtered replications that let the new documents through their 
filters were affected, but cannot verify.


After starting the failed filtered replications once more, everything runs to 
completion.

Another change triggers the following issue, yet the replication keeps running 
and the filtered replication shows no sign of a problem:

{checkpoint_commit_failure,<<"Error updating the source checkpoint document: 
conflict">>}


...

[Mon, 16 Jul 2012 23:34:10 GMT] [info] [<0.27578.249>] recording a checkpoint 
for `S` -> `T` at source update_seq 169029

...
[Mon, 16 Jul 2012 23:34:17 GMT] [info] [<0.28279.247>] recording a checkpoint 
for `T` -> `http://Fx` at source update_seq 52930
...

Subsequent changes at the source do not trigger any other errors in the log 
files.

Is this last issue related to the previous ones or just coincidental?
Is there a formula that allows me to project the value I need to choose 
for max_dbs_open?

What is the reason that the value of 600 appears to be too low?

I also see a lot of 'GET /llfs/ 200' in the logs, probably originating from the 
112 replications - it appears they poll every 5 seconds.

Is there a parameter to reduce the interval?  I've looked and couldn't find it, 
but might have missed it.

One other thing I noticed is that if you start 2 continuous replications, one 
with 'create_target': true and another without the parameter, the replications 
are treated as different and not recognized as 'already running'.  In my 
opinion, as 'create_target' is a no-op when the database already exists, they 
should be recognized as 'already running'.  What happens in the case of 2 
identical replications running?
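
The two requests I mean can be sketched as follows.  The database names are 
placeholders, and the sketch only shows that the bodies differ in the single 
'create_target' key; it makes no claim about how the server derives the 
identity of a replication.

```python
# Placeholder database names, assumed for illustration only.
base = {"source": "t", "target": "f001", "continuous": True}

# The same request with 'create_target' added.
with_create = dict(base, create_target=True)

# Keys whose values differ between the two request bodies.
differing = {
    key
    for key in set(base) | set(with_create)
    if base.get(key) != with_create.get(key)
}
print(sorted(differing))
```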


-- Andreas
