[jira] [Commented] (COUCHDB-2975) Automatically restart replication jobs if they crash
[ https://issues.apache.org/jira/browse/COUCHDB-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211933#comment-15211933 ]

Nick Vatamaniuc commented on COUCHDB-2975:

Noticed that transient mode does not clean up child specs after a child is done, even if the exit is normal. The intent behind that is to let users restart children.

From the Erlang docs:

{{If the child is temporary, the child specification is deleted as soon as the process terminates. This means that delete_child/2 has no meaning, and restart_child/2 can not be used for these children.}}

However, in our code we sometimes explicitly delete the child:

{code}
cancel_replication({BaseId, Extension}) ->
    ...
    case supervisor:terminate_child(couch_replicator_job_sup, FullRepId) of
    ok ->
        ...
        case supervisor:delete_child(couch_replicator_job_sup, FullRepId) of
        ok ->
            {ok, {cancelled, ?l2b(FullRepId)}};
        ...
{code}

That would make it seem as if the supervisor auto-deleted the child spec in some cases. To test that it doesn't, start a normal replication (not a continuous one) and then, after it has finished, inspect the state of {{couch_replicator_job_sup}}.

An example of the supervisor state after 10 replications have finished on a cluster:

{code}
{state,
 {local,couch_replicator_job_sup},
 one_for_one,
 [{child,undefined,"ac35738f5003c02b6780116fdf04b524",
   {gen_server,start_link,
    [couch_replicator,
     {rep,
      {"ac35738f5003c02b6780116fdf04b524",[]},
      {httpdb,"http://adm:pass@localhost:5984/rdyno_src_0001/",
       nil,
       [{"Accept","application/json"},
        {"User-Agent","CouchDB-Replicator/5fa9098"}],
       20,
       [{socket_options,[{keepalive,true},{nodelay,false}]}],
       1,250,nil,1},
      {httpdb,"http://adm:pass@localhost:5984/rdyno_tgt_0009/",
       nil,
       [{"Accept","application/json"},
        {"User-Agent","CouchDB-Replicator/5fa9098"}],
       20,
       [{socket_options,[{keepalive,true},{nodelay,false}]}],
       1,250,nil,1},
      [{checkpoint_interval,5000},
       {connection_timeout,20},
       {continuous,false},
       {http_connections,1},
       {retries,1},
       {socket_options,[{keepalive,true},{nodelay,false}]},
       {use_checkpoints,true},
       {worker_batch_size,500},
       {worker_processes,1}],
      {user_ctx,null,[],undefined},
      db,nil,
      <<"rdyno_0001"...(15 B)>>,
      <<"shards/a00"...(47 B)>>},
     [{timeout,20}]]},
   transient,250,worker,
   [couch_replicator]},
  {child,undefined,"6c48c1ab7a6e3ed5e3d4415ced912e4a",
   {gen_server,start_link,
    [couch_replicator,
     {rep,
      {"6c48c1ab7a6e3ed5e3d4415ced912e4a",[]},
      {httpdb,"http://adm:pass@localhost:5984/rdyno_src_0001/",
       nil,
       [{"Accept","application/json"},
        {"User-Agent","CouchDB-Replicator/5fa9098"}],
       20,
       [{socket_options,[{keepalive,true},{nodelay,false}]}],
       1,250,nil,1},
      {httpdb,"http://adm:pass@localhost:5984/rdyno_tgt_0002/",
       nil,
       [{"Accept","application/json"},
        {"User-Agent","CouchDB-Replicator/5fa9098"}],
       20,
       [{socket_options,[{keepalive,true},{nodelay,false}]}],
       1,250,nil,1},
      [{checkpoint_interval,5000},
       {connection_timeout,20},
       {continuous,false},
       {http_connections,1},
       {retries,1},
       {socket_options,[{keepalive,true},{nodelay,false}]},
       {use_checkpoints,true},
       {worker_batch_size,500},
       {worker_processes,1}],
      {user_ctx,null,[],undefined},
      db,nil,
      <<"rdyno_0001"...(15 B)>>,
      <<"shards/200"...(47 B)>>},
     [{timeout,20}]]},
   transient,250,worker,
   [couch_replicator]}],
 undefined,100,1,[],couch_replicator_job_sup,[]}
{code}

> Automatically restart replication jobs if they crash
>
> Key: COUCHDB-2975
> URL: https://issues.apache.org/jira/browse/COUCHDB-2975
> Project: CouchDB
> Issue Type: Improvement
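The difference between temporary and transient children described above can be reproduced outside CouchDB. What follows is only a minimal sketch using the stock OTP {{supervisor}} API; the module name {{child_spec_demo}}, the child ids {{temp_job}} and {{trans_job}}, and the stand-in job are hypothetical and are not part of couch_replicator.

```erlang
-module(child_spec_demo).
-behaviour(supervisor).
-export([run/0, init/1, job/0]).

%% Supervisor callback: one_for_one, no static children.
%% The intensity/period values (10, 1) are illustrative only.
init([]) ->
    {ok, {{one_for_one, 10, 1}, []}}.

%% Hypothetical stand-in for a one-shot (non-continuous) replication:
%% the process finishes immediately with exit reason `normal`.
job() ->
    {ok, spawn_link(fun() -> ok end)}.

%% Old-style tuple child spec, same shape as in the state dump above.
spec(Id, Restart) ->
    {Id, {?MODULE, job, []}, Restart, 250, worker, [?MODULE]}.

run() ->
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    {ok, _} = supervisor:start_child(Sup, spec(temp_job, temporary)),
    {ok, _} = supervisor:start_child(Sup, spec(trans_job, transient)),
    %% Crude wait for both jobs to exit and for the supervisor to notice.
    timer:sleep(100),
    %% The temporary spec is gone; the transient spec is still listed.
    Lingering = [Id || {Id, _, _, _} <- supervisor:which_children(Sup)],
    %% The transient spec has to be removed explicitly.
    Deleted = supervisor:delete_child(Sup, trans_job),
    Remaining = supervisor:which_children(Sup),
    %% Stop the demo supervisor without taking the caller down with it.
    unlink(Sup),
    exit(Sup, shutdown),
    {Lingering, Deleted, Remaining}.
```

Calling {{child_spec_demo:run()}} should return {{{[trans_job], ok, []}}}: only the transient spec lingers after a normal exit, and it disappears once {{delete_child/2}} is called.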
[jira] [Commented] (COUCHDB-2975) Automatically restart replication jobs if they crash
[ https://issues.apache.org/jira/browse/COUCHDB-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211101#comment-15211101 ]

Nick Vatamaniuc commented on COUCHDB-2975:

We might have to increase the intensity threshold. One common use case that will trigger it is replication from one source to multiple targets: if the source fails, all the replications fail as well.

Tested it with 1 source and 200 targets. Then killed the source and noticed the supervisors were restarted:

(node1@127.0.0.1)4> rpc:multicall(erlang, whereis, [couch_replicator_job_sup]).
{[<0.352.0>,<26873.355.0>,<26910.354.0>],[]}    % before deleting source

(node1@127.0.0.1)5> rpc:multicall(erlang, whereis, [couch_replicator_job_sup]).
{[<0.5617.4>,<26873.7071.3>,<26910.8924.3>],[]}    % after deleting source

Saw that we already have some protection against repeated failed replication restarts in the form of the "max_replication_retry_count" parameter. By default it is 10, so 10 failed starts of a particular replication will cancel that replication. Once it starts successfully, the number of remaining retries is reset back to the maximum (10).

Another thing: noticed that replications will restart even without {{transient}} supervisors if they are killed with an exit reason other than 'kill' (brutal kill). So if the goal is just to restart them, sending them exit(Pid, meh) should suffice.

> Automatically restart replication jobs if they crash
>
> Key: COUCHDB-2975
> URL: https://issues.apache.org/jira/browse/COUCHDB-2975
> Project: CouchDB
> Issue Type: Improvement
> Components: Replication
> Reporter: Robert Newson
>
> We currently use the temporary restart strategy for replication jobs, which means if they crash they are not restarted.
> Instead, let's use the transient restart strategy, ensuring they are restarted on abnormal termination, while still allowing these tasks to end successfully on completion or cancellation.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
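To illustrate why the supervisor pids changed in the test above, here is a rough sketch of a supervisor exceeding its restart intensity. It assumes nothing about couch_replicator itself: the module {{intensity_demo}}, the child id {{bad_job}}, the exit reason {{source_gone}}, and the limit of 3 restarts per second are all made up for the example.

```erlang
-module(intensity_demo).
-behaviour(supervisor).
-export([run/0, init/1, crashing_job/0]).

%% Allow at most 3 restarts within 1 second (illustrative values only).
init([]) ->
    {ok, {{one_for_one, 3, 1}, []}}.

%% Hypothetical job that starts fine and then crashes straight away,
%% much like a replication whose source has just disappeared.
crashing_job() ->
    {ok, spawn_link(fun() -> exit(source_gone) end)}.

run() ->
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    %% Unlink so the supervisor's shutdown does not kill the caller.
    unlink(Sup),
    Ref = erlang:monitor(process, Sup),
    Spec = {bad_job, {?MODULE, crashing_job, []},
            transient, 250, worker, [?MODULE]},
    _ = supervisor:start_child(Sup, Spec),
    %% The child crashes and is restarted until the limit is exceeded,
    %% at which point the supervisor itself gives up and exits.
    receive
        {'DOWN', Ref, process, Sup, Reason} ->
            {supervisor_exited, Reason}
    after 5000 ->
        supervisor_still_running
    end.
```

Calling {{intensity_demo:run()}} should return {{{supervisor_exited, shutdown}}}. A supervisor that is itself supervised would then be restarted by its parent under a new pid, which is consistent with the changed pids in the {{rpc:multicall}} output.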
[jira] [Commented] (COUCHDB-2975) Automatically restart replication jobs if they crash
[ https://issues.apache.org/jira/browse/COUCHDB-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210728#comment-15210728 ]

ASF subversion and git services commented on COUCHDB-2975:

Commit eb93044ca4aa02ab4427ec7082df37bae6602973 in couchdb-couch-replicator's branch refs/heads/master from [~rnewson]
[ https://git-wip-us.apache.org/repos/asf?p=couchdb-couch-replicator.git;h=eb93044 ]

Use transient restart type for all replications

We want replication tasks to be restarted automatically if they crash abnormally. Replication tasks that complete or are cancelled (by deleting the backing _replicator doc or issuing an "cancel":true for non-persistent jobs) should still exit, should not be restarted, and should not have their child spec linger in the supervisor.

COUCHDB-2975
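The behaviour this commit message asks for (restart on abnormal exit, no restart on normal exit) is what a transient child gives you in plain OTP. The following is a small sketch under that assumption; {{transient_restart_demo}}, {{demo_job}}, and the {{stop}} message are hypothetical names and do not come from the commit.

```erlang
-module(transient_restart_demo).
-behaviour(supervisor).
-export([run/0, init/1, job/0]).

%% Intensity/period values are illustrative only.
init([]) ->
    {ok, {{one_for_one, 10, 1}, []}}.

%% Hypothetical long-running job: runs until it is told to stop,
%% then exits with reason `normal`.
job() ->
    {ok, spawn_link(fun() -> receive stop -> ok end end)}.

run() ->
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    Spec = {demo_job, {?MODULE, job, []},
            transient, 250, worker, [?MODULE]},
    {ok, Pid1} = supervisor:start_child(Sup, Spec),
    %% Abnormal exit: the supervisor should start a replacement.
    exit(Pid1, meh),
    timer:sleep(100),
    [{demo_job, Pid2, worker, _}] = supervisor:which_children(Sup),
    Restarted = is_pid(Pid2) andalso Pid2 =/= Pid1,
    %% Normal exit: the supervisor should leave the job stopped.
    Pid2 ! stop,
    timer:sleep(100),
    [{demo_job, AfterNormal, worker, _}] = supervisor:which_children(Sup),
    %% Stop the demo supervisor without taking the caller down with it.
    unlink(Sup),
    exit(Sup, shutdown),
    {Restarted, AfterNormal}.
```

Calling {{transient_restart_demo:run()}} should return {{{true, undefined}}}: the job killed with reason {{meh}} is replaced by a new process, while the job that stops normally is not restarted.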
[jira] [Commented] (COUCHDB-2975) Automatically restart replication jobs if they crash
[ https://issues.apache.org/jira/browse/COUCHDB-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210727#comment-15210727 ]

ASF subversion and git services commented on COUCHDB-2975:

Commit 73afc584bd10f68626d2049442b5a6058ff002db in couchdb-couch-replicator's branch refs/heads/master from [~rnewson]
[ https://git-wip-us.apache.org/repos/asf?p=couchdb-couch-replicator.git;h=73afc58 ]

Remove obsoleted R14-era code

We no longer support R14 so we're dropping R14-specific complications in the codebase.

COUCHDB-2975
[jira] [Commented] (COUCHDB-2975) Automatically restart replication jobs if they crash
[ https://issues.apache.org/jira/browse/COUCHDB-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210730#comment-15210730 ]

ASF GitHub Bot commented on COUCHDB-2975:

Github user asfgit closed the pull request at:

    https://github.com/apache/couchdb-couch-replicator/pull/33
[jira] [Commented] (COUCHDB-2975) Automatically restart replication jobs if they crash
[ https://issues.apache.org/jira/browse/COUCHDB-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210729#comment-15210729 ]

ASF subversion and git services commented on COUCHDB-2975:

Commit 4cb517659c235c06a39ee7eb6b4150cdfded6116 in couchdb-couch-replicator's branch refs/heads/master from [~rnewson]
[ https://git-wip-us.apache.org/repos/asf?p=couchdb-couch-replicator.git;h=4cb5176 ]

Reduce likelihood of a bad replication job taking down the job supervisor

While we can't disable max_restart_intensity, we can make it unlikely to happen. Ordinarily, we would want this behaviour, but replication jobs involve human input. A bad password, or malformed url, etc, can cause repeated and fast crashing.

For now, we require ten crashes within one second before we would bounce the job supervisor. In future, we should manage replication jobs with greater care.

COUCHDB-2975
[jira] [Commented] (COUCHDB-2975) Automatically restart replication jobs if they crash
[ https://issues.apache.org/jira/browse/COUCHDB-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210238#comment-15210238 ]

ASF GitHub Bot commented on COUCHDB-2975:

GitHub user rnewson opened a pull request:

    https://github.com/apache/couchdb-couch-replicator/pull/33

    restart replications on crash

    COUCHDB-2975

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloudant/couchdb-couch-replicator 2975-restart-replications-on-crash

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/couchdb-couch-replicator/pull/33.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #33

commit 73afc584bd10f68626d2049442b5a6058ff002db
Author: Robert Newson
Date: 2016-03-24T11:26:08Z

    Remove obsoleted R14-era code

    We no longer support R14 so we're dropping R14-specific complications in the codebase.

    COUCHDB-2975

commit eb93044ca4aa02ab4427ec7082df37bae6602973
Author: Robert Newson
Date: 2016-03-24T11:29:10Z

    Use transient restart type for all replications

    We want replication tasks to be restarted automatically if they crash abnormally. Replication tasks that complete or are cancelled (by deleting the backing _replicator doc or issuing an "cancel":true for non-persistent jobs) should still exit, should not be restarted, and should not have their child spec linger in the supervisor.

    COUCHDB-2975

commit 4cb517659c235c06a39ee7eb6b4150cdfded6116
Author: Robert Newson
Date: 2016-03-24T13:40:14Z

    Reduce likelihood of a bad replication job taking down the job supervisor

    While we can't disable max_restart_intensity, we can make it unlikely to happen. Ordinarily, we would want this behaviour, but replication jobs involve human input. A bad password, or malformed url, etc, can cause repeated and fast crashing.

    For now, we require ten crashes within one second before we would bounce the job supervisor. In future, we should manage replication jobs with greater care.

    COUCHDB-2975