[jira] [Commented] (COUCHDB-2975) Automatically restart replication jobs if they crash

2016-03-25 Thread Nick Vatamaniuc (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211933#comment-15211933
 ] 

Nick Vatamaniuc commented on COUCHDB-2975:
--

Noticed that with the transient restart type the supervisor does not clean up
child specs after the child is done, even if the exit is normal. The intent
behind that is to let users restart children.

From the Erlang docs: {{If the child is temporary, the child specification
is deleted as soon as the process terminates. This means that delete_child/2
has no meaning, and restart_child/2 can not be used for these children.}}

However, in our code we sometimes explicitly delete the child:

{code}
cancel_replication({BaseId, Extension}) ->
    ...
    case supervisor:terminate_child(couch_replicator_job_sup, FullRepId) of
    ok ->
        ...
        case supervisor:delete_child(couch_replicator_job_sup, FullRepId) of
        ok ->
            {ok, {cancelled, ?l2b(FullRepId)}};
        ...
{code}

That would make it seem as if the supervisor auto-deleted the child spec in
some cases. To test that it doesn't, start a normal (non-continuous)
replication and, after it has finished, inspect the state of
{{couch_replicator_job_sup}}.

An example of the supervisor state after 10 replications have finished on a
cluster:

{code}
{state,
{local,couch_replicator_job_sup},
one_for_one,
[{child,undefined,"ac35738f5003c02b6780116fdf04b524",
 {gen_server,start_link,
 [couch_replicator,
  {rep,
  {"ac35738f5003c02b6780116fdf04b524",[]},
  {httpdb,"http://adm:pass@localhost:5984/rdyno_src_0001/",
  nil,
  [{"Accept","application/json"},
   {"User-Agent","CouchDB-Replicator/5fa9098"}],
  20,
  [{socket_options,[{keepalive,true},{nodelay,false}]}],
  1,250,nil,1},
  {httpdb,"http://adm:pass@localhost:5984/rdyno_tgt_0009/",
  nil,
  [{"Accept","application/json"},
   {"User-Agent","CouchDB-Replicator/5fa9098"}],
  20,
  [{socket_options,[{keepalive,true},{nodelay,false}]}],
  1,250,nil,1},
  [{checkpoint_interval,5000},
   {connection_timeout,20},
   {continuous,false},
   {http_connections,1},
   {retries,1},
   {socket_options,[{keepalive,true},{nodelay,false}]},
   {use_checkpoints,true},
   {worker_batch_size,500},
   {worker_processes,1}],
  {user_ctx,null,[],undefined},
  db,nil,
  <<"rdyno_0001"...(15 B)>>,
  <<"shards/a00"...(47 B)>>},
  [{timeout,20}]]},
 transient,250,worker,
 [couch_replicator]},
 {child,undefined,"6c48c1ab7a6e3ed5e3d4415ced912e4a",
 {gen_server,start_link,
 [couch_replicator,
  {rep,
  {"6c48c1ab7a6e3ed5e3d4415ced912e4a",[]},
  {httpdb,"http://adm:pass@localhost:5984/rdyno_src_0001/",
  nil,
  [{"Accept","application/json"},
   {"User-Agent","CouchDB-Replicator/5fa9098"}],
  20,
  [{socket_options,[{keepalive,true},{nodelay,false}]}],
  1,250,nil,1},
  {httpdb,"http://adm:pass@localhost:5984/rdyno_tgt_0002/",
  nil,
  [{"Accept","application/json"},
   {"User-Agent","CouchDB-Replicator/5fa9098"}],
  20,
  [{socket_options,[{keepalive,true},{nodelay,false}]}],
  1,250,nil,1},
  [{checkpoint_interval,5000},
   {connection_timeout,20},
   {continuous,false},
   {http_connections,1},
   {retries,1},
   {socket_options,[{keepalive,true},{nodelay,false}]},
   {use_checkpoints,true},
   {worker_batch_size,500},
   {worker_processes,1}],
  {user_ctx,null,[],undefined},
  db,nil,
  <<"rdyno_0001"...(15 B)>>,
  <<"shards/200"...(47 B)>>},
  [{timeout,20}]]},
 transient,250,worker,
 [couch_replicator]}],
undefined,100,1,[],couch_replicator_job_sup,[]}
{code}
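
The difference between the two restart types can be reproduced outside
CouchDB. What follows is only a minimal sketch, not the replicator code: the
module name restart_type_demo, the child ids temp_job and trans_job, and the
short-lived child are all made up for illustration, and the 100 ms sleep is a
crude way to let the supervisor process the exit before it is queried.

```erlang
-module(restart_type_demo).
-behaviour(supervisor).
-export([run/0, init/1, short_lived/0]).

%% Hypothetical stand-in for couch_replicator_job_sup: an empty
%% one_for_one supervisor whose children are added dynamically.
init([]) ->
    {ok, {{one_for_one, 10, 1}, []}}.

%% Child start function: a linked process that exits normally at once,
%% standing in for a non-continuous replication that has completed.
short_lived() ->
    {ok, spawn_link(fun() -> ok end)}.

%% Start a child with the given restart type, wait for it to exit and
%% return the ids of the child specs the supervisor still holds.
remaining(Sup, Id, Restart) ->
    Spec = {Id, {?MODULE, short_lived, []}, Restart, 250, worker, [?MODULE]},
    {ok, Pid} = supervisor:start_child(Sup, Spec),
    Ref = erlang:monitor(process, Pid),
    receive {'DOWN', Ref, process, Pid, _} -> ok end,
    timer:sleep(100), % crude: let the supervisor handle the exit signal
    [ChildId || {ChildId, _, _, _} <- supervisor:which_children(Sup)].

run() ->
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    AfterTemporary = remaining(Sup, temp_job, temporary),
    AfterTransient = remaining(Sup, trans_job, transient),
    %% The transient spec stays until it is removed explicitly.
    ok = supervisor:delete_child(Sup, trans_job),
    AfterDelete = [Id || {Id, _, _, _} <- supervisor:which_children(Sup)],
    unlink(Sup),
    exit(Sup, shutdown),
    {AfterTemporary, AfterTransient, AfterDelete}.
```

Calling restart_type_demo:run() should show no spec left after the temporary
child exits, the trans_job spec still present after the transient child exits
normally, and an empty list again once delete_child/2 has been called.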



> Automatically restart replication jobs if they crash
> 
>
> Key: COUCHDB-2975
> URL: https://issues.apache.org/jira/browse/COUCHDB-2975
> Project: CouchDB
>  Issue Type: Improvement
>  

[jira] [Commented] (COUCHDB-2975) Automatically restart replication jobs if they crash

2016-03-24 Thread Nick Vatamaniuc (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15211101#comment-15211101
 ] 

Nick Vatamaniuc commented on COUCHDB-2975:
--

We might have to increase the restart intensity threshold. One common use case
that will trigger it is replication from one source to multiple targets: if the
source fails, all replications fail as well. Tested it with 1 source and 200
targets, then killed the source and noticed that the job supervisors were
restarted:

(node1@127.0.0.1)4> rpc:multicall(erlang, whereis, [couch_replicator_job_sup]).
{[<0.352.0>,<26873.355.0>,<26910.354.0>],[]} % before deleting source
(node1@127.0.0.1)5> rpc:multicall(erlang, whereis, [couch_replicator_job_sup]).
{[<0.5617.4>,<26873.7071.3>,<26910.8924.3>],[]} % after deleting source
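
For reference, the way a supervisor gives up once the restart intensity is
exceeded can be sketched in isolation. This is an illustrative sketch only:
the module name intensity_demo, the crasher child and the exit reason
source_gone are invented, and the limit of 2 restarts in 5 seconds is chosen
to keep the demo short, not taken from the replicator.

```erlang
-module(intensity_demo).
-behaviour(supervisor).
-export([run/0, init/1, crasher/0]).

%% One transient child; at most 2 restarts are allowed within 5 seconds
%% (demo values, not the replicator's settings).
init([]) ->
    Child = {crasher, {?MODULE, crasher, []}, transient, 250, worker, [?MODULE]},
    {ok, {{one_for_one, 2, 5}, [Child]}}.

%% Child that always crashes shortly after starting, standing in for a
%% replication job whose source has gone away.
crasher() ->
    {ok, spawn_link(fun() -> timer:sleep(50), exit(source_gone) end)}.

%% Run the supervisor under a throwaway owner process and report why the
%% owner (and therefore the supervisor linked to it) went down.
run() ->
    {Pid, Ref} = spawn_monitor(fun() ->
        {ok, _Sup} = supervisor:start_link(?MODULE, []),
        receive stop -> ok end
    end),
    receive
        {'DOWN', Ref, process, Pid, Reason} -> Reason
    after 5000 ->
        exit(Pid, kill),
        timeout
    end.
```

The third crash exceeds the allowed two restarts, so the supervisor terminates
itself and intensity_demo:run() should return shutdown. In a real system the
parent supervisor then restarts it, which would match the changed pids above.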

Saw that we already have some protection against repeated failed replication
restarts in the "max_replication_retry_count" parameter. By default it is 10,
so 10 failed starts of a particular replication will cancel that replication.
Once it starts successfully, the number of remaining retries is reset back to
the maximum (10).

Another thing: noticed that replications will restart even without the
{{transient}} restart type if they are killed with an exit reason other than
'kill' (brutal kill). So if the goal is just to restart them, sending them
exit(Pid, meh) should suffice.

> Automatically restart replication jobs if they crash
> 
>
> Key: COUCHDB-2975
> URL: https://issues.apache.org/jira/browse/COUCHDB-2975
> Project: CouchDB
>  Issue Type: Improvement
>  Components: Replication
>Reporter: Robert Newson
>
> We currently use the temporary restart strategy for replication jobs, which 
> means if they crash they are not restarted.
> Instead, let's use the transient restart strategy, ensuring they are 
> restarted on abnormal termination, while still allowing these tasks to end 
> successfully on completion or cancellation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (COUCHDB-2975) Automatically restart replication jobs if they crash

2016-03-24 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210728#comment-15210728
 ] 

ASF subversion and git services commented on COUCHDB-2975:
--

Commit eb93044ca4aa02ab4427ec7082df37bae6602973 in couchdb-couch-replicator's 
branch refs/heads/master from [~rnewson]
[ 
https://git-wip-us.apache.org/repos/asf?p=couchdb-couch-replicator.git;h=eb93044
 ]

Use transient restart type for all replications

We want replication tasks to be restarted automatically if they crash
abnormally. Replication tasks that complete or are cancelled (by
deleting the backing _replicator doc or issuing a "cancel":true for
non-persistent jobs) should still exit, should not be restarted, and
should not have their child spec linger in the supervisor.

COUCHDB-2975




[jira] [Commented] (COUCHDB-2975) Automatically restart replication jobs if they crash

2016-03-24 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210727#comment-15210727
 ] 

ASF subversion and git services commented on COUCHDB-2975:
--

Commit 73afc584bd10f68626d2049442b5a6058ff002db in couchdb-couch-replicator's 
branch refs/heads/master from [~rnewson]
[ 
https://git-wip-us.apache.org/repos/asf?p=couchdb-couch-replicator.git;h=73afc58
 ]

Remove obsoleted R14-era code

We no longer support R14 so we're dropping R14-specific complications
in the codebase.

COUCHDB-2975




[jira] [Commented] (COUCHDB-2975) Automatically restart replication jobs if they crash

2016-03-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210730#comment-15210730
 ] 

ASF GitHub Bot commented on COUCHDB-2975:
-

Github user asfgit closed the pull request at:

https://github.com/apache/couchdb-couch-replicator/pull/33




[jira] [Commented] (COUCHDB-2975) Automatically restart replication jobs if they crash

2016-03-24 Thread ASF subversion and git services (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210729#comment-15210729
 ] 

ASF subversion and git services commented on COUCHDB-2975:
--

Commit 4cb517659c235c06a39ee7eb6b4150cdfded6116 in couchdb-couch-replicator's 
branch refs/heads/master from [~rnewson]
[ 
https://git-wip-us.apache.org/repos/asf?p=couchdb-couch-replicator.git;h=4cb5176
 ]

Reduce likelihood of a bad replication job taking down the job supervisor

While we can't disable max_restart_intensity, we can make it unlikely
to happen. Ordinarily, we would want this behaviour, but replication
jobs involve human input. A bad password, or malformed url, etc, can
cause repeated and fast crashing.

For now, we require ten crashes within one second before we would
bounce the job supervisor. In future, we should manage replication
jobs with greater care.

COUCHDB-2975




[jira] [Commented] (COUCHDB-2975) Automatically restart replication jobs if they crash

2016-03-24 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/COUCHDB-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210238#comment-15210238
 ] 

ASF GitHub Bot commented on COUCHDB-2975:
-

GitHub user rnewson opened a pull request:

https://github.com/apache/couchdb-couch-replicator/pull/33

restart replications on crash

COUCHDB-2975 

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloudant/couchdb-couch-replicator 2975-restart-replications-on-crash

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/couchdb-couch-replicator/pull/33.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #33


commit 73afc584bd10f68626d2049442b5a6058ff002db
Author: Robert Newson 
Date:   2016-03-24T11:26:08Z

Remove obsoleted R14-era code

We no longer support R14 so we're dropping R14-specific complications
in the codebase.

COUCHDB-2975

commit eb93044ca4aa02ab4427ec7082df37bae6602973
Author: Robert Newson 
Date:   2016-03-24T11:29:10Z

Use transient restart type for all replications

We want replication tasks to be restarted automatically if they crash
abnormally. Replication tasks that complete or are cancelled (by
deleting the backing _replicator doc or issuing a "cancel":true for
non-persistent jobs) should still exit, should not be restarted, and
should not have their child spec linger in the supervisor.

COUCHDB-2975

commit 4cb517659c235c06a39ee7eb6b4150cdfded6116
Author: Robert Newson 
Date:   2016-03-24T13:40:14Z

Reduce likelihood of a bad replication job taking down the job supervisor

While we can't disable max_restart_intensity, we can make it unlikely
to happen. Ordinarily, we would want this behaviour, but replication
jobs involve human input. A bad password, or malformed url, etc, can
cause repeated and fast crashing.

For now, we require ten crashes within one second before we would
bounce the job supervisor. In future, we should manage replication
jobs with greater care.

COUCHDB-2975



