[jira] [Commented] (COUCHDB-2975) Automatically restart replication jobs if they crash

Nick Vatamaniuc (JIRA) Fri, 25 Mar 2016 08:22:04 -0700

    [ 
https://issues.apache.org/jira/browse/COUCHDB-2975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15211933#comment-15211933
 ]


Nick Vatamaniuc commented on COUCHDB-2975:
------------------------------------------

I noticed that transient mode does not clean up child specs after the child 
is done, even if the exit is normal. The intent behind that is to let users 
restart children.

From the Erlang docs: {{If the child is temporary, the child specification 
is deleted as soon as the process terminates. This means that delete_child/2 
has no meaning, and restart_child/2 can not be used for these children.}}
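
To make the difference concrete, here is a minimal, self-contained sketch. It 
is not CouchDB code: the module {{restart_demo}}, its worker and the child ids 
are made up for illustration. It starts one transient and one temporary child 
that both exit normally, then checks which specs the supervisor still holds.

{code}
-module(restart_demo).
-behaviour(supervisor).
-export([run/0, init/1, start_worker/0]).

%% Hypothetical worker: linked to the supervisor, exits with reason 'normal'.
start_worker() ->
    {ok, spawn_link(fun() -> ok end)}.

init([]) ->
    %% Same old-style tuple child spec shape as in the state dump (shutdown 250).
    Spec = fun(Id, Restart) ->
        {Id, {?MODULE, start_worker, []}, Restart, 250, worker, [?MODULE]}
    end,
    {ok, {{one_for_one, 1, 5},
          [Spec(transient_job, transient), Spec(temporary_job, temporary)]}}.

run() ->
    {ok, Sup} = supervisor:start_link(?MODULE, []),
    timer:sleep(100),  % give both workers time to exit
    %% Only the transient spec is left, with an undefined pid.
    [{transient_job, undefined, worker, [?MODULE]}] =
        supervisor:which_children(Sup),
    %% The temporary spec is already gone, so there is nothing to delete.
    {error, not_found} = supervisor:delete_child(Sup, temporary_job),
    %% The transient spec has to be deleted explicitly.
    ok = supervisor:delete_child(Sup, transient_job),
    [] = supervisor:which_children(Sup),
    ok.
{code}

Calling {{restart_demo:run()}} from a shell should return {{ok}} if the 
supervisor behaves as described. The supervisor stays linked to the calling 
process afterwards.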

However, in our code we sometimes explicitly delete the child:

{code}
cancel_replication({BaseId, Extension}) ->
    ...
    case supervisor:terminate_child(couch_replicator_job_sup, FullRepId) of
    ok ->
        ...
        case supervisor:delete_child(couch_replicator_job_sup, FullRepId) of
            ok ->
                {ok, {cancelled, ?l2b(FullRepId)}};
              ...
{code}

That would make it seem as if the supervisor auto-deleted the child spec in 
some cases. To test that it doesn't, start a normal replication (not a 
continuous one) and, after it has finished, inspect the state of 
{{couch_replicator_job_sup}}.
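
One possible way to run that check from a remote shell on the node is sketched 
below. It assumes only that the supervisor is registered locally under that 
name; the variable {{Leftover}} is made up for illustration.

{code}
%% Ids whose process is gone but whose spec the supervisor still holds.
Leftover = [Id || {Id, undefined, _Type, _Mods}
                      <- supervisor:which_children(couch_replicator_job_sup)].

%% 'specs' being greater than 'active' also points at lingering specs.
supervisor:count_children(couch_replicator_job_sup).

%% Full internal state; the record layout is OTP-internal and may change.
sys:get_state(couch_replicator_job_sup).
{code}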

An example of the supervisor state after 10 replications have finished on a 
cluster:

{code}
{state,
    {local,couch_replicator_job_sup},
    one_for_one,
    [{child,undefined,"ac35738f5003c02b6780116fdf04b524",
         {gen_server,start_link,
             [couch_replicator,
              {rep,
                  {"ac35738f5003c02b6780116fdf04b524",[]},
                  {httpdb,"http://adm:pass@localhost:5984/rdyno_src_0001/",
                      nil,
                      [{"Accept","application/json"},
                       {"User-Agent","CouchDB-Replicator/5fa9098"}],
                      200000,
                      [{socket_options,[{keepalive,true},{nodelay,false}]}],
                      1,250,nil,1},
                  {httpdb,"http://adm:pass@localhost:5984/rdyno_tgt_0009/",
                      nil,
                      [{"Accept","application/json"},
                       {"User-Agent","CouchDB-Replicator/5fa9098"}],
                      200000,
                      [{socket_options,[{keepalive,true},{nodelay,false}]}],
                      1,250,nil,1},
                  [{checkpoint_interval,5000},
                   {connection_timeout,200000},
                   {continuous,false},
                   {http_connections,1},
                   {retries,1},
                   {socket_options,[{keepalive,true},{nodelay,false}]},
                   {use_checkpoints,true},
                   {worker_batch_size,500},
                   {worker_processes,1}],
                  {user_ctx,null,[],undefined},
                  db,nil,
                  <<"rdyno_0001"...(15 B)>>,
                  <<"shards/a00"...(47 B)>>},
              [{timeout,200000}]]},
         transient,250,worker,
         [couch_replicator]},
     {child,undefined,"6c48c1ab7a6e3ed5e3d4415ced912e4a",
         {gen_server,start_link,
             [couch_replicator,
              {rep,
                  {"6c48c1ab7a6e3ed5e3d4415ced912e4a",[]},
                  {httpdb,"http://adm:pass@localhost:5984/rdyno_src_0001/",
                      nil,
                      [{"Accept","application/json"},
                       {"User-Agent","CouchDB-Replicator/5fa9098"}],
                      200000,
                      [{socket_options,[{keepalive,true},{nodelay,false}]}],
                      1,250,nil,1},
                  {httpdb,"http://adm:pass@localhost:5984/rdyno_tgt_0002/",
                      nil,
                      [{"Accept","application/json"},
                       {"User-Agent","CouchDB-Replicator/5fa9098"}],
                      200000,
                      [{socket_options,[{keepalive,true},{nodelay,false}]}],
                      1,250,nil,1},
                  [{checkpoint_interval,5000},
                   {connection_timeout,200000},
                   {continuous,false},
                   {http_connections,1},
                   {retries,1},
                   {socket_options,[{keepalive,true},{nodelay,false}]},
                   {use_checkpoints,true},
                   {worker_batch_size,500},
                   {worker_processes,1}],
                  {user_ctx,null,[],undefined},
                  db,nil,
                  <<"rdyno_0001"...(15 B)>>,
                  <<"shards/200"...(47 B)>>},
              [{timeout,200000}]]},
         transient,250,worker,
         [couch_replicator]}],
    undefined,100,1,[],couch_replicator_job_sup,[]}
{code}



> Automatically restart replication jobs if they crash
> ----------------------------------------------------
>
>                 Key: COUCHDB-2975
>                 URL: https://issues.apache.org/jira/browse/COUCHDB-2975
>             Project: CouchDB
>          Issue Type: Improvement
>          Components: Replication
>            Reporter: Robert Newson
>
> We currently use the temporary restart strategy for replication jobs, which 
> means if they crash they are not restarted.
> Instead, let's use the transient restart strategy, ensuring they are 
> restarted on abnormal termination, while still allowing these tasks to end 
> successfully on completion or cancellation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)