[jira] [Commented] (AURORA-1789) Incorrect --mesos_containerizer_path value results in thermos failure loop

2016-10-12 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570354#comment-15570354
 ] 

Zameer Manji commented on AURORA-1789:
--

I have updated the title and assignee to reflect reality. Thanks for 
investigating and self-serving this, [~jpinkul]!

> Incorrect --mesos_containerizer_path value results in thermos failure loop
> --
>
> Key: AURORA-1789
> URL: https://issues.apache.org/jira/browse/AURORA-1789
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Justin Pinkul
>
> When using the Mesos containerizer with the namespaces/pid isolator and a Docker 
> image, the Thermos executor is unable to launch processes. The executor forks 
> the process but is then unable to locate it after the fork.
> {code:title=thermos_runner.INFO}
> I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789782.842882)
> I1006 21:37:22.931456 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1144] completed.
> I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789842.935872)
> I1006 21:38:23.025332 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1157] completed.
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789903.029694)
> I1006 21:39:23.118841 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1170] completed.
> I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789963.123206)
> I1006 21:40:23.212711 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1183] completed.
> I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790023.21709)
> I1006 21:41:23.307157 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1196] completed.
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790083.311512)
> I1006 21:42:23.399893 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1209] completed.
> I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1789) Incorrect --mesos_containerizer_path value results in thermos failure loop

2016-10-12 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1789:
-
Summary: Incorrect --mesos_containerizer_path value results in thermos 
failure loop  (was: namespaces/pid isolator causes lost process)

> Incorrect --mesos_containerizer_path value results in thermos failure loop
> --
>
> Key: AURORA-1789
> URL: https://issues.apache.org/jira/browse/AURORA-1789
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Zameer Manji
>
> When using the Mesos containerizer with the namespaces/pid isolator and a Docker 
> image, the Thermos executor is unable to launch processes. The executor forks 
> the process but is then unable to locate it after the fork.
> {code:title=thermos_runner.INFO}
> I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789782.842882)
> I1006 21:37:22.931456 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1144] completed.
> I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789842.935872)
> I1006 21:38:23.025332 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1157] completed.
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789903.029694)
> I1006 21:39:23.118841 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1170] completed.
> I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789963.123206)
> I1006 21:40:23.212711 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1183] completed.
> I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790023.21709)
> I1006 21:41:23.307157 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1196] completed.
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790083.311512)
> I1006 21:42:23.399893 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1209] completed.
> I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> {code}





[jira] [Commented] (AURORA-1785) Populate curator latches with scheduler information

2016-10-12 Thread John Sirois (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570282#comment-15570282
 ] 

John Sirois commented on AURORA-1785:
-

I should have s/too much/redundant/ - in the only use case you'd get 
everything but hostname duplicated across each cat'ed node. Agreed though, it's 
at the very least harmless.

> Populate curator latches with scheduler information
> ---
>
> Key: AURORA-1785
> URL: https://issues.apache.org/jira/browse/AURORA-1785
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Jing Chen
>Priority: Minor
>  Labels: newbie
>
> If you look at the mesos ZK node for leader election you see something like 
> this:
> {noformat}
>  u'json.info_000104',
>  u'json.info_000102',
>  u'json.info_000101',
>  u'json.info_98',
>  u'json.info_97'
> {noformat}
> Each of these nodes contains data about the machine contending for 
> leadership. It is a JSON serialized {{MasterInfo}} protobuf. This means an 
> operator can inspect who is contending for leadership by checking the content 
> of the nodes.
> When you check the aurora ZK node you see something like this:
> {noformat}
>  u'_c_2884a0d3-b5b0-4445-b8d6-b271a6df6220-latch-000774',
>  u'_c_86a21335-c5a2-4bcb-b471-4ce128b67616-latch-000776',
>  u'_c_a4f8b0f7-d063-4df2-958b-7b3e6f666a95-latch-000775',
>  u'_c_120cd9da-3bc1-495b-b02f-2142fb22c0a0-latch-000784',
>  u'_c_46547c31-c5c2-4fb1-8a53-237e3cb0292f-latch-000780',
>  u'member_000781'
> {noformat}
> Only the leader node contains information. The curator latches contain no 
> information. It is not possible to figure out which machines are contending 
> for leadership purely from ZK.
> I think we should attach data to the latches like Mesos does.
> Being able to do this is invaluable for debugging issues if an extra master is 
> added to the cluster.
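The Mesos-style approach suggested above could look something like the sketch below. This is illustrative only: the {{latch_payload}} helper and its field names are invented for this example and are not Aurora's actual ServerSet schema.

```python
import json
import socket


def latch_payload(host=None, http_port=8081):
    """Build a JSON blob to store on a latch node, mirroring the Mesos
    MasterInfo convention, so an operator can identify the contender by
    reading the node's data. Field names here are illustrative."""
    return json.dumps({
        "serviceEndpoint": {"host": host or socket.gethostname(),
                            "port": http_port},
        "status": "CANDIDATE",
    }).encode("utf-8")
```

With Curator (or kazoo on the Python side) the payload would be passed as the node's data when the latch node is created, e.g. `zk.create(latch_path, latch_payload(), ephemeral=True, sequence=True)` in kazoo, so cat'ing any latch node identifies the machine contending for leadership.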





[jira] [Commented] (AURORA-1785) Populate curator latches with scheduler information

2016-10-12 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570275#comment-15570275
 ] 

Zameer Manji commented on AURORA-1785:
--

I don't think it's "too much"; it is exactly what the leader would advertise.

> Populate curator latches with scheduler information
> ---
>
> Key: AURORA-1785
> URL: https://issues.apache.org/jira/browse/AURORA-1785
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Jing Chen
>Priority: Minor
>  Labels: newbie
>
> If you look at the mesos ZK node for leader election you see something like 
> this:
> {noformat}
>  u'json.info_000104',
>  u'json.info_000102',
>  u'json.info_000101',
>  u'json.info_98',
>  u'json.info_97'
> {noformat}
> Each of these nodes contains data about the machine contending for 
> leadership. It is a JSON serialized {{MasterInfo}} protobuf. This means an 
> operator can inspect who is contending for leadership by checking the content 
> of the nodes.
> When you check the aurora ZK node you see something like this:
> {noformat}
>  u'_c_2884a0d3-b5b0-4445-b8d6-b271a6df6220-latch-000774',
>  u'_c_86a21335-c5a2-4bcb-b471-4ce128b67616-latch-000776',
>  u'_c_a4f8b0f7-d063-4df2-958b-7b3e6f666a95-latch-000775',
>  u'_c_120cd9da-3bc1-495b-b02f-2142fb22c0a0-latch-000784',
>  u'_c_46547c31-c5c2-4fb1-8a53-237e3cb0292f-latch-000780',
>  u'member_000781'
> {noformat}
> Only the leader node contains information. The curator latches contain no 
> information. It is not possible to figure out which machines are contending 
> for leadership purely from ZK.
> I think we should attach data to the latches like Mesos does.
> Being able to do this is invaluable for debugging issues if an extra master is 
> added to the cluster.





[jira] [Reopened] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2016-10-12 Thread David McLaughlin (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David McLaughlin reopened AURORA-1225:
--

Reopening this. Bugs were found in the implementation. We will submit again 
with improved testing. 

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
>
> The executor needs to start executing user content in STARTING and transition to 
> RUNNING when the required number of successful health checks is reached.
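The proposed rule amounts to a small state function. A minimal sketch (the names are invented for illustration; this is not Aurora's executor code):

```python
def next_state(consecutive_successes, min_consecutive_successes):
    """Stay in STARTING until enough consecutive health checks have
    passed, then transition to RUNNING."""
    if consecutive_successes >= min_consecutive_successes:
        return "RUNNING"
    return "STARTING"
```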





[jira] [Resolved] (AURORA-1793) Revert Commit ca683 which is not backwards compatible

2016-10-12 Thread David McLaughlin (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David McLaughlin resolved AURORA-1793.
--
Resolution: Fixed

The commits have been reverted on master. 

> Revert Commit ca683 which is not backwards compatible
> -
>
> Key: AURORA-1793
> URL: https://issues.apache.org/jira/browse/AURORA-1793
> Project: Aurora
>  Issue Type: Bug
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 is not backwards 
> compatible. We decided to revert this commit.
> The change that directly causes the problem is:
> {code}
> Modify executor state transition logic to rely on health checks (if enabled).
> commit ca683cb9e27bae76424a687bc6c3af5a73c501b9
> {code}
> There are two downstream commits that depend on the above commit:
> {code}
> Add min_consecutive_health_checks in HealthCheckConfig
> commit ed72b1bf662d1e29d2bb483b317c787630c26a9e
> {code}
> {code}
> Add support for receiving min_consecutive_successes in health checker
> commit e91130e49445c3933b6e27f5fde18c3a0e61b87a
> {code}
> We will drop all three of these commits and revert to the commit immediately 
> before the problematic one:
> {code}
> Running task ssh without an instance should pick a random instance
> commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c
> {code}





[jira] [Resolved] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-12 Thread David McLaughlin (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David McLaughlin resolved AURORA-1791.
--
Resolution: Fixed

Commits reverted on master. 

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit message,
> {quote}
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the change to the meaning of 
> {{initial_interval_secs}} and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}
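The behavioural difference described above can be modelled with a short sketch. This is illustrative only: the function and its `grace_before_checks` flag are invented for this example and are not Aurora's health checker.

```python
def earliest_failure_secs(initial_interval_secs, interval_secs,
                          max_consecutive_failures, grace_before_checks):
    """Earliest time (in seconds) at which an always-failing task can be
    declared unhealthy, under the two interpretations of
    initial_interval_secs."""
    # 0.16.0 semantics: the first health check fires only after the
    # initial interval expires. Post-ca683 semantics: checks start at once.
    first_check = initial_interval_secs if grace_before_checks else 0
    # Each additional consecutive failure costs one more interval.
    return first_check + (max_consecutive_failures - 1) * interval_secs

# The config from this ticket: initial_interval_secs=10, interval_secs=5,
# max_consecutive_failures=1.
old = earliest_failure_secs(10, 5, 1, grace_before_checks=True)   # 10 seconds
new = earliest_failure_secs(10, 5, 1, grace_before_checks=False)  # immediately
```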





[jira] [Commented] (AURORA-1793) Revert Commit ca683 which is not backwards compatible

2016-10-12 Thread David McLaughlin (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569886#comment-15569886
 ] 

David McLaughlin commented on AURORA-1793:
--

https://reviews.apache.org/r/52806/

> Revert Commit ca683 which is not backwards compatible
> -
>
> Key: AURORA-1793
> URL: https://issues.apache.org/jira/browse/AURORA-1793
> Project: Aurora
>  Issue Type: Bug
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 is not backwards 
> compatible. We decided to revert this commit.
> The change that directly causes the problem is:
> {code}
> Modify executor state transition logic to rely on health checks (if enabled).
> commit ca683cb9e27bae76424a687bc6c3af5a73c501b9
> {code}
> There are two downstream commits that depend on the above commit:
> {code}
> Add min_consecutive_health_checks in HealthCheckConfig
> commit ed72b1bf662d1e29d2bb483b317c787630c26a9e
> {code}
> {code}
> Add support for receiving min_consecutive_successes in health checker
> commit e91130e49445c3933b6e27f5fde18c3a0e61b87a
> {code}
> We will drop all three of these commits and revert to the commit immediately 
> before the problematic one:
> {code}
> Running task ssh without an instance should pick a random instance
> commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c
> {code}





[jira] [Commented] (AURORA-1789) namespaces/pid isolator causes lost process

2016-10-12 Thread Justin Pinkul (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569686#comment-15569686
 ] 

Justin Pinkul commented on AURORA-1789:
---

This review catches this error and raises a useful message: 
https://reviews.apache.org/r/52804/
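The gist of such a fail-fast check might look like the sketch below (illustrative, not the actual review; the function name and messages are invented):

```python
import os


def validate_containerizer_path(path):
    """Reject a bad --mesos_containerizer_path up front with a clear
    error, instead of letting thermos fork coordinators that silently
    go LOST in a loop."""
    if not os.path.isfile(path):
        raise ValueError(
            "--mesos_containerizer_path %r does not exist" % path)
    if not os.access(path, os.X_OK):
        raise ValueError(
            "--mesos_containerizer_path %r is not executable" % path)
    return path
```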

> namespaces/pid isolator causes lost process
> ---
>
> Key: AURORA-1789
> URL: https://issues.apache.org/jira/browse/AURORA-1789
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Zameer Manji
>
> When using the Mesos containerizer with the namespaces/pid isolator and a Docker 
> image, the Thermos executor is unable to launch processes. The executor forks 
> the process but is then unable to locate it after the fork.
> {code:title=thermos_runner.INFO}
> I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789782.842882)
> I1006 21:37:22.931456 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1144] completed.
> I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789842.935872)
> I1006 21:38:23.025332 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1157] completed.
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789903.029694)
> I1006 21:39:23.118841 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1170] completed.
> I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789963.123206)
> I1006 21:40:23.212711 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1183] completed.
> I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790023.21709)
> I1006 21:41:23.307157 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1196] completed.
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790083.311512)
> I1006 21:42:23.399893 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1209] completed.
> I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> {code}





[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-12 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569618#comment-15569618
 ] 

Kai Huang commented on AURORA-1791:
---

The ticket to track is:  https://issues.apache.org/jira/browse/AURORA-1793

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit message,
> {quote}
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the change to the meaning of 
> {{initial_interval_secs}} and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}





[jira] [Updated] (AURORA-1793) Revert Commit ca683 which is not backwards compatible

2016-10-12 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1793:
--
Assignee: Kai Huang

> Revert Commit ca683 which is not backwards compatible
> -
>
> Key: AURORA-1793
> URL: https://issues.apache.org/jira/browse/AURORA-1793
> Project: Aurora
>  Issue Type: Bug
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 is not backwards 
> compatible. We decided to revert this commit.
> The change that directly causes the problem is:
> {code}
> Modify executor state transition logic to rely on health checks (if enabled).
> commit ca683cb9e27bae76424a687bc6c3af5a73c501b9
> {code}
> There are two downstream commits that depend on the above commit:
> {code}
> Add min_consecutive_health_checks in HealthCheckConfig
> commit ed72b1bf662d1e29d2bb483b317c787630c26a9e
> {code}
> {code}
> Add support for receiving min_consecutive_successes in health checker
> commit e91130e49445c3933b6e27f5fde18c3a0e61b87a
> {code}
> We will drop all three of these commits and revert to the commit immediately 
> before the problematic one:
> {code}
> Running task ssh without an instance should pick a random instance
> commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c
> {code}





[jira] [Updated] (AURORA-1793) Revert Commit ca683 which is not backwards compatible

2016-10-12 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1793:
--
Description: 
The commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 is not backwards 
compatible. We decided to revert this commit.

The change that directly causes the problem is:
{code}
Modify executor state transition logic to rely on health checks (if enabled).
commit ca683cb9e27bae76424a687bc6c3af5a73c501b9
{code}

There are two downstream commits that depend on the above commit:
{code}
Add min_consecutive_health_checks in HealthCheckConfig
commit ed72b1bf662d1e29d2bb483b317c787630c26a9e
{code}
{code}
Add support for receiving min_consecutive_successes in health checker
commit e91130e49445c3933b6e27f5fde18c3a0e61b87a
{code}
We will drop all three of these commits and revert to the commit immediately 
before the problematic one:
{code}
Running task ssh without an instance should pick a random instance
commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c
{code}

> Revert Commit ca683 which is not backwards compatible
> -
>
> Key: AURORA-1793
> URL: https://issues.apache.org/jira/browse/AURORA-1793
> Project: Aurora
>  Issue Type: Bug
>Reporter: Kai Huang
>Priority: Blocker
>
> The commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 is not backwards 
> compatible. We decided to revert this commit.
> The change that directly causes the problem is:
> {code}
> Modify executor state transition logic to rely on health checks (if enabled).
> commit ca683cb9e27bae76424a687bc6c3af5a73c501b9
> {code}
> There are two downstream commits that depend on the above commit:
> {code}
> Add min_consecutive_health_checks in HealthCheckConfig
> commit ed72b1bf662d1e29d2bb483b317c787630c26a9e
> {code}
> {code}
> Add support for receiving min_consecutive_successes in health checker
> commit e91130e49445c3933b6e27f5fde18c3a0e61b87a
> {code}
> We will drop all three of these commits and revert to the commit immediately 
> before the problematic one:
> {code}
> Running task ssh without an instance should pick a random instance
> commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c
> {code}





[jira] [Created] (AURORA-1793) Revert

2016-10-12 Thread Kai Huang (JIRA)
Kai Huang created AURORA-1793:
-

 Summary: Revert 
 Key: AURORA-1793
 URL: https://issues.apache.org/jira/browse/AURORA-1793
 Project: Aurora
  Issue Type: Bug
Reporter: Kai Huang
Priority: Blocker








[jira] [Updated] (AURORA-1793) Revert Commit ca683 which is not backwards compatible

2016-10-12 Thread Kai Huang (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kai Huang updated AURORA-1793:
--
Summary: Revert Commit ca683 which is not backwards compatible  (was: 
Revert )

> Revert Commit ca683 which is not backwards compatible
> -
>
> Key: AURORA-1793
> URL: https://issues.apache.org/jira/browse/AURORA-1793
> Project: Aurora
>  Issue Type: Bug
>Reporter: Kai Huang
>Priority: Blocker
>






[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-12 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569600#comment-15569600
 ] 

Kai Huang commented on AURORA-1791:
---

We've decided to revert the commit. 

The change that directly causes the problem is:

Modify executor state transition logic to rely on health checks (if enabled).
commit ca683cb9e27bae76424a687bc6c3af5a73c501b9

There are two downstream commits that depend on the above commit:

Add min_consecutive_health_checks in HealthCheckConfig
commit ed72b1bf662d1e29d2bb483b317c787630c26a9e

Add support for receiving min_consecutive_successes in health checker
commit e91130e49445c3933b6e27f5fde18c3a0e61b87a

We will drop all three of these commits and revert to the commit immediately 
before the problematic one:
Running task ssh without an instance should pick a random instance
commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c

I will create a separate ticket for people to track the reversion.



> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit message,
> {quote}
> 4. Modified the Health Checker and redefined the meaning of 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the change to the meaning of 
> {{initial_interval_secs}} and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1504) aurora job inspect should have a --write-json option

2016-10-12 Thread Jing Chen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jing Chen reassigned AURORA-1504:
-

Assignee: Jing Chen

> aurora job inspect should have a --write-json option
> 
>
> Key: AURORA-1504
> URL: https://issues.apache.org/jira/browse/AURORA-1504
> Project: Aurora
>  Issue Type: Story
>  Components: Client
>Reporter: brian wickman
>Assignee: Jing Chen
>
> {{aurora update start}} has a {{--read-json}} option, but there's no way with 
> the client to actually synthesize compatible json from a job_key / config 
> pair.
> we should have {{aurora job inspect --write-json}} or possibly new {{aurora 
> config read/write}} commands that allow users to build better automation 
> stories around reified configurations as json blobs.
> the complications here are binding helpers that add metadata to 
> JobConfiguration which get lost.  we might need a higher-level json schema 
> that contains these extra fields.





[jira] [Resolved] (AURORA-1737) Descheduling a cron job checks role access before job key existence

2016-10-12 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen resolved AURORA-1737.
--
Resolution: Cannot Reproduce

I can no longer reproduce this problem. Not sure what changed to fix it, but 
resolving as cannot reproduce. If it comes back we can re-open.

> Descheduling a cron job checks role access before job key existence
> ---
>
> Key: AURORA-1737
> URL: https://issues.apache.org/jira/browse/AURORA-1737
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Joshua Cohen
>Assignee: Jing Chen
>Priority: Minor
>
> Trying to deschedule a cron job for a non-existent role returns a permission 
> error rather than a no-such-job error. This leads to confusion for users in 
> the event of a typo in the role.
> Given that jobs are world-readable, we should check for a valid job key 
> before applying permissions.





[jira] [Commented] (AURORA-1785) Populate curator latches with scheduler information

2016-10-12 Thread John Sirois (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15568927#comment-15568927
 ] 

John Sirois commented on AURORA-1785:
-

Yes - it's too much info, but enough.  In particular, the two {{"host": 
"aurora.local"}} entries give the contender hostname, which should be unique 
across election candidate machines.
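For reference, the contender hostname can be pulled out of a ServerSet entry like the one quoted in this thread with a few lines of JSON parsing. This is only an illustrative sketch of reading the payload, not scheduler code:

```python
import json

# ServerSet payload as quoted in this thread.
payload = (
    '{"serviceEndpoint":{"host":"aurora.local","port":8081},'
    '"additionalEndpoints":{"http":{"host":"aurora.local","port":8081}},'
    '"status":"ALIVE"}'
)

entry = json.loads(payload)
# The primary endpoint's host identifies the contending machine.
host = entry["serviceEndpoint"]["host"]
print(host)  # aurora.local
```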

> Populate curator latches with scheduler information
> ---
>
> Key: AURORA-1785
> URL: https://issues.apache.org/jira/browse/AURORA-1785
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Jing Chen
>Priority: Minor
>  Labels: newbie
>
> If you look at the mesos ZK node for leader election you see something like 
> this:
> {noformat}
>  u'json.info_000104',
>  u'json.info_000102',
>  u'json.info_000101',
>  u'json.info_98',
>  u'json.info_97'
> {noformat}
> Each of these nodes contains data about the machine contending for 
> leadership. It is a JSON serialized {{MasterInfo}} protobuf. This means an 
> operator can inspect who is contending for leadership by checking the content 
> of the nodes.
> When you check the aurora ZK node you see something like this:
> {noformat}
>  u'_c_2884a0d3-b5b0-4445-b8d6-b271a6df6220-latch-000774',
>  u'_c_86a21335-c5a2-4bcb-b471-4ce128b67616-latch-000776',
>  u'_c_a4f8b0f7-d063-4df2-958b-7b3e6f666a95-latch-000775',
>  u'_c_120cd9da-3bc1-495b-b02f-2142fb22c0a0-latch-000784',
>  u'_c_46547c31-c5c2-4fb1-8a53-237e3cb0292f-latch-000780',
>  u'member_000781'
> {noformat}
> Only the leader node contains information. The curator latches contain no 
> information. It is not possible to figure out which machines are contending 
> for leadership purely from ZK.
> I think we should attach data to the latches like mesos.
> Being able to do this is invaluable to debug issues if an extra master is 
> added to the cluster.





[jira] [Comment Edited] (AURORA-1785) Populate curator latches with scheduler information

2016-10-12 Thread Jing Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567793#comment-15567793
 ] 

Jing Chen edited comment on AURORA-1785 at 10/12/16 8:02 AM:
-

Is the ServerSet information enough to uniquely identify a contender? The 
information looks like:
{noformat}
{"serviceEndpoint":{"host":"aurora.local","port":8081},"additionalEndpoints":{"http":{"host":"aurora.local","port":8081}},"status":"ALIVE"}
{noformat}


was (Author: jingc):
Is ServerSet information enough to identify uniquely contender? The information 
is like:
{"serviceEndpoint":{"host":"aurora.local","port":8081},"additionalEndpoints":{"http":{"host":"aurora.local","port":8081}},"status":"ALIVE"}

> Populate curator latches with scheduler information
> ---
>
> Key: AURORA-1785
> URL: https://issues.apache.org/jira/browse/AURORA-1785
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Jing Chen
>Priority: Minor
>  Labels: newbie
>
> If you look at the mesos ZK node for leader election you see something like 
> this:
> {noformat}
>  u'json.info_000104',
>  u'json.info_000102',
>  u'json.info_000101',
>  u'json.info_98',
>  u'json.info_97'
> {noformat}
> Each of these nodes contains data about the machine contending for 
> leadership. It is a JSON serialized {{MasterInfo}} protobuf. This means an 
> operator can inspect who is contending for leadership by checking the content 
> of the nodes.
> When you check the aurora ZK node you see something like this:
> {noformat}
>  u'_c_2884a0d3-b5b0-4445-b8d6-b271a6df6220-latch-000774',
>  u'_c_86a21335-c5a2-4bcb-b471-4ce128b67616-latch-000776',
>  u'_c_a4f8b0f7-d063-4df2-958b-7b3e6f666a95-latch-000775',
>  u'_c_120cd9da-3bc1-495b-b02f-2142fb22c0a0-latch-000784',
>  u'_c_46547c31-c5c2-4fb1-8a53-237e3cb0292f-latch-000780',
>  u'member_000781'
> {noformat}
> Only the leader node contains information. The curator latches contain no 
> information. It is not possible to figure out which machines are contending 
> for leadership purely from ZK.
> I think we should attach data to the latches like mesos.
> Being able to do this is invaluable to debug issues if an extra master is 
> added to the cluster.





[jira] [Commented] (AURORA-1737) Descheduling a cron job checks role access before job key existence

2016-10-12 Thread Jing Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567979#comment-15567979
 ] 

Jing Chen commented on AURORA-1737:
---

Hi Joshua, I created a cron job as follows:
{code:title=cron_hello_jing.aurora|borderStyle=solid}
jobs = [
  Job(
cluster = 'devcluster',
role = 'www-data',
environment = 'devel',
name = 'cron_hello_jing',
cron_schedule = '*/5 * * * *',
cron_collision_policy='CANCEL_NEW',
task = Task(
  name="cron_hello_jing",
  processes=[Process(name="hello_jing",
   cmdline="echo 'hello jing'")],
  resources=Resources(cpu=1, ram=1*MB, disk=8*MB)
)
  )
]
{code}
However, when I try to deschedule the cron job with a non-existent role:
{noformat}
aurora cron deschedule devcluster/abc/devel/cron_hello_jing
{noformat}
it returns:
{quote}
 INFO] Removing cron schedule for job 
devcluster/{color:red}abc{color}/devel/cron_hello_world
 INFO] Job abc/devel/cron_hello_world is not scheduled with cron
Cron descheduling succeeded.
{quote}

Can you tell me how to reproduce the bug? 

Thanks
Jing

> Descheduling a cron job checks role access before job key existence
> ---
>
> Key: AURORA-1737
> URL: https://issues.apache.org/jira/browse/AURORA-1737
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Joshua Cohen
>Assignee: Jing Chen
>Priority: Minor
>
> Trying to deschedule a cron job for a non-existent role returns a permission 
> error rather than a no-such-job error. This leads to confusion for users in 
> the event of a typo in the role.
> Given that jobs are world-readable, we should check for a valid job key 
> before applying permissions.





[jira] [Comment Edited] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-12 Thread David McLaughlin (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567954#comment-15567954
 ] 

David McLaughlin edited comment on AURORA-1791 at 10/12/16 7:50 AM:


Given the lack of test coverage I've found just looking at a single function, I 
would seriously recommend we roll back the commit (or will it be commits?) 
rather than rush a patch in order to fix master. Any objections? cc/ [~zmanji] 
and [~joshua.cohen] 


was (Author: davmclau):
Given the lack of test coverage I've found just looking at a single function, I 
would seriously recommend we roll back the commit (or will it be commits?) 
rather than rush a patch in order to fix master. Any objections? cc/ [~zmanji] 
and [~jcohen] 

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}





[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-12 Thread David McLaughlin (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567954#comment-15567954
 ] 

David McLaughlin commented on AURORA-1791:
--

Given the lack of test coverage I've found just looking at a single function, I 
would seriously recommend we roll back the commit (or will it be commits?) 
rather than rush a patch in order to fix master. Any objections? cc/ [~zmanji] 
and [~jcohen] 

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}





[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-12 Thread David McLaughlin (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567886#comment-15567886
 ] 

David McLaughlin commented on AURORA-1791:
--

I don't think so. There are easy-to-fix errors in this code without having to 
move to any event-driven approach. 

The log lines that Zameer posted are triggered during a run in which the health 
checker is reporting healthy. The main logic error is that we take a 
*pessimistic approach* to checking interval expiration. 

Specifically, this block:

{code}
if not self._expired:
  if self.clock.time() - self.start_time > self.initial_interval:
    log.debug('Initial interval expired.')
    self._expired = True
    if not self.health_check_passed:
      log.warning('Failed to reach minimum consecutive successes.')
      self.healthy = False
  else:
    if self.current_consecutive_successes >= self.min_consecutive_successes:
      log.info('Reached minimum consecutive successes.')
      self.health_check_passed = True
{code}

We could be in a situation where current_consecutive_successes meets the 
minimum criteria but we decide to expire if we're even a millisecond over the 
interval.  You could rewrite this as:

{code}
if not self._expired:
  if self.current_consecutive_successes >= self.min_consecutive_successes:
    log.info('Reached minimum consecutive successes.')
    self.health_check_passed = True

  if self.clock.time() - self.start_time > self.initial_interval:
    log.debug('Initial interval expired.')
    self._expired = True
    if not self.health_check_passed:
      log.warning('Failed to reach minimum consecutive successes.')
      self.healthy = False
{code}

And I think as long as the current healthiness meets the minimum consecutive 
successes, the task would enter RUNNING state. 
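The effect of the reordering can be demonstrated with a stripped-down model of the expiration logic (a hypothetical sketch with simplified state, not the actual executor code): under the pessimistic ordering, a success recorded on the same tick that the interval expires is ignored, while the reordered version credits it first.

```python
class Checker:
    """Stripped-down, hypothetical model of the expiration logic."""

    def __init__(self, initial_interval, min_successes, pessimistic):
        self.initial_interval = initial_interval
        self.min_consecutive_successes = min_successes
        self.pessimistic = pessimistic
        self.health_check_passed = False
        self.healthy = True
        self._expired = False

    def step(self, now, consecutive_successes):
        if self._expired:
            return
        if self.pessimistic:
            # Original ordering: expiration is checked before successes count.
            if now > self.initial_interval:
                self._expired = True
                if not self.health_check_passed:
                    self.healthy = False
            elif consecutive_successes >= self.min_consecutive_successes:
                self.health_check_passed = True
        else:
            # Reordered: credit successes first, then check expiration.
            if consecutive_successes >= self.min_consecutive_successes:
                self.health_check_passed = True
            if now > self.initial_interval:
                self._expired = True
                if not self.health_check_passed:
                    self.healthy = False


# One health-check tick arriving just after the initial interval, with
# enough consecutive successes already recorded.
a = Checker(initial_interval=5, min_successes=1, pessimistic=True)
a.step(now=5.001, consecutive_successes=1)

b = Checker(initial_interval=5, min_successes=1, pessimistic=False)
b.step(now=5.001, consecutive_successes=1)

print(a.healthy, b.healthy)  # False True
```

With identical inputs, only the ordering of the two checks differs, yet the pessimistic variant fails the task.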

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}





[jira] [Comment Edited] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-12 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567768#comment-15567768
 ] 

Kai Huang edited comment on AURORA-1791 at 10/12/16 6:43 AM:
-

To sum up, the issue is caused by failing to reach min_consecutive_successes, 
not by exceeding max_consecutive_failures. 

In commit ca683, I keep updating the failure counter but ignore it until 
initial_interval_secs expires. This does not cause any problems, but it is not 
clear to readers. I've changed it to: only update the failure counter after 
initial_interval_secs expires.

For the root cause of the issue, min_consecutive_successes, we have two options 
here:

(a) Do health checks periodically as defined. Even if initial_interval_secs 
expires and the minimum number of successes has not been reached (because the 
periodic checks will miss some successes), we do not fail the health check 
right away. Instead, we rely on the latest health check to ensure the task is 
already in a healthy state. 

(b) Do an additional health check whenever initial_interval_secs expires.

In my recent review request, I implemented (a). This is based on the assumption 
that if a task responds OK before initial_interval_secs expires, it will still 
respond OK on the next health check. However, it's possible the task does not 
respond OK until we perform this additional health check. It's highly likely 
the instance will be healthy afterwards, but should we fail the health check 
according to the definition?
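To make the timing concrete, a back-of-the-envelope bound (a simplifying model, not executor code; it assumes the first check fires at t=0 and then every interval_secs) shows how few results can land before the initial interval expires:

```python
import math

def max_observable_successes(initial_interval_secs, interval_secs):
    """Upper bound on health-check results observed before the initial
    interval expires, assuming checks at t = 0, interval_secs, 2*interval_secs,
    ... (a simplifying assumption for illustration only)."""
    return math.floor(initial_interval_secs / interval_secs) + 1

# With the config from the investigation above (initial_interval_secs=5,
# interval_secs=5), at most 2 results can land before expiration, and
# jitter or a slow endpoint can easily reduce that to 1 - so a
# min_consecutive_successes of 2 or more may be unreachable in time.
print(max_observable_successes(5, 5))   # 2
print(max_observable_successes(10, 5))  # 3
```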


was (Author: kaih):
To sum up, the issue is caused by failed to reach min_consecutive_successes, 
not exceeding max_consecutive_failures. 

In commit ca683, I keep updating the failure counter but only ignores it until 
initial_interval_secs expires. This does not cause any problem but does not 
seem clear to people. I've changed it to:  updating failure counter after 
initial_interval_secs expires.

For the root cause of the issue, min_consecutive_successes, we have two options 
here:

(a) Doing health checks periodically as defined. Even initial_interval_secs 
expires and min successes is not reached (because periodic check will miss some 
successes), we do not fail health check right away. Instead, we will rely on 
the latest health check to ensure the task has already been in healthy state. 

(b) Doing an additional health check whenever initial_interval_secs expires.

In my recent review request, I implemented (a). This is based on the assumption 
that if a task responds OK before initial_interval_secs expires, for next 
health check, it will still responds OK. However, it's likely the task fails to 
respond OK until we perform this additional health check. It's highly likely 
the instance will be healthy afterwards, but we should fail the health check 
according to the definition?

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}





[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-12 Thread Kai Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567803#comment-15567803
 ] 

Kai Huang commented on AURORA-1791:
---

An issue with implementing (b) is that the health checker thread might be 
sleeping when initial_interval_secs expires.

We would need an event-driven mechanism to notify the health checker to wake up 
and do a health check when initial_interval_secs expires. This seems to require 
a lot of refactoring.
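One lightweight way to get that wake-up without restructuring the whole checker would be a one-shot timer that fires an extra check at the expiration boundary while the periodic loop keeps its normal schedule. This is only a hypothetical sketch of the idea (the function name and parameters are invented for illustration), not a proposed patch:

```python
import threading
import time

def run_checker(initial_interval_secs, interval_secs, do_health_check, stop):
    """Hypothetical sketch of option (b): a one-shot timer performs an
    extra health check exactly when the initial interval expires, even if
    the periodic loop below is still sleeping at that moment."""
    timer = threading.Timer(initial_interval_secs, do_health_check)
    timer.start()
    while not stop.is_set():
        do_health_check()
        # stop.wait doubles as the periodic sleep and a prompt shutdown.
        stop.wait(timeout=interval_secs)
    timer.cancel()

# Tiny demo with sped-up intervals: the timer guarantees a check lands at
# the (simulated) initial-interval boundary of 0.05s even though the loop
# only wakes every 0.2s.
checks = []
stop = threading.Event()
worker = threading.Thread(
    target=run_checker,
    args=(0.05, 0.2, lambda: checks.append(time.monotonic()), stop))
worker.start()
time.sleep(0.5)
stop.set()
worker.join(timeout=2)
# Expect at least: the t=0 check, the timer check at ~0.05s, and a
# periodic check at ~0.2s.
```

The trade-off is an extra thread per task, but it avoids rewriting the checker loop itself around an event queue.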

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}





[jira] [Comment Edited] (AURORA-1785) Populate curator latches with scheduler information

2016-10-12 Thread Jing Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567793#comment-15567793
 ] 

Jing Chen edited comment on AURORA-1785 at 10/12/16 6:36 AM:
-

Is the ServerSet information enough to uniquely identify a contender? The 
information looks like:
'''{"serviceEndpoint":{"host":"aurora.local","port":8081},"additionalEndpoints":{"http":{"host":"aurora.local","port":8081}},"status":"ALIVE"}'''
 


was (Author: jingc):
from what I get, is ServerSet information 
`{"serviceEndpoint":{"host":"aurora.local","port":8081},"additionalEndpoints":{"http":{"host":"aurora.local","port":8081}},"status":"ALIVE"}`
  enough to identify uniquely contender?

> Populate curator latches with scheduler information
> ---
>
> Key: AURORA-1785
> URL: https://issues.apache.org/jira/browse/AURORA-1785
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Jing Chen
>Priority: Minor
>  Labels: newbie
>
> If you look at the mesos ZK node for leader election you see something like 
> this:
> {noformat}
>  u'json.info_000104',
>  u'json.info_000102',
>  u'json.info_000101',
>  u'json.info_98',
>  u'json.info_97'
> {noformat}
> Each of these nodes contains data about the machine contending for 
> leadership. It is a JSON serialized {{MasterInfo}} protobuf. This means an 
> operator can inspect who is contending for leadership by checking the content 
> of the nodes.
> When you check the aurora ZK node you see something like this:
> {noformat}
>  u'_c_2884a0d3-b5b0-4445-b8d6-b271a6df6220-latch-000774',
>  u'_c_86a21335-c5a2-4bcb-b471-4ce128b67616-latch-000776',
>  u'_c_a4f8b0f7-d063-4df2-958b-7b3e6f666a95-latch-000775',
>  u'_c_120cd9da-3bc1-495b-b02f-2142fb22c0a0-latch-000784',
>  u'_c_46547c31-c5c2-4fb1-8a53-237e3cb0292f-latch-000780',
>  u'member_000781'
> {noformat}
> Only the leader node contains information. The curator latches contain no 
> information. It is not possible to figure out which machines are contending 
> for leadership purely from ZK.
> I think we should attach data to the latches like mesos.
> Being able to do this is invaluable to debug issues if an extra master is 
> added to the cluster.





[jira] [Comment Edited] (AURORA-1785) Populate curator latches with scheduler information

2016-10-12 Thread Jing Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567793#comment-15567793
 ] 

Jing Chen edited comment on AURORA-1785 at 10/12/16 6:37 AM:
-

Is the ServerSet information enough to uniquely identify a contender? The 
information looks like:
{"serviceEndpoint":{"host":"aurora.local","port":8081},"additionalEndpoints":{"http":{"host":"aurora.local","port":8081}},"status":"ALIVE"}


was (Author: jingc):
Is ServerSet information enough to identify uniquely contender? The information 
is like:
'''{"serviceEndpoint":{"host":"aurora.local","port":8081},"additionalEndpoints":{"http":{"host":"aurora.local","port":8081}},"status":"ALIVE"}'''
 

> Populate curator latches with scheduler information
> ---
>
> Key: AURORA-1785
> URL: https://issues.apache.org/jira/browse/AURORA-1785
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Jing Chen
>Priority: Minor
>  Labels: newbie
>
> If you look at the mesos ZK node for leader election you see something like 
> this:
> {noformat}
>  u'json.info_000104',
>  u'json.info_000102',
>  u'json.info_000101',
>  u'json.info_98',
>  u'json.info_97'
> {noformat}
> Each of these nodes contains data about the machine contending for 
> leadership. It is a JSON serialized {{MasterInfo}} protobuf. This means an 
> operator can inspect who is contending for leadership by checking the content 
> of the nodes.
> When you check the aurora ZK node you see something like this:
> {noformat}
>  u'_c_2884a0d3-b5b0-4445-b8d6-b271a6df6220-latch-000774',
>  u'_c_86a21335-c5a2-4bcb-b471-4ce128b67616-latch-000776',
>  u'_c_a4f8b0f7-d063-4df2-958b-7b3e6f666a95-latch-000775',
>  u'_c_120cd9da-3bc1-495b-b02f-2142fb22c0a0-latch-000784',
>  u'_c_46547c31-c5c2-4fb1-8a53-237e3cb0292f-latch-000780',
>  u'member_000781'
> {noformat}
> Only the leader node contains information. The curator latches contain no 
> information. It is not possible to figure out which machines are contending 
> for leadership purely from ZK.
> I think we should attach data to the latches like mesos.
> Being able to do this is invaluable to debug issues if an extra master is 
> added to the cluster.





[jira] [Commented] (AURORA-1785) Populate curator latches with scheduler information

2016-10-12 Thread Jing Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15567793#comment-15567793
 ] 

Jing Chen commented on AURORA-1785:
---

From what I can tell, is the ServerSet information 
`{"serviceEndpoint":{"host":"aurora.local","port":8081},"additionalEndpoints":{"http":{"host":"aurora.local","port":8081}},"status":"ALIVE"}`
 enough to uniquely identify a contender?

> Populate curator latches with scheduler information
> ---
>
> Key: AURORA-1785
> URL: https://issues.apache.org/jira/browse/AURORA-1785
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Jing Chen
>Priority: Minor
>  Labels: newbie
>
> If you look at the mesos ZK node for leader election you see something like 
> this:
> {noformat}
>  u'json.info_000104',
>  u'json.info_000102',
>  u'json.info_000101',
>  u'json.info_98',
>  u'json.info_97'
> {noformat}
> Each of these nodes contains data about the machine contending for 
> leadership. It is a JSON serialized {{MasterInfo}} protobuf. This means an 
> operator can inspect who is contending for leadership by checking the content 
> of the nodes.
> When you check the aurora ZK node you see something like this:
> {noformat}
>  u'_c_2884a0d3-b5b0-4445-b8d6-b271a6df6220-latch-000774',
>  u'_c_86a21335-c5a2-4bcb-b471-4ce128b67616-latch-000776',
>  u'_c_a4f8b0f7-d063-4df2-958b-7b3e6f666a95-latch-000775',
>  u'_c_120cd9da-3bc1-495b-b02f-2142fb22c0a0-latch-000784',
>  u'_c_46547c31-c5c2-4fb1-8a53-237e3cb0292f-latch-000780',
>  u'member_000781'
> {noformat}
> Only the leader node contains information. The curator latches contain no 
> information. It is not possible to figure out which machines are contending 
> for leadership purely from ZK.
> I think we should attach data to the latches like mesos.
> Being able to do this is invaluable to debug issues if an extra master is 
> added to the cluster.


