[jira] [Updated] (AURORA-1788) vagrant up does not properly configure network adapters

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1788:

Fix Version/s: 0.17.0

> vagrant up does not properly configure network adapters
> ---
>
> Key: AURORA-1788
> URL: https://issues.apache.org/jira/browse/AURORA-1788
> Project: Aurora
>  Issue Type: Bug
>Reporter: Andrew Jorgensen
>Assignee: Andrew Jorgensen
> Fix For: 0.17.0
>
>
> I am not sure exactly why this happens, but on Vagrant 1.8.6 the network 
> interface does not come up correctly: the private_network address is attached 
> to the eth0 NAT interface rather than to the host-only interface. I tried a 
> number of different parameters, but none of them configured the network 
> correctly. This change manually configures the static IP so that it is 
> attached to the correct adapter. Without this change I could not access the 
> Aurora web interface after running vagrant up.
> I've created a patch here: https://reviews.apache.org/r/52609/
> This is what the configuration looks like when run off master:
> {code}
> ip addr
> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group 
> default
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> inet 127.0.0.1/8 scope host lo
>valid_lft forever preferred_lft forever
> inet6 ::1/128 scope host
>valid_lft forever preferred_lft forever
> 2: eth0:  mtu 1500 qdisc pfifo_fast state UP 
> group default qlen 1000
> link/ether 08:00:27:b3:1b:30 brd ff:ff:ff:ff:ff:ff
> inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
>valid_lft forever preferred_lft forever
> inet 192.168.33.7/24 brd 192.168.33.255 scope global eth1
>valid_lft forever preferred_lft forever
> inet6 fe80::a00:27ff:feb3:1b30/64 scope link
>valid_lft forever preferred_lft forever
> 3: eth1:  mtu 1500 qdisc pfifo_fast state 
> DOWN group default
> link/ether 08:00:27:7c:4e:72 brd ff:ff:ff:ff:ff:ff
> 4: docker0:  mtu 1500 qdisc noqueue state 
> DOWN group default
> link/ether 02:42:f6:de:a3:ca brd ff:ff:ff:ff:ff:ff
> inet 172.17.0.1/16 scope global docker0
>valid_lft forever preferred_lft forever
> {code}
> Here is what it is supposed to look like:
> {code}
> ip addr
> 1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group 
> default
> link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> inet 127.0.0.1/8 scope host lo
>valid_lft forever preferred_lft forever
> inet6 ::1/128 scope host
>valid_lft forever preferred_lft forever
> 2: eth0:  mtu 1500 qdisc pfifo_fast state UP 
> group default qlen 1000
> link/ether 08:00:27:b3:1b:30 brd ff:ff:ff:ff:ff:ff
> inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
>valid_lft forever preferred_lft forever
> inet6 fe80::a00:27ff:feb3:1b30/64 scope link
>valid_lft forever preferred_lft forever
> 3: eth1:  mtu 1500 qdisc pfifo_fast state UP 
> group default qlen 1000
> link/ether 08:00:27:7c:4e:72 brd ff:ff:ff:ff:ff:ff
> inet 192.168.33.7/24 brd 192.168.33.255 scope global eth1
>valid_lft forever preferred_lft forever
> inet6 fe80::a00:27ff:fe7c:4e72/64 scope link
>valid_lft forever preferred_lft forever
> 4: docker0:  mtu 1500 qdisc noqueue state 
> DOWN group default
> link/ether 02:42:f6:de:a3:ca brd ff:ff:ff:ff:ff:ff
> inet 172.17.0.1/16 scope global docker0
>valid_lft forever preferred_lft forever
> {code}
> Steps to reproduce:
> 1. Update to vagrant 1.8.6 (unsure if previous versions are affected as well)
> 2. Run `vagrant up`
> 3. Try to visit http://192.168.33.7:8081
> Expected outcome:
> I expect that following the steps in 
> http://aurora.apache.org/documentation/latest/getting-started/vagrant/ I 
> would be able to visit the web interface for aurora.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1224) Add a new "min_consecutive_health_checks" setting in .aurora config

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1224:

Fix Version/s: 0.17.0

> Add a new "min_consecutive_health_checks" setting in .aurora config
> ---
>
> Key: AURORA-1224
> URL: https://issues.apache.org/jira/browse/AURORA-1224
> Project: Aurora
>  Issue Type: Task
>  Components: Client, Scheduler
>Reporter: Maxim Khutornenko
>Assignee: Kai Huang
> Fix For: 0.17.0
>
>
> HealthCheckConfig should accept a new configuration value that will tell how 
> many positive consecutive health checks an instance requires to move from 
> STARTING to RUNNING.
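
The requested behaviour can be sketched in a few lines of Python. This is only an illustration of the idea, not the executor's code: the `HealthTracker` class and its `record` method are hypothetical names, and the state strings are taken from the description above.

```python
class HealthTracker(object):
    """Hypothetical helper: counts consecutive successful health checks."""

    def __init__(self, min_consecutive_health_checks=1):
        self.required = min_consecutive_health_checks
        self.consecutive_successes = 0
        self.state = 'STARTING'

    def record(self, healthy):
        # A failed check resets the streak; a successful one extends it.
        self.consecutive_successes = self.consecutive_successes + 1 if healthy else 0
        # Only the STARTING -> RUNNING transition is modelled here.
        if self.state == 'STARTING' and self.consecutive_successes >= self.required:
            self.state = 'RUNNING'
        return self.state
```

With `min_consecutive_health_checks=3`, a failure after the first success restarts the count, so the instance stays in STARTING until three successes arrive in a row.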





[jira] [Updated] (AURORA-1786) -zk_session_timeout option does not work

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1786:

Fix Version/s: 0.17.0

> -zk_session_timeout option does not work
> 
>
> Key: AURORA-1786
> URL: https://issues.apache.org/jira/browse/AURORA-1786
> Project: Aurora
>  Issue Type: Bug
>Reporter: David Robinson
> Fix For: 0.17.0
>
>
> Looks like the -zk_session_timeout option has no effect. I've set 
> -zk_session_timeout="60mins" to attempt to work around ZK session timeouts 
> (due to GC pauses caused by TaskHistoryPruner pruning a huge number of 
> inactive tasks), but the default 30 seconds seems to always be used.
> {noformat}
> I0929 22:36:10.804 [main, ArgScanner:411] zk_chroot_path: null 
> I0929 22:36:10.804 [main, ArgScanner:411] zk_digest_credentials: : 
> I0929 22:36:10.805 [main, ArgScanner:411] zk_endpoints: [zk.example.com:2181] 
> I0929 22:36:10.805 [main, ArgScanner:411] zk_in_proc: false 
> I0929 22:36:10.805 [main, ArgScanner:411] zk_session_timeout: (30, mins) 
> I0929 22:36:10.805 [main, ArgScanner:411] zk_use_curator: true 
> {noformat}
> {noformat}
> I0929 22:48:37.678 [AsyncProcessor-3, TaskHistoryPruner:137] Pruning inactive 
> tasks 
> [mesos-test-healthy-daemon-19-3588-e2d79602-e354-4dc0-bfaa-b16d32e2b09d, 
> mesos-test-healthy-daemon-19-1551-b4b7e52f-f468-44ba-a1a9-ad3c95b602a3, 
> mesos-test-healthy-daemon-19-4105-ff87bef1-af09-4201-9cc2-863c8ece3621, 
> mesos-test-healthy-daemon-19-7416-66de9261-5fe5-47c4-be37-3dd5
> I0929 22:48:37.738 [AsyncProcessor-5, TaskHistoryPruner:137] Pruning inactive 
> tasks 
> [mesos-test-healthy-daemon-19-3588-e2d79602-e354-4dc0-bfaa-b16d32e2b09d, 
> mesos-test-healthy-daemon-19-1551-b4b7e52f-f468-44ba-a1a9-ad3c95b602a3, 
> mesos-test-healthy-daemon-19-4105-ff87bef1-af09-4201-9cc2-863c8ece3621, 
> mesos-test-healthy-daemon-19-7416-66de9261-5fe5-47c4-be37-3dd5
> 2016-09-29 
> 22:48:37,794:47040(0x7f07f4c3c940):ZOO_WARN@zookeeper_interest@1570: Exceeded 
> deadline by 12ms
> I0929 22:48:37.805 [AsyncProcessor-0, TaskHistoryPruner:137] Pruning inactive 
> tasks 
> [mesos-test-healthy-daemon-19-3588-e2d79602-e354-4dc0-bfaa-b16d32e2b09d, 
> mesos-test-healthy-daemon-19-1551-b4b7e52f-f468-44ba-a1a9-ad3c95b602a3, 
> mesos-test-healthy-daemon-19-4105-ff87bef1-af09-4201-9cc2-863c8ece3621, 
> mesos-test-healthy-daemon-19-7416-66de9261-5fe5-47c4-be37-3dd5
> I0929 22:48:37.814 [AsyncProcessor-6, MemTaskStore:148] Query took 588 ms: 
> ITaskQuery{role=null, environment=null, jobName=null, taskIds=[], 
> statuses=[FINISHED, FAILED, KILLED, LOST], instanceIds=[], slaveHosts=[], 
> jobKeys=[IJobKey{role=mesos, environment=test, name=healthy-daemon-19}], 
> offset=0, limit=0} 
> I0929 22:48:37.867 [AsyncProcessor-1, TaskHistoryPruner:137] Pruning inactive 
> tasks 
> [mesos-test-healthy-daemon-19-3588-e2d79602-e354-4dc0-bfaa-b16d32e2b09d, 
> mesos-test-healthy-daemon-19-1551-b4b7e52f-f468-44ba-a1a9-ad3c95b602a3, 
> mesos-test-healthy-daemon-19-4105-ff87bef1-af09-4201-9cc2-863c8ece3621, 
> mesos-test-healthy-daemon-19-7416-66de9261-5fe5-47c4-be37-3dd5
> I0929 22:48:37.873 [AsyncProcessor-2, MemTaskStore:148] Query took 304 ms: 
> ITaskQuery{role=null, environment=null, jobName=null, taskIds=[], 
> statuses=[FINISHED, FAILED, KILLED, LOST], instanceIds=[], slaveHosts=[], 
> jobKeys=[IJobKey{role=mesos, environment=test, name=healthy-daemon-19}], 
> offset=0, limit=0} 
> I0929 22:48:37.875 [AsyncProcessor-7, MemTaskStore:148] Query took 289 ms: 
> ITaskQuery{role=null, environment=null, jobName=null, taskIds=[], 
> statuses=[FINISHED, FAILED, KILLED, LOST], instanceIds=[], slaveHosts=[], 
> jobKeys=[IJobKey{role=mesos, environment=test, name=healthy-daemon-19}], 
> offset=0, limit=0} 
> I0929 22:48:37.886 [AsyncProcessor-4, TaskHistoryPruner:137] Pruning inactive 
> tasks 
> [mesos-test-healthy-daemon-19-3588-e2d79602-e354-4dc0-bfaa-b16d32e2b09d, 
> mesos-test-healthy-daemon-19-1551-b4b7e52f-f468-44ba-a1a9-ad3c95b602a3, 
> mesos-test-healthy-daemon-19-4105-ff87bef1-af09-4201-9cc2-863c8ece3621, 
> mesos-test-healthy-daemon-19-7416-66de9261-5fe5-47c4-be37-3dd5
> I0929 22:48:38.045 [AsyncProcessor-3, MemTaskStore:148] Query took 359 ms: 
> ITaskQuery{role=null, environment=null, jobName=null, taskIds=[], 
> statuses=[FINISHED, FAILED, KILLED, LOST], instanceIds=[], slaveHosts=[], 
> jobKeys=[IJobKey{role=mesos, environment=test, name=healthy-daemon-19}], 
> offset=0, limit=0} 
> I0929 22:48:38.152 [AsyncProcessor-5, MemTaskStore:148] Query took 405 ms: 
> ITaskQuery{role=null, environment=null, jobName=null, taskIds=[], 
> statuses=[FINISHED, FAILED, KILLED, LOST], instanceIds=[], slaveHosts=[], 
> jobKeys=[IJobKey{role=mesos, environment=test, name=healthy-daemon-19}], 
> offset=0, limit=0} 
> I0929 22:48:38.407 [AsyncProcessor-0, MemTaskStore:148] 

[jira] [Updated] (AURORA-1878) Increased executor logs can lead to task's running out of disk space

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1878:

Fix Version/s: 0.17.0

> Increased executor logs can lead to task's running out of disk space
> 
>
> Key: AURORA-1878
> URL: https://issues.apache.org/jira/browse/AURORA-1878
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Joshua Cohen
>Assignee: Joshua Cohen
> Fix For: 0.17.0
>
>
> After the health check for updates patch, this log statement is being emitted 
> once every 500ms: 
> https://github.com/apache/aurora/commit/2992c8b4#diff-6d60c873330419a828fb992f46d53372R121
> This is due to this 
> [code|https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/status_checker.py#L120-L124]:
> {code}
> if status_result is not None:
>   log.info('%s reported %s' % (status_checker.__class__.__name__, status_result))
> {code}
> Previously, {{status_result}} would be {{None}} unless the status checker had 
> a terminal event. Now, {{status_result}} will always be set, but we only 
> consider the {{status_result}} to be terminal if the {{status}} is not 
> {{TASK_STARTING}} or {{TASK_RUNNING}}. So, for the healthy case, we log that 
> the task is {{TASK_RUNNING}} every 500ms.
> !https://frinkiac.com/meme/S10E02/818984.jpg?b64lines=IFRISVMgV0lMTCBTT1VORCBFVkVSWQogVEhSRUUgU0VDT05EUyBVTkxFU1MKIFNPTUVUSElORyBJU04nVCBPS0FZIQ==!
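
One possible way to quiet the log, sketched here under assumptions rather than taken from the actual fix, is to log a checker's status only when it differs from the last value reported. `ChangeLogger` and `report` are hypothetical names; the status strings come from the description above.

```python
import logging

log = logging.getLogger(__name__)


class ChangeLogger(object):
    """Hypothetical wrapper that logs a checker's status only when it changes."""

    def __init__(self):
        self._last = {}

    def report(self, checker_name, status_result):
        """Log status_result if it is new for this checker; return True if logged."""
        if status_result is None:
            return False
        if self._last.get(checker_name) == status_result:
            return False  # Same status as the previous poll: stay quiet.
        self._last[checker_name] = status_result
        log.info('%s reported %s', checker_name, status_result)
        return True
```

In the healthy case this would emit one `TASK_RUNNING` line instead of one every 500ms, while a change to a terminal status is still logged immediately.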





[jira] [Updated] (AURORA-1861) Remove duplicate Snapshot fields for DB stores

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1861:

Fix Version/s: 0.17.0

> Remove duplicate Snapshot fields for DB stores
> --
>
> Key: AURORA-1861
> URL: https://issues.apache.org/jira/browse/AURORA-1861
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: David McLaughlin
>Assignee: David McLaughlin
> Fix For: 0.17.0
>
> Attachments: select-all-job-update-details time.png, 
> snapshot-create-time-only.png, snapshot-total-time.png
>
>
> Currently we double-write any DB-backed stores into a Snapshot struct when 
> creating a Snapshot. This inflates the size of the Snapshot, which is already 
> a problem for large production clusters (see AURORA-74). 
> Example for LockStore from 
> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java:
> {code}
>   new SnapshotField() {
>     // It's important for locks to be replayed first, since there are relations that expect
>     // references to be valid on insertion.
>     @Override
>     public void saveToSnapshot(MutableStoreProvider store, Snapshot snapshot) {
>       snapshot.setLocks(ILock.toBuildersSet(store.getLockStore().fetchLocks()));
>     }
>     @Override
>     public void restoreFromSnapshot(MutableStoreProvider store, Snapshot snapshot) {
>       if (hasDbSnapshot(snapshot)) {
>         LOG.info("Deferring lock restore to dbsnapshot");
>         return;
>       }
>       store.getLockStore().deleteLocks();
>       if (snapshot.isSetLocks()) {
>         for (Lock lock : snapshot.getLocks()) {
>           store.getLockStore().saveLock(ILock.build(lock));
>         }
>       }
>     }
>   },
> {code}
> The saveToSnapshot here is totally redundant as the entire H2 database is 
> dumped into the dbScript field. 
> Note: one major side-effect here is if anyone is trying to read these 
> snapshots and utilize the data outside of Java - they'll lose the ability to 
> process the data without being able to apply the DB script. 





[jira] [Updated] (AURORA-1792) Executor does not log full task information.

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1792:

Fix Version/s: 0.17.0

> Executor does not log full task information.
> 
>
> Key: AURORA-1792
> URL: https://issues.apache.org/jira/browse/AURORA-1792
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Fix For: 0.17.0
>
>
> I launched a task that has an {{initial_interval_secs}} in the health check 
> config. However the log contains no information about this field:
> {noformat}
> $ grep "initial_interval_secs" __main__.log
> {noformat}
> We should log the entire ExecutorInfo blob.





[jira] [Updated] (AURORA-1541) Observer logs are noisy

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1541:

Fix Version/s: 0.17.0

> Observer logs are noisy
> ---
>
> Key: AURORA-1541
> URL: https://issues.apache.org/jira/browse/AURORA-1541
> Project: Aurora
>  Issue Type: Bug
>  Components: Observer
>Reporter: David Robinson
>Assignee: Stephan Erb
>Priority: Minor
> Fix For: 0.17.0
>
>
> The observer's logs consist of lots of warnings about being unable to find 
> PIDs. This is likely due to the checkpoint pointing to PIDs that have been 
> cleaned by Mesos.
> {noformat}
> W1117 20:11:38.103549 33983 process_collector_psutil.py:76] Error during 
> process sampling: no process found with pid 39594
> W1117 20:11:38.151583 33983 process_collector_psutil.py:76] Error during 
> process sampling: no process found with pid 14012
> W1117 20:11:38.232773 33983 process_collector_psutil.py:76] Error during 
> process sampling: no process found with pid 26565
> W1117 20:11:38.486680 33983 process_collector_psutil.py:76] Error during 
> process sampling: no process found with pid 44902
> W1117 20:11:38.612293 33983 process_collector_psutil.py:76] Error during 
> process sampling: no process found with pid 32871
> W1117 20:11:38.694812 33983 process_collector_psutil.py:76] Error during 
> process sampling: no process found with pid 7182
> {noformat}
> The warning messages should probably be debug messages, since Mesos cleaning 
> sandboxes is an expected operation.
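
A minimal sketch of the suggested change, assuming the sampler raises an exception when a PID no longer exists. `NoSuchProcess`, `sample_process` and the `sampler` callable are hypothetical stand-ins, not the observer's real API.

```python
import logging

log = logging.getLogger('observer_sketch')


class NoSuchProcess(Exception):
    """Stand-in for the error a process sampler raises when a PID is gone."""


def sample_process(pid, sampler):
    """Return sampler(pid), or None if the process has already exited."""
    try:
        return sampler(pid)
    except NoSuchProcess as e:
        # Expected once Mesos has cleaned up the sandbox, so log at debug
        # rather than warning.
        log.debug('Error during process sampling: %s', e)
        return None
```

Missing PIDs are still recorded for anyone who turns on debug logging, but they no longer fill the log at the default level.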





[jira] [Updated] (AURORA-1789) Incorrect --mesos_containerizer_path value results in thermos failure loop

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1789:

Fix Version/s: 0.17.0

> Incorrect --mesos_containerizer_path value results in thermos failure loop
> --
>
> Key: AURORA-1789
> URL: https://issues.apache.org/jira/browse/AURORA-1789
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Justin Pinkul
> Fix For: 0.17.0
>
>
> When using the Mesos containerizer with namespaces/pid isolator and a Docker 
> image the Thermos executor is unable to launch processes. The executor tries 
> to fork the process then is unable to locate the process after the fork.
> {code:title=thermos_runner.INFO}
> I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789782.842882)
> I1006 21:37:22.931456 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1144] completed.
> I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789842.935872)
> I1006 21:38:23.025332 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1157] completed.
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789903.029694)
> I1006 21:39:23.118841 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1170] completed.
> I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789963.123206)
> I1006 21:40:23.212711 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1183] completed.
> I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790023.21709)
> I1006 21:41:23.307157 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1196] completed.
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790083.311512)
> I1006 21:42:23.399893 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1209] completed.
> I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> {code}





[jira] [Updated] (AURORA-1793) Revert Commit ca683 which is not backwards compatible

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1793:

Fix Version/s: 0.17.0

> Revert Commit ca683 which is not backwards compatible
> -
>
> Key: AURORA-1793
> URL: https://issues.apache.org/jira/browse/AURORA-1793
> Project: Aurora
>  Issue Type: Bug
>Reporter: Kai Huang
>Assignee: Kai Huang
>Priority: Blocker
> Fix For: 0.17.0
>
>
> The commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 is not backwards 
> compatible. We decided to revert this commit.
> The change that directly causes problems is:
> {code}
> Modify executor state transition logic to rely on health checks (if enabled).
> commit ca683cb9e27bae76424a687bc6c3af5a73c501b9
> {code}
> There are two downstream commits that depend on the above commit:
> {code}
> Add min_consecutive_health_checks in HealthCheckConfig
> commit ed72b1bf662d1e29d2bb483b317c787630c26a9e
> {code}
> {code}
> Add support for receiving min_consecutive_successes in health checker
> commit e91130e49445c3933b6e27f5fde18c3a0e61b87a
> {code}
> We will drop all three of these commits and revert back to one commit before 
> the problematic commit:
> {code}
> Running task ssh without an instance should pick a random instance
> commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c
> {code}





[jira] [Updated] (AURORA-1798) resolv.conf is not copied when using the Mesos containerizer with a Docker image

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1798:

Fix Version/s: 0.17.0

> resolv.conf is not copied when using the Mesos containerizer with a Docker 
> image
> 
>
> Key: AURORA-1798
> URL: https://issues.apache.org/jira/browse/AURORA-1798
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Justin Pinkul
> Fix For: 0.17.0
>
>
> When Thermos launches a task using a Docker image it mounts the image as a 
> volume and manually chroots into it. One consequence of this is the logic 
> inside of the {{network/cni}} isolator that copies {{resolv.conf}} from the 
> host into the new rootfs is bypassed. The Thermos executor should manually 
> copy this file into the rootfs until Mesos pod support is implemented.
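
The manual copy described above could look roughly like the following. This is a sketch under assumptions: `copy_resolv_conf` is a hypothetical helper, and the real executor may need to handle permissions, symlinks or an existing file differently.

```python
import os
import shutil


def copy_resolv_conf(rootfs, source='/etc/resolv.conf'):
    """Copy the host's resolv.conf to <rootfs>/etc/resolv.conf and return the target path."""
    target_dir = os.path.join(rootfs, 'etc')
    # The image may not ship an /etc directory, so create it if needed.
    if not os.path.isdir(target_dir):
        os.makedirs(target_dir)
    target = os.path.join(target_dir, 'resolv.conf')
    shutil.copy(source, target)
    return target
```

The copy would have to happen before the chroot, while the host's file is still reachable at its original path.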





[jira] [Updated] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1225:

Fix Version/s: 0.17.0

> Modify executor state transition logic to rely on health checks (if enabled)
> 
>
> Key: AURORA-1225
> URL: https://issues.apache.org/jira/browse/AURORA-1225
> Project: Aurora
>  Issue Type: Task
>  Components: Executor
>Reporter: Maxim Khutornenko
>Assignee: Santhosh Kumar Shanmugham
> Fix For: 0.17.0
>
>
> Executor needs to start executing user content in STARTING and transition to 
> RUNNING once the required number of successful health checks is reached.





[jira] [Updated] (AURORA-1791) Commit ca683 is not backwards compatible.

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1791:

Fix Version/s: 0.17.0

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
> Fix For: 0.17.0
>
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can fail is at the 10th second.
> On master, health checking starts right away, which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}
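
The difference between the two meanings of `initial_interval_secs` can be shown with a simplified timing model. This is not the executor's scheduling code: `check_times` and its `delay_first_check` flag are hypothetical, and real checks are subject to timeouts and jitter.

```python
def check_times(initial_interval_secs, interval_secs, count, delay_first_check=True):
    """Return the times (seconds after task start) of the first `count` health checks.

    delay_first_check=True models the 0.16.0 behaviour described above, where
    no check runs before initial_interval_secs. False models checks that start
    right away.
    """
    start = initial_interval_secs if delay_first_check else 0
    return [start + i * interval_secs for i in range(count)]
```

For the config quoted above, `check_times(10, 5, 3)` gives `[10, 15, 20]`, while `check_times(10, 5, 3, delay_first_check=False)` gives `[0, 5, 10]`. A service that needs a few seconds to boot can therefore be probed, and fail, before it is ready.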





[jira] [Updated] (AURORA-1795) Internal server error in scheduler Thrift API on missing Content-Type

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1795:

Fix Version/s: 0.17.0

> Internal server error in scheduler Thrift API on missing Content-Type
> -
>
> Key: AURORA-1795
> URL: https://issues.apache.org/jira/browse/AURORA-1795
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 0.16.0
>Reporter: Stephan Erb
>Assignee: Zameer Manji
> Fix For: 0.17.0
>
>
> This happens if a user has a very old browser, i.e. Firefox 41.
> {code}
> I1017 09:38:15.618 [qtp1426166274-44336, Slf4jRequestLog:60] 10.x.x.x - - 
> [17/Oct/2016:09:38:15 +] "POST //foobar.example.org/api HTTP/1.1" 200 794
> W1017 09:38:15.627 [qtp1426166274-44066, ServletHandler:631] /api 
> java.lang.NullPointerException: null
> at java.util.Objects.requireNonNull(Objects.java:203) 
> ~[na:1.8.0-internal]
> at java.util.Optional.<init>(Optional.java:96) ~[na:1.8.0-internal]
> at java.util.Optional.of(Optional.java:108) ~[na:1.8.0-internal]
> at 
> org.apache.aurora.scheduler.http.api.TContentAwareServlet.doPost(TContentAwareServlet.java:123)
>  ~[aurora-0.16.0.jar:na]
> at 
> org.apache.aurora.scheduler.http.api.TContentAwareServlet.doGet(TContentAwareServlet.java:164)
>  ~[aurora-0.16.0.jar:na]
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) 
> ~[javax.servlet-api-3.1.0.jar:3.1.0]
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) 
> ~[javax.servlet-api-3.1.0.jar:3.1.0]
> at 
> com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> org.apache.aurora.scheduler.http.LeaderRedirectFilter.doFilter(LeaderRedirectFilter.java:72)
>  ~[aurora-0.16.0.jar:na]
> at 
> org.apache.aurora.scheduler.http.AbstractFilter.doFilter(AbstractFilter.java:44)
>  ~[aurora-0.16.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> org.apache.aurora.scheduler.http.HttpStatsFilter.doFilter(HttpStatsFilter.java:71)
>  ~[aurora-0.16.0.jar:na]
> at 
> org.apache.aurora.scheduler.http.AbstractFilter.doFilter(AbstractFilter.java:44)
>  ~[aurora-0.16.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> 

[jira] [Updated] (AURORA-655) Order job update events and instance events by ID rather than timestamp

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-655:
---
Fix Version/s: 0.17.0

> Order job update events and instance events by ID rather than timestamp
> ---
>
> Key: AURORA-655
> URL: https://issues.apache.org/jira/browse/AURORA-655
> Project: Aurora
>  Issue Type: Story
>  Components: Scheduler
>Reporter: Bill Farner
>Assignee: Jing Chen
>Priority: Trivial
>  Labels: newbie
> Fix For: 0.17.0
>
>
> In {{JobUpdateDetailsMapper.xml}} we order by timestamps, which could be 
> brittle if the system time changes.  Instead of using the timestamp, use the 
> built-in database {{IDENTITY}} for sort order.
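
The brittleness is easy to reproduce with an in-memory database. The sketch below uses SQLite's `AUTOINCREMENT` as a stand-in for the H2 `IDENTITY` column; the table and column names are hypothetical and are not taken from `JobUpdateDetailsMapper.xml`.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute(
    'CREATE TABLE job_update_events ('
    'id INTEGER PRIMARY KEY AUTOINCREMENT, name TEXT, timestamp_ms INTEGER)')

# Events are inserted in their true order, but the system clock steps
# backwards before the third one is written.
events = [('event-1', 2000), ('event-2', 3000), ('event-3', 1000)]
conn.executemany(
    'INSERT INTO job_update_events (name, timestamp_ms) VALUES (?, ?)', events)

by_timestamp = [row[0] for row in conn.execute(
    'SELECT name FROM job_update_events ORDER BY timestamp_ms')]
by_id = [row[0] for row in conn.execute(
    'SELECT name FROM job_update_events ORDER BY id')]
```

Here `by_id` keeps the insertion order, while `by_timestamp` puts `event-3` first because of the clock change.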





[jira] [Updated] (AURORA-1684) Cron tasks are sanitized multiple times (once when being created via the API, and again when actually being triggered)

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1684:

Fix Version/s: 0.17.0

> Cron tasks are sanitized multiple times (once when being created via the API, 
> and again when actually being triggered)
> --
>
> Key: AURORA-1684
> URL: https://issues.apache.org/jira/browse/AURORA-1684
> Project: Aurora
>  Issue Type: Bug
>Reporter: Steve Niemitz
>Assignee: Steve Niemitz
> Fix For: 0.17.0
>
>
> This can cause issues in the following scenario:
> - An operator sets default_docker_parameters on the scheduler
> - The operator DOES NOT allow docker parameters (via allow_docker_parameters)
> - A user schedules a cron job using a docker container.
> Because the first pass of ConfigurationManager.validateAndPopulate will 
> mutate the task to have docker parameters (the defaults), the second pass in 
> SanitizedCronJob.fromUnsanitized will fail validation.
> A solution here may be to remove fromUnsanitized and instead pass the job 
> configuration directly, since we know it will always be safe.
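
The scenario above can be sketched in a few lines. This is only an illustration of a validation pass that is not idempotent; the function below is a hypothetical stand-in and makes assumptions about what ConfigurationManager.validateAndPopulate does beyond what the description states. The parameter values are invented.

```python
class TaskDescriptionError(Exception):
    """Stand-in for the scheduler's TaskDescriptionException."""


def validate_and_populate(task, default_docker_parameters, allow_docker_parameters):
    """Reject user-supplied docker parameters, then fill in the operator defaults."""
    params = list(task.get("docker_parameters", []))
    if params and not allow_docker_parameters:
        raise TaskDescriptionError("Docker parameters not allowed.")
    if not params:
        # The populated defaults look exactly like user-supplied parameters
        # to any later validation pass.
        params = list(default_docker_parameters)
    populated = dict(task)
    populated["docker_parameters"] = params
    return populated


def sanitize_twice(task, defaults):
    """First pass as done at creation time, second pass as done at trigger time."""
    once = validate_and_populate(task, defaults, allow_docker_parameters=False)
    return validate_and_populate(once, defaults, allow_docker_parameters=False)
```

Calling sanitize_twice with a task that carries no docker parameters and a non-empty defaults list raises on the second pass, which matches the failure described here. Passing the already sanitized job along, instead of validating it again, avoids the second pass entirely.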
> {code}
> W0427 17:01:35.286 [QuartzScheduler_Worker-5, AuroraCronJob:134] Invalid cron 
> job for IJobKey{role=tcdc-infra, environment=prod, 
> name=security-group-alerter} in storage - failed to parse with {} 
> org.apache.aurora.scheduler.configuration.ConfigurationManager$TaskDescriptionException:
>  Docker parameters not allowed.
>   at 
> org.apache.aurora.scheduler.configuration.ConfigurationManager.validateAndPopulate(ConfigurationManager.java:249)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.configuration.ConfigurationManager.validateAndPopulate(ConfigurationManager.java:166)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.configuration.SanitizedConfiguration.fromUnsanitized(SanitizedConfiguration.java:60)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.cron.SanitizedCronJob.<init>(SanitizedCronJob.java:45)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.cron.SanitizedCronJob.fromUnsanitized(SanitizedCronJob.java:102)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.cron.quartz.AuroraCronJob.lambda$doExecute$163(AuroraCronJob.java:132)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.log.LogStorage.lambda$doInTransaction$222(LogStorage.java:524)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.db.DbStorage.transactionedWrite(DbStorage.java:160)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$dd3bfcb4.CGLIB$transactionedWrite$2()
>  ~[guice-3.0.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$dd3bfcb4$$FastClassByGuice$$e3e3ff55.invoke()
>  ~[guice-3.0.jar:na]
>   at 
> com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
>  ~[guice-3.0.jar:na]
>   at 
> com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
>  ~[guice-3.0.jar:na]
>   at 
> org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101)
>  ~[mybatis-guice-3.7.jar:3.7]
>   at 
> com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
>  ~[guice-3.0.jar:na]
>   at 
> com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52)
>  ~[guice-3.0.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$dd3bfcb4.transactionedWrite()
>  ~[guice-3.0.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.db.DbStorage.lambda$write$188(DbStorage.java:174)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.async.GatingDelayExecutor.closeDuring(GatingDelayExecutor.java:62)
>  ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:172) 
> ~[aurora-0.13.0-SNAPSHOT.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$dd3bfcb4.CGLIB$write$3()
>  ~[guice-3.0.jar:na]
>   at 
> org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$dd3bfcb4$$FastClassByGuice$$e3e3ff55.invoke()
>  ~[guice-3.0.jar:na]
>   at 
> com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228)
>  ~[guice-3.0.jar:na]
>   at 
> com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72)
>  

[jira] [Updated] (AURORA-1794) Scheduler fails to start if -enable_revocable_ram is toggled

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1794:

Fix Version/s: 0.17.0

> Scheduler fails to start if -enable_revocable_ram is toggled
> 
>
> Key: AURORA-1794
> URL: https://issues.apache.org/jira/browse/AURORA-1794
> Project: Aurora
>  Issue Type: Story
>Affects Versions: 0.16.0
>Reporter: Stephan Erb
>Assignee: Stephan Erb
> Fix For: 0.17.0
>
>
> The scheduler does not start if {{-enable_revocable_ram}} is set:
> {code}
> Exception in thread "main" java.lang.IllegalStateException: A value cannot be 
> changed after it was read.
> at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:174)
> at org.apache.aurora.common.args.Arg.set(Arg.java:54)
> at 
> org.apache.aurora.common.args.ArgumentInfo.setValue(ArgumentInfo.java:128)
> at org.apache.aurora.common.args.OptionInfo.load(OptionInfo.java:131)
> at 
> org.apache.aurora.common.args.ArgScanner.process(ArgScanner.java:368)
> at org.apache.aurora.common.args.ArgScanner.parse(ArgScanner.java:200)
> at org.apache.aurora.common.args.ArgScanner.parse(ArgScanner.java:178)
> at org.apache.aurora.common.args.ArgScanner.parse(ArgScanner.java:155)
> at 
> org.apache.aurora.scheduler.app.SchedulerMain.applyStaticArgumentValues(SchedulerMain.java:226)
> at 
> org.apache.aurora.scheduler.app.SchedulerMain.main(SchedulerMain.java:197)
> {code}
> This is an unfortunate oversight at my end. When introducing the feature, I 
> deferred the e2e test. It 'worked' in a manual test - at least that is what I 
> believed. Probably, I had only added the flag to the config in the repo, but 
> not to the one that was actually started in vagrant.
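
The guard behind the exception message can be sketched as follows. This is a simplified, hypothetical model of an argument holder that refuses writes after its first read; it is not the code of org.apache.aurora.common.args.Arg, and the description does not say which code path read the value early.

```python
class Arg:
    """Holds a command line argument value that may be set only until first read."""

    def __init__(self, default):
        self._value = default
        self._was_read = False

    def get(self):
        # Any read freezes the value for the rest of the process lifetime.
        self._was_read = True
        return self._value

    def set(self, value):
        if self._was_read:
            raise RuntimeError("A value cannot be changed after it was read.")
        self._value = value
```

If anything reads the default of such a holder before the argument scanner applies the command line values, the later set call fails at startup, which is the shape of the stack trace above.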





[jira] [Updated] (AURORA-1880) How to set the environment variable for Mesos Containerizer?

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1880:

Fix Version/s: 0.17.0

> How to set the environment variable for Mesos Containerizer?
> 
>
> Key: AURORA-1880
> URL: https://issues.apache.org/jira/browse/AURORA-1880
> Project: Aurora
>  Issue Type: Bug
>Affects Versions: 0.16.0, 0.15.0
>Reporter: jackyoh
> Fix For: 0.17.0
>
>
> I'm running a Docker container on the Aurora framework.
> The question is: how to set the environment variable for Mesos Containerizer?
> For example:
> docker run -e ENV1=env1 ...





[jira] [Updated] (AURORA-1110) Running task ssh without an instance should pick a random instance

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1110:

Fix Version/s: 0.17.0

> Running task ssh without an instance should pick a random instance
> --
>
> Key: AURORA-1110
> URL: https://issues.apache.org/jira/browse/AURORA-1110
> Project: Aurora
>  Issue Type: Story
>  Components: Client
>Reporter: Joshua Cohen
>Assignee: Jing Chen
>Priority: Trivial
>  Labels: newbie
> Fix For: 0.17.0
>
>
> I always forget to add an instance to the end of the job key when ssh'ing. It 
> might be nice if running {{aurora task ssh ...}} without specifying an 
> instance either picked a random instance or just defaulted to instance 0.
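
Both fallbacks mentioned above can be sketched as a small resolver. The function name, its parameters, and the assumption that a task key looks like cluster/role/env/name with an optional trailing /instance are illustrative only; this is not the Aurora client's code.

```python
import random


def resolve_instance(task_key, active_instances, pick_random=False, rng=None):
    """Return the instance id to ssh into for a job key or task key."""
    parts = task_key.split("/")
    if len(parts) == 5:
        # An explicit instance always wins.
        return int(parts[4])
    if len(parts) != 4:
        raise ValueError("expected cluster/role/env/name[/instance]: " + task_key)
    if pick_random:
        if not active_instances:
            raise ValueError("job has no active instances")
        chooser = rng if rng is not None else random
        return chooser.choice(sorted(active_instances))
    # Simplest fallback: default to instance 0.
    return 0
```

Passing a seeded random.Random as rng keeps the random choice reproducible in tests.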





[jira] [Updated] (AURORA-1875) The thriftw compatibility thrift binary check is too loose

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1875:

Fix Version/s: 0.17.0

> The thriftw compatibility thrift binary check is too loose
> --
>
> Key: AURORA-1875
> URL: https://issues.apache.org/jira/browse/AURORA-1875
> Project: Aurora
>  Issue Type: Bug
>Reporter: John Sirois
>Assignee: John Sirois
> Fix For: 0.17.0
>
>
> Right now the 
> [check|https://github.com/apache/aurora/blob/master/build-support/thrift/thriftw#L31]
>  is only for the proper version. We need to also check java and python 
> codegen are both supported by the binary since we use both.





[jira] [Updated] (AURORA-1858) Expose stats on offers known to scheduler

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1858:

Fix Version/s: 0.17.0

> Expose stats on offers known to scheduler
> -
>
> Key: AURORA-1858
> URL: https://issues.apache.org/jira/browse/AURORA-1858
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: newbie
> Fix For: 0.17.0
>
>
> Expose stats on the number of offers tracked by {{OfferManager}}. This can 
> simply be defined as a collection size gauge on the {{offers}} set.
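
A collection size gauge is a stat that samples the size of a live collection each time it is read, rather than being incremented and decremented by hand. A minimal sketch, with a hypothetical SizeGauge class and stat name that are not part of the scheduler's stats library:

```python
class SizeGauge:
    """Reports the current size of a collection whenever the stat is sampled."""

    def __init__(self, name, collection):
        self.name = name
        # Keep a reference, not a copy, so later mutations are visible.
        self._collection = collection

    def read(self):
        return len(self._collection)


# Hypothetical stand-in for the set of offers held by OfferManager.
offers = set()
offers_gauge = SizeGauge("outstanding_offers", offers)
```

Because the gauge only holds a reference, adding or removing offers needs no extra bookkeeping; the next read reflects the change.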





[jira] [Updated] (AURORA-1823) `createJob` API uses single thread to move all tasks to PENDING

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1823:

Fix Version/s: 0.17.0

> `createJob` API uses single thread to move all tasks to PENDING 
> 
>
> Key: AURORA-1823
> URL: https://issues.apache.org/jira/browse/AURORA-1823
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>Priority: Minor
> Fix For: 0.17.0
>
>
> If you create a single job with many tasks (let's say 10k+) the `createJob` 
> API will take a long time. This is because the `createJob` API only returns 
> when all of the tasks have moved to PENDING and it uses a single thread to do 
> so. Here is a snippet of the logs:
> {noformat}
> ...
> I1116 17:11:53.964 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
>  state machine transition INIT -> PENDING
> I1116 17:11:53.965 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
> I1116 17:11:54.094 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
>  state machine transition INIT -> PENDING
> I1116 17:11:54.094 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
> I1116 17:11:54.223 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
>  state machine transition INIT -> PENDING
> I1116 17:11:54.224 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
> I1116 17:11:54.353 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
>  state machine transition INIT -> PENDING
> I1116 17:11:54.353 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
> I1116 17:11:54.482 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
>  state machine transition INIT -> PENDING
> I1116 17:11:54.482 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
> I1116 17:11:54.611 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
>  state machine transition INIT -> PENDING
> I1116 17:11:54.612 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
> I1116 17:11:54.741 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
>  state machine transition INIT -> PENDING
> I1116 17:11:54.742 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
> ...
> {noformat}
> Observe that a single jetty thread is doing this.
> We should leverage {{BatchWorker}} to have concurrent mutations here.
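
The general idea of splitting the transitions into batches and running them concurrently can be sketched as below. This does not model {{BatchWorker}} or its API; the function names, batch size, and worker count are assumptions chosen for illustration, and the transition itself is reduced to a dictionary entry.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Yield consecutive slices of items with at most size elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def move_to_pending(task_ids, batch_size=100, workers=4):
    """Transition all tasks INIT -> PENDING, one batch per worker call."""

    def transition(batch):
        # Placeholder for the per-batch state change and SAVE_STATE work.
        return {task_id: "PENDING" for task_id in batch}

    states = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Results are merged in the calling thread, so no locking is needed here.
        for result in pool.map(transition, chunked(task_ids, batch_size)):
            states.update(result)
    return states
```

The call still returns only after every task is PENDING, so the API contract described above is unchanged; only the work inside is spread over several threads.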





[jira] [Updated] (AURORA-1737) Descheduling a cron job checks role access before job key existence

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1737:

Fix Version/s: 0.17.0

> Descheduling a cron job checks role access before job key existence
> ---
>
> Key: AURORA-1737
> URL: https://issues.apache.org/jira/browse/AURORA-1737
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Joshua Cohen
>Assignee: Jing Chen
>Priority: Minor
> Fix For: 0.17.0
>
>
> Trying to deschedule a cron job for a non-existent role returns a permission 
> error rather than a no-such-job error. This leads to confusion for users in 
> the event of a typo in the role.
> Given that jobs are world-readable, we should check for a valid job key 
> before applying permissions.
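
The proposed ordering of the two checks can be sketched as follows. The exception classes, the tuple representation of a job key, and the function signature are hypothetical and only illustrate the order of the checks, not the scheduler's implementation.

```python
class NoSuchJobError(Exception):
    """The job key does not refer to a known cron job."""


class PermissionDeniedError(Exception):
    """The caller may not act on jobs of the given role."""


def deschedule_cron(job_key, caller_roles, cron_jobs):
    """Remove job_key, given as (role, environment, name), from cron_jobs."""
    # Existence first: jobs are world-readable, so this leaks nothing.
    if job_key not in cron_jobs:
        raise NoSuchJobError("no cron job found for " + "/".join(job_key))
    role = job_key[0]
    if role not in caller_roles:
        raise PermissionDeniedError("not permitted to act as role " + role)
    cron_jobs.remove(job_key)
```

With this order, a typo in the role produces a no-such-job error instead of a permission error.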





[jira] [Updated] (AURORA-1787) `-global_container_mounts` does not appear to work with the unified containerizer

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-1787:

Fix Version/s: 0.17.0

> `-global_container_mounts` does not appear to work with the unified 
> containerizer
> -
>
> Key: AURORA-1787
> URL: https://issues.apache.org/jira/browse/AURORA-1787
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Priority: Critical
> Fix For: 0.17.0
>
>
> Perhaps I misunderstand how this feature is supposed to be used, but apply 
> the following patch to master:
> {noformat}
> From 1ebb5f4c5815c647e31f3253d5e5c316a0d5edd2 Mon Sep 17 00:00:00 2001
> From: Zameer Manji 
> Date: Tue, 4 Oct 2016 20:45:41 -0700
> Subject: [PATCH] Reproduce the issue.
> ---
>  examples/vagrant/upstart/aurora-scheduler.conf |  2 +-
>  src/test/sh/org/apache/aurora/e2e/run-server.sh|  4 
>  .../sh/org/apache/aurora/e2e/test_end_to_end.sh| 26 
> +++---
>  3 files changed, 18 insertions(+), 14 deletions(-)
> diff --git a/examples/vagrant/upstart/aurora-scheduler.conf 
> b/examples/vagrant/upstart/aurora-scheduler.conf
> index 91b27d7..851b5a1 100644
> --- a/examples/vagrant/upstart/aurora-scheduler.conf
> +++ b/examples/vagrant/upstart/aurora-scheduler.conf
> @@ -40,7 +40,7 @@ exec bin/aurora-scheduler \
>-native_log_file_path=/var/db/aurora \
>-backup_dir=/var/lib/aurora/backups \
>-thermos_executor_path=$DIST_DIR/thermos_executor.pex \
> -  
> -global_container_mounts=/home/vagrant/aurora/examples/vagrant/config:/home/vagrant/aurora/examples/vagrant/config:ro
>  \
> +  -global_container_mounts=/etc/rsyslog.d:rsyslog.d.container:ro \
>-thermos_executor_flags="--announcer-ensemble localhost:2181 
> --announcer-zookeeper-auth-config 
> /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json 
> --mesos-containerizer-path=/usr/libexec/mesos/mesos-containerizer" \
>-allowed_container_types=MESOS,DOCKER \
>-http_authentication_mechanism=BASIC \
> diff --git a/src/test/sh/org/apache/aurora/e2e/run-server.sh 
> b/src/test/sh/org/apache/aurora/e2e/run-server.sh
> index 1fe0909..a0ee76f 100755
> --- a/src/test/sh/org/apache/aurora/e2e/run-server.sh
> +++ b/src/test/sh/org/apache/aurora/e2e/run-server.sh
> @@ -1,6 +1,10 @@
>  #!/bin/bash
>  
>  echo "Starting up server..."
> +if [ ! -d "./rsyslog.d.container" ]; then
> +  echo "Mountpoint Doesn't Exist";
> +  exit 1;
> +fi
>  while true
>  do
>echo -e "HTTP/1.1 200 OK\r\n\r\nHello from a filesystem image." | nc -l 
> "$1"
> diff --git a/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh 
> b/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> index c93be9b..094d776 100755
> --- a/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> +++ b/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh
> @@ -514,27 +514,27 @@ trap collect_result EXIT
>  aurorabuild all
>  setup_ssh
>  
> -test_version
> -test_http_example "${TEST_JOB_ARGS[@]}"
> -test_health_check
> +# test_version
> +# test_http_example "${TEST_JOB_ARGS[@]}"
> +# test_health_check
>  
> -test_http_example_basic "${TEST_JOB_REVOCABLE_ARGS[@]}"
> +# test_http_example_basic "${TEST_JOB_REVOCABLE_ARGS[@]}"
>  
> -test_http_example_basic "${TEST_JOB_GPU_ARGS[@]}"
> +# test_http_example_basic "${TEST_JOB_GPU_ARGS[@]}"
>  
>  # build the test docker image
> -sudo docker build -t http_example -f "${TEST_ROOT}/Dockerfile.python" 
> ${TEST_ROOT}
> -test_http_example "${TEST_JOB_DOCKER_ARGS[@]}"
> +# sudo docker build -t http_example -f "${TEST_ROOT}/Dockerfile.python" 
> ${TEST_ROOT}
> +# test_http_example "${TEST_JOB_DOCKER_ARGS[@]}"
>  
>  setup_image_stores
>  test_appc_unified
> -test_docker_unified
> +# test_docker_unified
>  
> -test_admin "${TEST_ADMIN_ARGS[@]}"
> -test_basic_auth_unauthenticated  "${TEST_JOB_ARGS[@]}"
> +# test_admin "${TEST_ADMIN_ARGS[@]}"
> +# test_basic_auth_unauthenticated  "${TEST_JOB_ARGS[@]}"
>  
> -test_ephemeral_daemon_with_final 
> "${TEST_JOB_EPHEMERAL_DAEMON_WITH_FINAL_ARGS[@]}"
> +# test_ephemeral_daemon_with_final 
> "${TEST_JOB_EPHEMERAL_DAEMON_WITH_FINAL_ARGS[@]}"
>  
> -/vagrant/src/test/sh/org/apache/aurora/e2e/test_kerberos_end_to_end.sh
> -/vagrant/src/test/sh/org/apache/aurora/e2e/test_bypass_leader_redirect_end_to_end.sh
> +# /vagrant/src/test/sh/org/apache/aurora/e2e/test_kerberos_end_to_end.sh
> +# 
> /vagrant/src/test/sh/org/apache/aurora/e2e/test_bypass_leader_redirect_end_to_end.sh
>  RETCODE=0
> -- 
> 2.10.0
> {noformat}
> You can apply the patch by copying the content to a {{.patch}} file and 
> running {{git am < file.patch}}
> Run the e2e tests.
> Observe that the tests fail because the tasks fail. The tasks fail because 
> the mountpoint in their sandbox does not exist.
> I observe the correct 

[jira] [Updated] (AURORA-894) Server updater should watch healthy instances

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-894:
---
Fix Version/s: 0.17.0

> Server updater should watch healthy instances
> -
>
> Key: AURORA-894
> URL: https://issues.apache.org/jira/browse/AURORA-894
> Project: Aurora
>  Issue Type: Epic
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>  Labels: 2015-Q2
> Fix For: 0.17.0
>
>
> Instead of starting the {{minWaitInInstanceRunningMs}} (aka {{watch_secs}}) 
> countdown when an instance reaches RUNNING state, the updater should rely on 
> the first successful health check instead. This will potentially speed up 
> updates as the {{minWaitInInstanceRunningMs}} will no longer have to be 
> chosen based on the worst observed instance startup/warmup delay but rather 
> as a desired health check duration according to the following formula:
> {noformat}
> minWaitInInstanceRunningMs = interval_secs x num_desired_healthchecks x 1000
> {noformat}
> where:
>   {{interval_secs}} - 
> https://github.com/apache/incubator-aurora/blob/master/docs/configuration-reference.md#healthcheckconfig-objects
>   {{num_desired_healthchecks}} - the desired number of OK health checks to 
> observe before declaring an instance updated successfully
>   
> The above would allow every instance to start its watch interval depending on 
> its individual performance and potentially let the updater exit earlier. 
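
As a worked example of the formula above, a small helper can compute the wait time. The function name and the sample values are made up for illustration.

```python
def min_wait_in_instance_running_ms(interval_secs, num_desired_healthchecks):
    """Apply: interval_secs x num_desired_healthchecks x 1000."""
    if interval_secs <= 0 or num_desired_healthchecks <= 0:
        raise ValueError("both inputs must be positive")
    return int(interval_secs * num_desired_healthchecks * 1000)


# Example: health checks every 10 seconds, 3 OK checks desired.
example_ms = min_wait_in_instance_running_ms(10, 3)
```

With a 10 second interval and 3 desired health checks, the watch window comes out at 30000 ms, regardless of how long the slowest instance takes to warm up.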





[jira] [Updated] (AURORA-343) HTTP thrift service is not over SSL

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-343:
---
Fix Version/s: 0.17.0

> HTTP thrift service is not over SSL
> ---
>
> Key: AURORA-343
> URL: https://issues.apache.org/jira/browse/AURORA-343
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Bill Farner
>Assignee: Stephan Erb
>Priority: Minor
>  Labels: newbie
> Fix For: 0.17.0
>
>
> {{SchedulerAPIServlet}} is bound against the default debug HTTP server, which 
> is non-encrypted.  This leaves the door open to snooping.





[jira] [Updated] (AURORA-133) write_lock_wait_nanos stat is misleading and of little use

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb updated AURORA-133:
---
Fix Version/s: 0.17.0

> write_lock_wait_nanos stat is misleading and of little use
> --
>
> Key: AURORA-133
> URL: https://issues.apache.org/jira/browse/AURORA-133
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Bill Farner
>Priority: Minor
> Fix For: 0.17.0
>
>
> {{write_lock_wait_nanos}} is not useful since the intrinsic lock on 
> {{LogStorage}} will be contended for and held by the time the read/write lock 
> is acquired.





[jira] [Reopened] (AURORA-1712) Debian Jessie packages are embedding the mesos egg build for Ubuntu trusty

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb reopened AURORA-1712:
-

This bug is only fixed once we adapt the packaging scripts.

> Debian Jessie packages are embedding the mesos egg build for Ubuntu trusty
> ---
>
> Key: AURORA-1712
> URL: https://issues.apache.org/jira/browse/AURORA-1712
> Project: Aurora
>  Issue Type: Bug
>Reporter: Stephan Erb
>Assignee: Renan DelValle
>
> The Debian packaging scripts for Trusty and Jessie share the same 
> override mechanism for the pants third_party repository. We therefore end up 
> using egg files built for Ubuntu on Debian as well 
> (https://github.com/apache/aurora-packaging/blob/master/specs/debian/aurora-pants.ini).
> This appears to mostly work, but is clearly not optimal.
> We should extend 
> https://github.com/apache/aurora/blob/master/build-support/python/make-mesos-native-egg
>  to support Debian and then make use of it in our packaging infrastructure 
> https://github.com/apache/aurora-packaging.





[jira] [Resolved] (AURORA-894) Server updater should watch healthy instances

2017-01-31 Thread Stephan Erb (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Stephan Erb resolved AURORA-894.

Resolution: Fixed

> Server updater should watch healthy instances
> -
>
> Key: AURORA-894
> URL: https://issues.apache.org/jira/browse/AURORA-894
> Project: Aurora
>  Issue Type: Epic
>  Components: Scheduler
>Reporter: Maxim Khutornenko
>  Labels: 2015-Q2
>
> Instead of starting the {{minWaitInInstanceRunningMs}} (aka {{watch_secs}}) 
> countdown when an instance reaches RUNNING state, the updater should rely on 
> the first successful health check instead. This will potentially speed up 
> updates as the {{minWaitInInstanceRunningMs}} will no longer have to be 
> chosen based on the worst observed instance startup/warmup delay but rather 
> as a desired health check duration according to the following formula:
> {noformat}
> minWaitInInstanceRunningMs = interval_secs x num_desired_healthchecks x 1000
> {noformat}
> where:
>   {{interval_secs}} - 
> https://github.com/apache/incubator-aurora/blob/master/docs/configuration-reference.md#healthcheckconfig-objects
>   {{num_desired_healthchecks}} - the desired number of OK health checks to 
> observe before declaring an instance updated successfully
>   
> The above would allow every instance to start its watch interval depending on 
> its individual performance and potentially let the updater exit earlier. 


