[jira] [Updated] (AURORA-1788) vagrant up does not properly configure network adapters
[ https://issues.apache.org/jira/browse/AURORA-1788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Erb updated AURORA-1788:
--------------------------------
    Fix Version/s: 0.17.0

> vagrant up does not properly configure network adapters
> -------------------------------------------------------
>
>                 Key: AURORA-1788
>                 URL: https://issues.apache.org/jira/browse/AURORA-1788
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Andrew Jorgensen
>            Assignee: Andrew Jorgensen
>             Fix For: 0.17.0
>
>
> I am not sure of the specifics of why this happens, but on Vagrant 1.8.6 the
> network interface does not come up correctly and the private_network address
> is attached to the eth0 NAT interface rather than the host-only interface. I
> tried a number of different parameters, but none of them configured the
> network appropriately. This change manually configures the static IP so that
> it is connected to the correct adapter. Without this change I could not
> access the Aurora web interface when running vagrant up.
> I've created a patch here: https://reviews.apache.org/r/52609/
> This is what the configuration looks like when run off master:
> {code}
> ip addr
> 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever
>     inet6 ::1/128 scope host
>        valid_lft forever preferred_lft forever
> 2: eth0: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
>     link/ether 08:00:27:b3:1b:30 brd ff:ff:ff:ff:ff:ff
>     inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
>        valid_lft forever preferred_lft forever
>     inet 192.168.33.7/24 brd 192.168.33.255 scope global eth1
>        valid_lft forever preferred_lft forever
>     inet6 fe80::a00:27ff:feb3:1b30/64 scope link
>        valid_lft forever preferred_lft forever
> 3: eth1: mtu 1500 qdisc pfifo_fast state DOWN group default
>     link/ether 08:00:27:7c:4e:72 brd ff:ff:ff:ff:ff:ff
> 4: docker0: mtu 1500 qdisc noqueue state DOWN group default
>     link/ether 02:42:f6:de:a3:ca brd ff:ff:ff:ff:ff:ff
>     inet 172.17.0.1/16 scope global docker0
>        valid_lft forever preferred_lft forever
> {code}
> Here is what it is supposed to look like:
> {code}
> ip addr
> 1: lo: mtu 65536 qdisc noqueue state UNKNOWN group default
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
>     inet 127.0.0.1/8 scope host lo
>        valid_lft forever preferred_lft forever
>     inet6 ::1/128 scope host
>        valid_lft forever preferred_lft forever
> 2: eth0: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
>     link/ether 08:00:27:b3:1b:30 brd ff:ff:ff:ff:ff:ff
>     inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
>        valid_lft forever preferred_lft forever
>     inet6 fe80::a00:27ff:feb3:1b30/64 scope link
>        valid_lft forever preferred_lft forever
> 3: eth1: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
>     link/ether 08:00:27:7c:4e:72 brd ff:ff:ff:ff:ff:ff
>     inet 192.168.33.7/24 brd 192.168.33.255 scope global eth1
>        valid_lft forever preferred_lft forever
>     inet6 fe80::a00:27ff:fe7c:4e72/64 scope link
>        valid_lft forever preferred_lft forever
> 4: docker0: mtu 1500 qdisc noqueue state DOWN group default
>     link/ether 02:42:f6:de:a3:ca brd ff:ff:ff:ff:ff:ff
>     inet 172.17.0.1/16 scope global docker0
>        valid_lft forever preferred_lft forever
> {code}
> Steps to reproduce:
> 1. Update to Vagrant 1.8.6 (unsure whether previous versions are affected as well)
> 2. Run `vagrant up`
> 3. Try to visit http://192.168.33.7:8081
> Expected outcome:
> I expect that by following the steps in
> http://aurora.apache.org/documentation/latest/getting-started/vagrant/ I
> would be able to visit the web interface for Aurora.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
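The misconfiguration above can be spotted mechanically: the private address should sit under the host-only adapter (eth1), not under the NAT adapter (eth0). A minimal sketch in Python (the parsing helper is illustrative, not part of Aurora) that maps each interface block of `ip addr` output to the IPv4 addresses listed under it:

```python
import re

def addrs_by_interface(ip_addr_output):
    """Map interface name -> list of IPv4 addresses from `ip addr` output."""
    result = {}
    current = None
    for line in ip_addr_output.splitlines():
        m = re.match(r'\d+: (\S+?):', line)
        if m:
            # New interface block, e.g. "2: eth0: mtu 1500 ..."
            current = m.group(1)
            result[current] = []
        elif current and line.strip().startswith('inet '):
            # "inet 10.0.2.15/24 brd ..." -> keep the bare address
            result[current].append(line.split()[1].split('/')[0])
    return result

# Condensed form of the broken output quoted in the ticket:
sample = """\
2: eth0: mtu 1500 qdisc pfifo_fast state UP group default qlen 1000
    inet 10.0.2.15/24 brd 10.0.2.255 scope global eth0
    inet 192.168.33.7/24 brd 192.168.33.255 scope global eth1
3: eth1: mtu 1500 qdisc pfifo_fast state DOWN group default
"""

# In the broken state the private address shows up under eth0, and eth1 is bare:
assert '192.168.33.7' in addrs_by_interface(sample)['eth0']
assert addrs_by_interface(sample)['eth1'] == []
```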
[jira] [Updated] (AURORA-1224) Add a new "min_consecutive_health_checks" setting in .aurora config
[ https://issues.apache.org/jira/browse/AURORA-1224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Erb updated AURORA-1224:
--------------------------------
    Fix Version/s: 0.17.0

> Add a new "min_consecutive_health_checks" setting in .aurora config
> -------------------------------------------------------------------
>
>                 Key: AURORA-1224
>                 URL: https://issues.apache.org/jira/browse/AURORA-1224
>             Project: Aurora
>          Issue Type: Task
>          Components: Client, Scheduler
>            Reporter: Maxim Khutornenko
>            Assignee: Kai Huang
>             Fix For: 0.17.0
>
>
> HealthCheckConfig should accept a new configuration value that tells how many
> consecutive successful health checks an instance requires to move from
> STARTING to RUNNING.
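The requested semantics can be sketched as a tiny state machine (a hedged illustration of the ticket's proposal, not the shipped executor code; the streak-reset behaviour on failure is an assumption):

```python
def transition_after(checks, min_consecutive_health_checks):
    """Return the 1-based index of the health check that moves the instance
    from STARTING to RUNNING, or None if the required streak of consecutive
    successes is never reached. `checks` is a sequence of booleans
    (True = healthy)."""
    streak = 0
    for i, ok in enumerate(checks, start=1):
        streak = streak + 1 if ok else 0  # assumed: any failure resets the streak
        if streak >= min_consecutive_health_checks:
            return i
    return None

# Two consecutive successes required: a failure in between resets the count.
assert transition_after([True, False, True, True], 2) == 4
assert transition_after([True, False, True], 2) is None
```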
[jira] [Updated] (AURORA-1786) -zk_session_timeout option does not work
[ https://issues.apache.org/jira/browse/AURORA-1786?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Erb updated AURORA-1786:
--------------------------------
    Fix Version/s: 0.17.0

> -zk_session_timeout option does not work
> ----------------------------------------
>
>                 Key: AURORA-1786
>                 URL: https://issues.apache.org/jira/browse/AURORA-1786
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: David Robinson
>             Fix For: 0.17.0
>
>
> Looks like the -zk_session_timeout option has no effect. I've set
> -zk_session_timeout="60mins" to attempt to work around ZK session timeouts
> (due to GC pauses caused by TaskHistoryPruner pruning a huge number of
> inactive tasks), but the default of 30 seconds always seems to be used.
> {noformat}
> I0929 22:36:10.804 [main, ArgScanner:411] zk_chroot_path: null
> I0929 22:36:10.804 [main, ArgScanner:411] zk_digest_credentials: :
> I0929 22:36:10.805 [main, ArgScanner:411] zk_endpoints: [zk.example.com:2181]
> I0929 22:36:10.805 [main, ArgScanner:411] zk_in_proc: false
> I0929 22:36:10.805 [main, ArgScanner:411] zk_session_timeout: (30, mins)
> I0929 22:36:10.805 [main, ArgScanner:411] zk_use_curator: true
> {noformat}
> {noformat}
> I0929 22:48:37.678 [AsyncProcessor-3, TaskHistoryPruner:137] Pruning inactive tasks
> [mesos-test-healthy-daemon-19-3588-e2d79602-e354-4dc0-bfaa-b16d32e2b09d,
> mesos-test-healthy-daemon-19-1551-b4b7e52f-f468-44ba-a1a9-ad3c95b602a3,
> mesos-test-healthy-daemon-19-4105-ff87bef1-af09-4201-9cc2-863c8ece3621,
> mesos-test-healthy-daemon-19-7416-66de9261-5fe5-47c4-be37-3dd5
> I0929 22:48:37.738 [AsyncProcessor-5, TaskHistoryPruner:137] Pruning inactive tasks
> [mesos-test-healthy-daemon-19-3588-e2d79602-e354-4dc0-bfaa-b16d32e2b09d,
> mesos-test-healthy-daemon-19-1551-b4b7e52f-f468-44ba-a1a9-ad3c95b602a3,
> mesos-test-healthy-daemon-19-4105-ff87bef1-af09-4201-9cc2-863c8ece3621,
> mesos-test-healthy-daemon-19-7416-66de9261-5fe5-47c4-be37-3dd5
> 2016-09-29 22:48:37,794:47040(0x7f07f4c3c940):ZOO_WARN@zookeeper_interest@1570: Exceeded deadline by 12ms
> I0929 22:48:37.805 [AsyncProcessor-0, TaskHistoryPruner:137] Pruning inactive tasks
> [mesos-test-healthy-daemon-19-3588-e2d79602-e354-4dc0-bfaa-b16d32e2b09d,
> mesos-test-healthy-daemon-19-1551-b4b7e52f-f468-44ba-a1a9-ad3c95b602a3,
> mesos-test-healthy-daemon-19-4105-ff87bef1-af09-4201-9cc2-863c8ece3621,
> mesos-test-healthy-daemon-19-7416-66de9261-5fe5-47c4-be37-3dd5
> I0929 22:48:37.814 [AsyncProcessor-6, MemTaskStore:148] Query took 588 ms:
> ITaskQuery{role=null, environment=null, jobName=null, taskIds=[],
> statuses=[FINISHED, FAILED, KILLED, LOST], instanceIds=[], slaveHosts=[],
> jobKeys=[IJobKey{role=mesos, environment=test, name=healthy-daemon-19}],
> offset=0, limit=0}
> I0929 22:48:37.867 [AsyncProcessor-1, TaskHistoryPruner:137] Pruning inactive tasks
> [mesos-test-healthy-daemon-19-3588-e2d79602-e354-4dc0-bfaa-b16d32e2b09d,
> mesos-test-healthy-daemon-19-1551-b4b7e52f-f468-44ba-a1a9-ad3c95b602a3,
> mesos-test-healthy-daemon-19-4105-ff87bef1-af09-4201-9cc2-863c8ece3621,
> mesos-test-healthy-daemon-19-7416-66de9261-5fe5-47c4-be37-3dd5
> I0929 22:48:37.873 [AsyncProcessor-2, MemTaskStore:148] Query took 304 ms:
> ITaskQuery{role=null, environment=null, jobName=null, taskIds=[],
> statuses=[FINISHED, FAILED, KILLED, LOST], instanceIds=[], slaveHosts=[],
> jobKeys=[IJobKey{role=mesos, environment=test, name=healthy-daemon-19}],
> offset=0, limit=0}
> I0929 22:48:37.875 [AsyncProcessor-7, MemTaskStore:148] Query took 289 ms:
> ITaskQuery{role=null, environment=null, jobName=null, taskIds=[],
> statuses=[FINISHED, FAILED, KILLED, LOST], instanceIds=[], slaveHosts=[],
> jobKeys=[IJobKey{role=mesos, environment=test, name=healthy-daemon-19}],
> offset=0, limit=0}
> I0929 22:48:37.886 [AsyncProcessor-4, TaskHistoryPruner:137] Pruning inactive tasks
> [mesos-test-healthy-daemon-19-3588-e2d79602-e354-4dc0-bfaa-b16d32e2b09d,
> mesos-test-healthy-daemon-19-1551-b4b7e52f-f468-44ba-a1a9-ad3c95b602a3,
> mesos-test-healthy-daemon-19-4105-ff87bef1-af09-4201-9cc2-863c8ece3621,
> mesos-test-healthy-daemon-19-7416-66de9261-5fe5-47c4-be37-3dd5
> I0929 22:48:38.045 [AsyncProcessor-3, MemTaskStore:148] Query took 359 ms:
> ITaskQuery{role=null, environment=null, jobName=null, taskIds=[],
> statuses=[FINISHED, FAILED, KILLED, LOST], instanceIds=[], slaveHosts=[],
> jobKeys=[IJobKey{role=mesos, environment=test, name=healthy-daemon-19}],
> offset=0, limit=0}
> I0929 22:48:38.152 [AsyncProcessor-5, MemTaskStore:148] Query took 405 ms:
> ITaskQuery{role=null, environment=null, jobName=null, taskIds=[],
> statuses=[FINISHED, FAILED, KILLED, LOST], instanceIds=[], slaveHosts=[],
> jobKeys=[IJobKey{role=mesos, environment=test, name=healthy-daemon-19}],
> offset=0, limit=0}
> I0929 22:48:38.407 [AsyncProcessor-0, MemTaskStore:148]
> {noformat}
[jira] [Updated] (AURORA-1878) Increased executor logs can lead to tasks running out of disk space
[ https://issues.apache.org/jira/browse/AURORA-1878?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Erb updated AURORA-1878:
--------------------------------
    Fix Version/s: 0.17.0

> Increased executor logs can lead to tasks running out of disk space
> -------------------------------------------------------------------
>
>                 Key: AURORA-1878
>                 URL: https://issues.apache.org/jira/browse/AURORA-1878
>             Project: Aurora
>          Issue Type: Task
>          Components: Executor
>            Reporter: Joshua Cohen
>            Assignee: Joshua Cohen
>             Fix For: 0.17.0
>
>
> After the health check for updates patch, this log statement is being emitted
> once every 500ms:
> https://github.com/apache/aurora/commit/2992c8b4#diff-6d60c873330419a828fb992f46d53372R121
> This is due to this
> [code|https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/common/status_checker.py#L120-L124]:
> {code}
> if status_result is not None:
>   log.info('%s reported %s' % (status_checker.__class__.__name__, status_result))
> {code}
> Previously, {{status_result}} would be {{None}} unless the status checker had
> a terminal event. Now, {{status_result}} will always be set, but we only
> consider the {{status_result}} to be terminal if the {{status}} is not
> {{TASK_STARTING}} or {{TASK_RUNNING}}. So, for the healthy case, we log that
> the task is {{TASK_RUNNING}} every 500ms.
> !https://frinkiac.com/meme/S10E02/818984.jpg?b64lines=IFRISVMgV0lMTCBTT1VORCBFVkVSWQogVEhSRUUgU0VDT05EUyBVTkxFU1MKIFNPTUVUSElORyBJU04nVCBPS0FZIQ==!
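One possible shape of a fix, sketched here as an assumption rather than the patch Aurora actually shipped: gate the log statement on the result being terminal, so a routine healthy heartbeat stays quiet. `StatusResult` and the status names are stand-ins for the executor's real types:

```python
import logging
from collections import namedtuple

log = logging.getLogger(__name__)

# Stand-in for the executor's status result type (illustrative only).
StatusResult = namedtuple('StatusResult', ['status', 'reason'])

# Assumed: these two statuses are the non-terminal "still healthy" cases.
NON_TERMINAL = ('TASK_STARTING', 'TASK_RUNNING')

def maybe_log(status_checker_name, status_result):
    """Log a status result only when it is terminal, i.e. not a routine
    STARTING/RUNNING heartbeat. Returns True if a line was emitted."""
    if status_result is None:
        return False
    if status_result.status in NON_TERMINAL:
        return False  # healthy heartbeat every 500ms: skip the log line
    log.info('%s reported %s', status_checker_name, status_result)
    return True

assert maybe_log('HealthChecker', StatusResult('TASK_RUNNING', 'healthy')) is False
assert maybe_log('HealthChecker', StatusResult('TASK_FAILED', 'too many failures')) is True
```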
[jira] [Updated] (AURORA-1861) Remove duplicate Snapshot fields for DB stores
[ https://issues.apache.org/jira/browse/AURORA-1861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Erb updated AURORA-1861:
--------------------------------
    Fix Version/s: 0.17.0

> Remove duplicate Snapshot fields for DB stores
> ----------------------------------------------
>
>                 Key: AURORA-1861
>                 URL: https://issues.apache.org/jira/browse/AURORA-1861
>             Project: Aurora
>          Issue Type: Task
>          Components: Scheduler
>            Reporter: David McLaughlin
>            Assignee: David McLaughlin
>             Fix For: 0.17.0
>
>         Attachments: select-all-job-update-details time.png,
> snapshot-create-time-only.png, snapshot-total-time.png
>
>
> Currently we double-write any DB-backed stores into a Snapshot struct when
> creating a Snapshot. This inflates the size of the Snapshot, which is already
> a problem for large production clusters (see AURORA-74).
> Example for LockStore from
> https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/storage/log/SnapshotStoreImpl.java:
> {code}
> new SnapshotField() {
>   // It's important for locks to be replayed first, since there are relations that expect
>   // references to be valid on insertion.
>   @Override
>   public void saveToSnapshot(MutableStoreProvider store, Snapshot snapshot) {
>     snapshot.setLocks(ILock.toBuildersSet(store.getLockStore().fetchLocks()));
>   }
>
>   @Override
>   public void restoreFromSnapshot(MutableStoreProvider store, Snapshot snapshot) {
>     if (hasDbSnapshot(snapshot)) {
>       LOG.info("Deferring lock restore to dbsnapshot");
>       return;
>     }
>     store.getLockStore().deleteLocks();
>     if (snapshot.isSetLocks()) {
>       for (Lock lock : snapshot.getLocks()) {
>         store.getLockStore().saveLock(ILock.build(lock));
>       }
>     }
>   }
> },
> {code}
> The saveToSnapshot here is entirely redundant, as the whole H2 database is
> dumped into the dbScript field.
> Note: one major side-effect is that anyone trying to read these snapshots and
> use the data outside of Java will be unable to process it without also being
> able to apply the DB script.
[jira] [Updated] (AURORA-1792) Executor does not log full task information.
[ https://issues.apache.org/jira/browse/AURORA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Erb updated AURORA-1792:
--------------------------------
    Fix Version/s: 0.17.0

> Executor does not log full task information.
> --------------------------------------------
>
>                 Key: AURORA-1792
>                 URL: https://issues.apache.org/jira/browse/AURORA-1792
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Zameer Manji
>            Assignee: Zameer Manji
>             Fix For: 0.17.0
>
>
> I launched a task that has an {{initial_interval_secs}} in the health check
> config. However, the log contains no information about this field:
> {noformat}
> $ grep "initial_interval_secs" __main__.log
> {noformat}
> We should log the entire ExecutorInfo blob.
[jira] [Updated] (AURORA-1541) Observer logs are noisy
[ https://issues.apache.org/jira/browse/AURORA-1541?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Erb updated AURORA-1541:
--------------------------------
    Fix Version/s: 0.17.0

> Observer logs are noisy
> -----------------------
>
>                 Key: AURORA-1541
>                 URL: https://issues.apache.org/jira/browse/AURORA-1541
>             Project: Aurora
>          Issue Type: Bug
>          Components: Observer
>            Reporter: David Robinson
>            Assignee: Stephan Erb
>            Priority: Minor
>             Fix For: 0.17.0
>
>
> The observer's logs consist of lots of warnings about being unable to find
> PIDs. This is likely due to the checkpoint pointing to PIDs that have been
> cleaned up by Mesos.
> {noformat}
> W1117 20:11:38.103549 33983 process_collector_psutil.py:76] Error during process sampling: no process found with pid 39594
> W1117 20:11:38.151583 33983 process_collector_psutil.py:76] Error during process sampling: no process found with pid 14012
> W1117 20:11:38.232773 33983 process_collector_psutil.py:76] Error during process sampling: no process found with pid 26565
> W1117 20:11:38.486680 33983 process_collector_psutil.py:76] Error during process sampling: no process found with pid 44902
> W1117 20:11:38.612293 33983 process_collector_psutil.py:76] Error during process sampling: no process found with pid 32871
> W1117 20:11:38.694812 33983 process_collector_psutil.py:76] Error during process sampling: no process found with pid 7182
> {noformat}
> The warning messages should probably be debug messages, since Mesos cleaning
> sandboxes is an expected operation.
[jira] [Updated] (AURORA-1789) Incorrect --mesos_containerizer_path value results in thermos failure loop
[ https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Erb updated AURORA-1789:
--------------------------------
    Fix Version/s: 0.17.0

> Incorrect --mesos_containerizer_path value results in thermos failure loop
> --------------------------------------------------------------------------
>
>                 Key: AURORA-1789
>                 URL: https://issues.apache.org/jira/browse/AURORA-1789
>             Project: Aurora
>          Issue Type: Bug
>          Components: Executor
>    Affects Versions: 0.16.0
>            Reporter: Justin Pinkul
>            Assignee: Justin Pinkul
>             Fix For: 0.17.0
>
>
> When using the Mesos containerizer with the namespaces/pid isolator and a
> Docker image, the Thermos executor is unable to launch processes. The
> executor forks the process but is then unable to locate it after the fork.
> {code:title=thermos_runner.INFO}
> I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475789782.842882)
> I1006 21:37:22.931456 75 helper.py:153] Coordinator BigBrother start [pid: 1144] completed.
> I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an abnormal termination
> I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475789842.935872)
> I1006 21:38:23.025332 75 helper.py:153] Coordinator BigBrother start [pid: 1157] completed.
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an abnormal termination
> I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475789903.029694)
> I1006 21:39:23.118841 75 helper.py:153] Coordinator BigBrother start [pid: 1170] completed.
> I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an abnormal termination
> I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475789963.123206)
> I1006 21:40:23.212711 75 helper.py:153] Coordinator BigBrother start [pid: 1183] completed.
> I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an abnormal termination
> I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475790023.21709)
> I1006 21:41:23.307157 75 helper.py:153] Coordinator BigBrother start [pid: 1196] completed.
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an abnormal termination
> I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, fork_time=1475790083.311512)
> I1006 21:42:23.399893 75 helper.py:153] Coordinator BigBrother start [pid: 1209] completed.
> I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an abnormal termination
> {code}
[jira] [Updated] (AURORA-1793) Revert Commit ca683 which is not backwards compatible
[ https://issues.apache.org/jira/browse/AURORA-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Erb updated AURORA-1793:
--------------------------------
    Fix Version/s: 0.17.0

> Revert Commit ca683 which is not backwards compatible
> -----------------------------------------------------
>
>                 Key: AURORA-1793
>                 URL: https://issues.apache.org/jira/browse/AURORA-1793
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Kai Huang
>            Assignee: Kai Huang
>            Priority: Blocker
>             Fix For: 0.17.0
>
>
> The commit ca683cb9e27bae76424a687bc6c3af5a73c501b9 is not backwards
> compatible. We decided to revert this commit.
> The change that directly causes problems is:
> {code}
> Modify executor state transition logic to rely on health checks (if enabled).
> commit ca683cb9e27bae76424a687bc6c3af5a73c501b9
> {code}
> There are two downstream commits that depend on the above commit:
> {code}
> Add min_consecutive_health_checks in HealthCheckConfig
> commit ed72b1bf662d1e29d2bb483b317c787630c26a9e
> {code}
> {code}
> Add support for receiving min_consecutive_successes in health checker
> commit e91130e49445c3933b6e27f5fde18c3a0e61b87a
> {code}
> We will drop all three of these commits and revert to the commit before the
> problematic one:
> {code}
> Running task ssh without an instance should pick a random instance
> commit 59b4d319b8bb5f48ec3880e36f39527f1498a31c
> {code}
[jira] [Updated] (AURORA-1798) resolv.conf is not copied when using the Mesos containerizer with a Docker image
[ https://issues.apache.org/jira/browse/AURORA-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Erb updated AURORA-1798:
--------------------------------
    Fix Version/s: 0.17.0

> resolv.conf is not copied when using the Mesos containerizer with a Docker
> image
> --------------------------------------------------------------------------
>
>                 Key: AURORA-1798
>                 URL: https://issues.apache.org/jira/browse/AURORA-1798
>             Project: Aurora
>          Issue Type: Bug
>          Components: Executor
>    Affects Versions: 0.16.0
>            Reporter: Justin Pinkul
>            Assignee: Justin Pinkul
>             Fix For: 0.17.0
>
>
> When Thermos launches a task using a Docker image, it mounts the image as a
> volume and manually chroots into it. One consequence of this is that the
> logic inside of the {{network/cni}} isolator that copies {{resolv.conf}} from
> the host into the new rootfs is bypassed. The Thermos executor should
> manually copy this file into the rootfs until Mesos pod support is
> implemented.
[jira] [Updated] (AURORA-1225) Modify executor state transition logic to rely on health checks (if enabled)
[ https://issues.apache.org/jira/browse/AURORA-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Erb updated AURORA-1225:
--------------------------------
    Fix Version/s: 0.17.0

> Modify executor state transition logic to rely on health checks (if enabled)
> ----------------------------------------------------------------------------
>
>                 Key: AURORA-1225
>                 URL: https://issues.apache.org/jira/browse/AURORA-1225
>             Project: Aurora
>          Issue Type: Task
>          Components: Executor
>            Reporter: Maxim Khutornenko
>            Assignee: Santhosh Kumar Shanmugham
>             Fix For: 0.17.0
>
>
> The executor needs to start executing user content in STARTING and transition
> to RUNNING once the required number of consecutive successful health checks
> is reached.
[jira] [Updated] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Erb updated AURORA-1791:
--------------------------------
    Fix Version/s: 0.17.0

> Commit ca683 is not backwards compatible.
> -----------------------------------------
>
>                 Key: AURORA-1791
>                 URL: https://issues.apache.org/jira/browse/AURORA-1791
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Zameer Manji
>            Assignee: Kai Huang
>            Priority: Blocker
>             Fix For: 0.17.0
>
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 |
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
> is not backwards compatible. The last section of the commit
> {quote}
> 4. Modified the Health Checker and redefined the meaning of
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
> initial_interval_secs: 10
> interval_secs: 5
> max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking occurs for the first 10 seconds,
> so the earliest a task can fail is at the 10th second.
> On master, health checking starts right away, which means the task can fail
> within the first second, since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to
> initial_interval_secs and have the task transition into RUNNING when
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum consecutive successes.
> {noformat}
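The compatibility break can be made concrete with a little arithmetic. The function below is a hedged model of the two behaviours described in the ticket, not executor code: with a check every `interval_secs` and failure declared after `max_consecutive_failures` consecutive misses, the earliest possible failure time depends entirely on whether `initial_interval_secs` is honoured as a grace period.

```python
def earliest_failure_secs(initial_interval_secs, interval_secs,
                          max_consecutive_failures, grace_period_honoured):
    """Earliest time (seconds) at which an always-unhealthy task can be
    declared failed. With the grace period, checking only begins at
    initial_interval_secs; without it, the first check can fail at t=0."""
    first_check = initial_interval_secs if grace_period_honoured else 0
    # After the first failing check, each further required failure costs
    # one more interval.
    return first_check + (max_consecutive_failures - 1) * interval_secs

# Config from the ticket: initial 10s, interval 5s, 1 allowed failure.
assert earliest_failure_secs(10, 5, 1, grace_period_honoured=True) == 10
assert earliest_failure_secs(10, 5, 1, grace_period_honoured=False) == 0  # right away
```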
[jira] [Updated] (AURORA-1795) Internal server error in scheduler Thrift API on missing Content-Type
[ https://issues.apache.org/jira/browse/AURORA-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Erb updated AURORA-1795:
--------------------------------
    Fix Version/s: 0.17.0

> Internal server error in scheduler Thrift API on missing Content-Type
> ---------------------------------------------------------------------
>
>                 Key: AURORA-1795
>                 URL: https://issues.apache.org/jira/browse/AURORA-1795
>             Project: Aurora
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 0.16.0
>            Reporter: Stephan Erb
>            Assignee: Zameer Manji
>             Fix For: 0.17.0
>
>
> This happens if a user has a very old browser, e.g. Firefox 41.
> {code}
> I1017 09:38:15.618 [qtp1426166274-44336, Slf4jRequestLog:60] 10.x.x.x - - [17/Oct/2016:09:38:15 +] "POST //foobar.example.org/api HTTP/1.1" 200 794
> W1017 09:38:15.627 [qtp1426166274-44066, ServletHandler:631] /api
> java.lang.NullPointerException: null
>         at java.util.Objects.requireNonNull(Objects.java:203) ~[na:1.8.0-internal]
>         at java.util.Optional.(Optional.java:96) ~[na:1.8.0-internal]
>         at java.util.Optional.of(Optional.java:108) ~[na:1.8.0-internal]
>         at org.apache.aurora.scheduler.http.api.TContentAwareServlet.doPost(TContentAwareServlet.java:123) ~[aurora-0.16.0.jar:na]
>         at org.apache.aurora.scheduler.http.api.TContentAwareServlet.doGet(TContentAwareServlet.java:164) ~[aurora-0.16.0.jar:na]
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) ~[javax.servlet-api-3.1.0.jar:3.1.0]
>         at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) ~[javax.servlet-api-3.1.0.jar:3.1.0]
>         at com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) ~[guice-servlet-3.0.jar:na]
>         at org.apache.aurora.scheduler.http.LeaderRedirectFilter.doFilter(LeaderRedirectFilter.java:72) ~[aurora-0.16.0.jar:na]
>         at org.apache.aurora.scheduler.http.AbstractFilter.doFilter(AbstractFilter.java:44) ~[aurora-0.16.0.jar:na]
>         at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) ~[guice-servlet-3.0.jar:na]
>         at org.apache.aurora.scheduler.http.HttpStatsFilter.doFilter(HttpStatsFilter.java:71) ~[aurora-0.16.0.jar:na]
>         at org.apache.aurora.scheduler.http.AbstractFilter.doFilter(AbstractFilter.java:44) ~[aurora-0.16.0.jar:na]
>         at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168) ~[guice-servlet-3.0.jar:na]
>         at com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) ~[guice-servlet-3.0.jar:na]
>         at
> {code}
[jira] [Updated] (AURORA-655) Order job update events and instance events by ID rather than timestamp
[ https://issues.apache.org/jira/browse/AURORA-655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Erb updated AURORA-655:
-------------------------------
    Fix Version/s: 0.17.0

> Order job update events and instance events by ID rather than timestamp
> -----------------------------------------------------------------------
>
>                 Key: AURORA-655
>                 URL: https://issues.apache.org/jira/browse/AURORA-655
>             Project: Aurora
>          Issue Type: Story
>          Components: Scheduler
>            Reporter: Bill Farner
>            Assignee: Jing Chen
>            Priority: Trivial
>              Labels: newbie
>             Fix For: 0.17.0
>
>
> In {{JobUpdateDetailsMapper.xml}} we order by timestamps, which could be
> brittle if the system time changes. Instead of using the timestamp, use the
> built-in database {{IDENTITY}} for sort order.
[jira] [Updated] (AURORA-1684) Cron tasks are sanitized multiple times (once when being created via the API, and again when actually being triggered)
[ https://issues.apache.org/jira/browse/AURORA-1684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Erb updated AURORA-1684: Fix Version/s: 0.17.0 > Cron tasks are sanitized multiple times (once when being created via the API, > and again when actually being triggered) > -- > > Key: AURORA-1684 > URL: https://issues.apache.org/jira/browse/AURORA-1684 > Project: Aurora > Issue Type: Bug >Reporter: Steve Niemitz >Assignee: Steve Niemitz > Fix For: 0.17.0 > > > This can cause issues in the following scenario: > - An operator sets default_docker_parameters on the scheduler > - The operator DOES NOT allow docker paramters (via allow_docker_parameters) > - A user schedules a cron job using a docker container. > Because the first pass of ConfigurationManager.validateAndPopulate will > mutate the task to have docker parameters (the defaults), the second pass in > SanitizedCronJob.fromUnsanitized will fail validation. > A solution here may be to remove fromUnsanitized and instead pass the job > configuration directly, since we know it will always be safe. > {code} > W0427 17:01:35.286 [QuartzScheduler_Worker-5, AuroraCronJob:134] Invalid cron > job for IJobKey{role=tcdc-infra, environment=prod, > name=security-group-alerter} in storage - failed to parse with {} > org.apache.aurora.scheduler.configuration.ConfigurationManager$TaskDescriptionException: > Docker parameters not allowed. 
> at > org.apache.aurora.scheduler.configuration.ConfigurationManager.validateAndPopulate(ConfigurationManager.java:249) > ~[aurora-0.13.0-SNAPSHOT.jar:na] > at > org.apache.aurora.scheduler.configuration.ConfigurationManager.validateAndPopulate(ConfigurationManager.java:166) > ~[aurora-0.13.0-SNAPSHOT.jar:na] > at > org.apache.aurora.scheduler.configuration.SanitizedConfiguration.fromUnsanitized(SanitizedConfiguration.java:60) > ~[aurora-0.13.0-SNAPSHOT.jar:na] > at > org.apache.aurora.scheduler.cron.SanitizedCronJob.(SanitizedCronJob.java:45) > ~[aurora-0.13.0-SNAPSHOT.jar:na] > at > org.apache.aurora.scheduler.cron.SanitizedCronJob.fromUnsanitized(SanitizedCronJob.java:102) > ~[aurora-0.13.0-SNAPSHOT.jar:na] > at > org.apache.aurora.scheduler.cron.quartz.AuroraCronJob.lambda$doExecute$163(AuroraCronJob.java:132) > ~[aurora-0.13.0-SNAPSHOT.jar:na] > at > org.apache.aurora.scheduler.storage.log.LogStorage.lambda$doInTransaction$222(LogStorage.java:524) > ~[aurora-0.13.0-SNAPSHOT.jar:na] > at > org.apache.aurora.scheduler.storage.db.DbStorage.transactionedWrite(DbStorage.java:160) > ~[aurora-0.13.0-SNAPSHOT.jar:na] > at > org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$dd3bfcb4.CGLIB$transactionedWrite$2() > ~[guice-3.0.jar:na] > at > org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$dd3bfcb4$$FastClassByGuice$$e3e3ff55.invoke() > ~[guice-3.0.jar:na] > at > com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228) > ~[guice-3.0.jar:na] > at > com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) > ~[guice-3.0.jar:na] > at > org.mybatis.guice.transactional.TransactionalMethodInterceptor.invoke(TransactionalMethodInterceptor.java:101) > ~[mybatis-guice-3.7.jar:3.7] > at > com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) > ~[guice-3.0.jar:na] > at > 
com.google.inject.internal.InterceptorStackCallback.intercept(InterceptorStackCallback.java:52) > ~[guice-3.0.jar:na] > at > org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$dd3bfcb4.transactionedWrite() > ~[guice-3.0.jar:na] > at > org.apache.aurora.scheduler.storage.db.DbStorage.lambda$write$188(DbStorage.java:174) > ~[aurora-0.13.0-SNAPSHOT.jar:na] > at > org.apache.aurora.scheduler.async.GatingDelayExecutor.closeDuring(GatingDelayExecutor.java:62) > ~[aurora-0.13.0-SNAPSHOT.jar:na] > at > org.apache.aurora.scheduler.storage.db.DbStorage.write(DbStorage.java:172) > ~[aurora-0.13.0-SNAPSHOT.jar:na] > at > org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$dd3bfcb4.CGLIB$write$3() > ~[guice-3.0.jar:na] > at > org.apache.aurora.scheduler.storage.db.DbStorage$$EnhancerByGuice$$dd3bfcb4$$FastClassByGuice$$e3e3ff55.invoke() > ~[guice-3.0.jar:na] > at > com.google.inject.internal.cglib.proxy.$MethodProxy.invokeSuper(MethodProxy.java:228) > ~[guice-3.0.jar:na] > at > com.google.inject.internal.InterceptorStackCallback$InterceptedMethodInvocation.proceed(InterceptorStackCallback.java:72) >
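The double-validation failure described in AURORA-1684 can be modeled in a few lines of Python (names and the default parameter are illustrative, not the actual ConfigurationManager code): the first sanitization pass writes default Docker parameters back into the stored config, so the second pass sees them as user-supplied and rejects the job.

```python
# Hypothetical model of the double-sanitization bug; names are illustrative.
ALLOW_DOCKER_PARAMETERS = False
DEFAULT_DOCKER_PARAMETERS = [("cap-add", "SYS_PTRACE")]

class TaskDescriptionException(Exception):
    pass

def validate_and_populate(config):
    """First pass fills in defaults; later passes then see them as user-supplied."""
    if config["docker_parameters"] and not ALLOW_DOCKER_PARAMETERS:
        raise TaskDescriptionException("Docker parameters not allowed.")
    if not config["docker_parameters"]:
        # Mutation: the defaults are written back into the stored config.
        config["docker_parameters"] = list(DEFAULT_DOCKER_PARAMETERS)
    return config

config = {"docker_parameters": []}
validate_and_populate(config)      # first pass (create API): succeeds, injects defaults
try:
    validate_and_populate(config)  # second pass (cron trigger): now fails
    failed = False
except TaskDescriptionException:
    failed = True
```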
[jira] [Updated] (AURORA-1794) Scheduler fails to start if -enable_revocable_ram is toggled
[ https://issues.apache.org/jira/browse/AURORA-1794?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Erb updated AURORA-1794: Fix Version/s: 0.17.0 > Scheduler fails to start if -enable_revocable_ram is toggled > > > Key: AURORA-1794 > URL: https://issues.apache.org/jira/browse/AURORA-1794 > Project: Aurora > Issue Type: Story >Affects Versions: 0.16.0 >Reporter: Stephan Erb >Assignee: Stephan Erb > Fix For: 0.17.0 > > > The scheduler does not start if {{-enable_revocable_ram}} is set: > {code} > Exception in thread "main" java.lang.IllegalStateException: A value cannot be > changed after it was read. > at > com.google.common.base.Preconditions.checkState(Preconditions.java:174) > at org.apache.aurora.common.args.Arg.set(Arg.java:54) > at > org.apache.aurora.common.args.ArgumentInfo.setValue(ArgumentInfo.java:128) > at org.apache.aurora.common.args.OptionInfo.load(OptionInfo.java:131) > at > org.apache.aurora.common.args.ArgScanner.process(ArgScanner.java:368) > at org.apache.aurora.common.args.ArgScanner.parse(ArgScanner.java:200) > at org.apache.aurora.common.args.ArgScanner.parse(ArgScanner.java:178) > at org.apache.aurora.common.args.ArgScanner.parse(ArgScanner.java:155) > at > org.apache.aurora.scheduler.app.SchedulerMain.applyStaticArgumentValues(SchedulerMain.java:226) > at > org.apache.aurora.scheduler.app.SchedulerMain.main(SchedulerMain.java:197) > {code} > This is an unfortunate oversight at my end. When introducing the feature, I > deferred the e2e test. It 'worked' in a manual test - at least that is what I > believed. Probably, I had only added the flag to the config in the repo, but > not to the one that was actually started in vagrant. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
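The stack trace points at a write-once argument cell: once a flag's value has been read, a later attempt to set it throws. A rough Python analogue of that Arg behavior (a sketch, not the actual org.apache.aurora.common.args code):

```python
class Arg:
    """Write-once argument cell: setting after the first read is an error."""
    def __init__(self, default=None):
        self._value = default
        self._read = False

    def set(self, value):
        if self._read:
            raise RuntimeError("A value cannot be changed after it was read.")
        self._value = value

    def get(self):
        self._read = True
        return self._value

arg = Arg(default=False)
arg.get()          # something reads the flag before arg parsing finishes...
try:
    arg.set(True)  # ...so loading -enable_revocable_ram later blows up
    blew_up = False
except RuntimeError:
    blew_up = True
```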
[jira] [Updated] (AURORA-1880) How to set the environment variable for Mesos Containerizer?
[ https://issues.apache.org/jira/browse/AURORA-1880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Erb updated AURORA-1880: Fix Version/s: 0.17.0 > How to set the environment variable for Mesos Containerizer? > > > Key: AURORA-1880 > URL: https://issues.apache.org/jira/browse/AURORA-1880 > Project: Aurora > Issue Type: Bug >Affects Versions: 0.16.0, 0.15.0 >Reporter: jackyoh > Fix For: 0.17.0 > > > I'm running a Docker container on the Aurora framework. > The question is: how to set the environment variable for the Mesos Containerizer? > For example: > docker run -e ENV1=env1 ... -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1110) Running task ssh without an instance should pick a random instance
[ https://issues.apache.org/jira/browse/AURORA-1110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Erb updated AURORA-1110: Fix Version/s: 0.17.0 > Running task ssh without an instance should pick a random instance > -- > > Key: AURORA-1110 > URL: https://issues.apache.org/jira/browse/AURORA-1110 > Project: Aurora > Issue Type: Story > Components: Client >Reporter: Joshua Cohen >Assignee: Jing Chen >Priority: Trivial > Labels: newbie > Fix For: 0.17.0 > > > I always forget to add an instance to the end of the job key when ssh'ing. It > might be nice if running {{aurora task ssh ...}} without specifying an > instance either picked a random instance or just defaulted to instance 0. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
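The requested behavior is a small client-side change; a hypothetical sketch (the helper name and signature are assumptions, not the real aurora client code):

```python
import random

def pick_instance(job_key_instance, active_instances):
    """Return the explicit instance if one was given, else a random active one."""
    if job_key_instance is not None:
        return job_key_instance
    if not active_instances:
        raise ValueError("no active instances to ssh into")
    return random.choice(sorted(active_instances))
```

Defaulting to instance 0 instead (the ticket's other suggestion) would just replace the `random.choice` line with `return 0`.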
[jira] [Updated] (AURORA-1875) The thriftw compatibility thrift binary check is too loose
[ https://issues.apache.org/jira/browse/AURORA-1875?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Erb updated AURORA-1875: Fix Version/s: 0.17.0 > The thriftw compatibility thrift binary check is too loose > -- > > Key: AURORA-1875 > URL: https://issues.apache.org/jira/browse/AURORA-1875 > Project: Aurora > Issue Type: Bug >Reporter: John Sirois >Assignee: John Sirois > Fix For: 0.17.0 > > > Right now the > [check|https://github.com/apache/aurora/blob/master/build-support/thrift/thriftw#L31] > is only for the proper version. We need to also check java and python > codegen are both supported by the binary since we use both. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
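Beyond the version string, thriftw would also need to confirm the binary was built with both the java and py generators. A hedged sketch of that extra check in Python, scanning the compiler's help output (the exact help format varies by thrift version, so the matching is deliberately loose):

```python
def has_generators(help_text, required=("java", "py")):
    """Check that a thrift help dump mentions each required code generator.

    The parsing is intentionally loose: we only require each generator
    name to appear somewhere in the (lowercased) help text.
    """
    lowered = help_text.lower()
    return all(gen in lowered for gen in required)
```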
[jira] [Updated] (AURORA-1858) Expose stats on offers known to scheduler
[ https://issues.apache.org/jira/browse/AURORA-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Erb updated AURORA-1858: Fix Version/s: 0.17.0 > Expose stats on offers known to scheduler > - > > Key: AURORA-1858 > URL: https://issues.apache.org/jira/browse/AURORA-1858 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Mehrdad Nurolahzade >Priority: Minor > Labels: newbie > Fix For: 0.17.0 > > > Expose stats on the number of offers tracked by {{OfferManager}}. This can > simply be defined as a collection size gauge on {{offers}} set. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
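The proposed stat is a gauge whose read callback samples the live size of the offers set. A minimal Python sketch of that pattern (the real scheduler would register this via its StatsProvider; the names here are illustrative):

```python
class Gauges:
    """Tiny gauge registry: each stat is a zero-arg callable sampled on read."""
    def __init__(self):
        self._gauges = {}

    def register(self, name, fn):
        self._gauges[name] = fn

    def sample(self):
        return {name: fn() for name, fn in self._gauges.items()}

offers = set()
stats = Gauges()
# The gauge closes over the offers set, so each sample reflects its current size.
stats.register("outstanding_offers", lambda: len(offers))

offers.update({"offer-1", "offer-2"})
```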
[jira] [Updated] (AURORA-1823) `createJob` API uses single thread to move all tasks to PENDING
[ https://issues.apache.org/jira/browse/AURORA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Erb updated AURORA-1823: Fix Version/s: 0.17.0 > `createJob` API uses single thread to move all tasks to PENDING > > > Key: AURORA-1823 > URL: https://issues.apache.org/jira/browse/AURORA-1823 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji >Priority: Minor > Fix For: 0.17.0 > > > If you create a single job with many tasks (let's say 10k+) the `createJob` > API will take a long time. This is because the `createJob` API only returns > when all of the tasks have moved to PENDING and it uses a single thread to do > so. Here is a snippet of the logs: > {noformat} > ... > I1116 17:11:53.964 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a > state machine transition INIT -> PENDING > I1116 17:11:53.965 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a > I1116 17:11:54.094 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80 > state machine transition INIT -> PENDING > I1116 17:11:54.094 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80 > I1116 17:11:54.223 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03 > state machine transition INIT -> PENDING > I1116 17:11:54.224 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03 > I1116 17:11:54.353 
[qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570 > state machine transition INIT -> PENDING > I1116 17:11:54.353 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570 > I1116 17:11:54.482 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67 > state machine transition INIT -> PENDING > I1116 17:11:54.482 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67 > I1116 17:11:54.611 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153 > state machine transition INIT -> PENDING > I1116 17:11:54.612 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153 > I1116 17:11:54.741 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9 > state machine transition INIT -> PENDING > I1116 17:11:54.742 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9 > ... > {noformat} > Observe that a single jetty thread is doing this. > We should leverage {{BatchWorker}} to have concurrent mutations here. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
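The suggested fix fans the INIT -> PENDING transitions out over a worker pool instead of a single jetty thread. A rough Python sketch of the batching idea (BatchWorker is Aurora-internal, so this only illustrates the approach; the transition body is a placeholder):

```python
from concurrent.futures import ThreadPoolExecutor

def move_to_pending(task_ids, batch_size=1000, workers=4):
    """Partition tasks into batches and transition each batch on a pool thread."""
    def transition(batch):
        # Placeholder for the real per-task SAVE_STATE / state machine work.
        return [(task, "PENDING") for task in batch]

    batches = [task_ids[i:i + batch_size]
               for i in range(0, len(task_ids), batch_size)]
    pending = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves batch order, so results come back deterministically.
        for result in pool.map(transition, batches):
            pending.extend(result)
    return pending
```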
[jira] [Updated] (AURORA-1737) Descheduling a cron job checks role access before job key existence
[ https://issues.apache.org/jira/browse/AURORA-1737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Erb updated AURORA-1737: Fix Version/s: 0.17.0 > Descheduling a cron job checks role access before job key existence > --- > > Key: AURORA-1737 > URL: https://issues.apache.org/jira/browse/AURORA-1737 > Project: Aurora > Issue Type: Bug > Components: Scheduler >Reporter: Joshua Cohen >Assignee: Jing Chen >Priority: Minor > Fix For: 0.17.0 > > > Trying to deschedule a cron job for a non-existent role returns a permission > error rather than a no-such-job error. This leads to confusion for users in > the event of a typo in the role. > Given that jobs are world-readable, we should check for a valid job key > before applying permissions. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
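The fix amounts to reordering two checks in the deschedule path: resolve the job key first, and only consult permissions when the job actually exists. A hypothetical sketch (the return codes and auth model are assumptions, not the scheduler's actual API):

```python
def deschedule_cron(job_key, caller, cron_jobs, authorized_roles):
    """Existence is checked before authorization, since job keys are world-readable."""
    if job_key not in cron_jobs:
        return "JOB_NOT_FOUND"          # a typo'd role now surfaces as no-such-job
    role = job_key.split("/")[0]
    if role not in authorized_roles.get(caller, set()):
        return "PERMISSION_DENIED"
    del cron_jobs[job_key]
    return "OK"
```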
[jira] [Updated] (AURORA-1787) `-global_container_mounts` does not appear to work with the unified containerizer
[ https://issues.apache.org/jira/browse/AURORA-1787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Erb updated AURORA-1787: Fix Version/s: 0.17.0 > `-global_container_mounts` does not appear to work with the unified > containerizer > - > > Key: AURORA-1787 > URL: https://issues.apache.org/jira/browse/AURORA-1787 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Priority: Critical > Fix For: 0.17.0 > > > Perhaps I misunderstand how this feature is supposed to be used, but apply > the following patch to master: > {noformat} > From 1ebb5f4c5815c647e31f3253d5e5c316a0d5edd2 Mon Sep 17 00:00:00 2001 > From: Zameer Manji> Date: Tue, 4 Oct 2016 20:45:41 -0700 > Subject: [PATCH] Reproduce the issue. > --- > examples/vagrant/upstart/aurora-scheduler.conf | 2 +- > src/test/sh/org/apache/aurora/e2e/run-server.sh| 4 > .../sh/org/apache/aurora/e2e/test_end_to_end.sh| 26 > +++--- > 3 files changed, 18 insertions(+), 14 deletions(-) > diff --git a/examples/vagrant/upstart/aurora-scheduler.conf > b/examples/vagrant/upstart/aurora-scheduler.conf > index 91b27d7..851b5a1 100644 > --- a/examples/vagrant/upstart/aurora-scheduler.conf > +++ b/examples/vagrant/upstart/aurora-scheduler.conf > @@ -40,7 +40,7 @@ exec bin/aurora-scheduler \ >-native_log_file_path=/var/db/aurora \ >-backup_dir=/var/lib/aurora/backups \ >-thermos_executor_path=$DIST_DIR/thermos_executor.pex \ > - > -global_container_mounts=/home/vagrant/aurora/examples/vagrant/config:/home/vagrant/aurora/examples/vagrant/config:ro > \ > + -global_container_mounts=/etc/rsyslog.d:rsyslog.d.container:ro \ >-thermos_executor_flags="--announcer-ensemble localhost:2181 > --announcer-zookeeper-auth-config > /home/vagrant/aurora/examples/vagrant/config/announcer-auth.json > --mesos-containerizer-path=/usr/libexec/mesos/mesos-containerizer" \ >-allowed_container_types=MESOS,DOCKER \ >-http_authentication_mechanism=BASIC \ > diff --git a/src/test/sh/org/apache/aurora/e2e/run-server.sh > 
b/src/test/sh/org/apache/aurora/e2e/run-server.sh > index 1fe0909..a0ee76f 100755 > --- a/src/test/sh/org/apache/aurora/e2e/run-server.sh > +++ b/src/test/sh/org/apache/aurora/e2e/run-server.sh > @@ -1,6 +1,10 @@ > #!/bin/bash > > echo "Starting up server..." > +if [ ! -d "./rsyslog.d.container" ]; then > + echo "Mountpoint Doesn't Exist"; > + exit 1; > +fi > while true > do >echo -e "HTTP/1.1 200 OK\r\n\r\nHello from a filesystem image." | nc -l > "$1" > diff --git a/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh > b/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh > index c93be9b..094d776 100755 > --- a/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh > +++ b/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh > @@ -514,27 +514,27 @@ trap collect_result EXIT > aurorabuild all > setup_ssh > > -test_version > -test_http_example "${TEST_JOB_ARGS[@]}" > -test_health_check > +# test_version > +# test_http_example "${TEST_JOB_ARGS[@]}" > +# test_health_check > > -test_http_example_basic "${TEST_JOB_REVOCABLE_ARGS[@]}" > +# test_http_example_basic "${TEST_JOB_REVOCABLE_ARGS[@]}" > > -test_http_example_basic "${TEST_JOB_GPU_ARGS[@]}" > +# test_http_example_basic "${TEST_JOB_GPU_ARGS[@]}" > > # build the test docker image > -sudo docker build -t http_example -f "${TEST_ROOT}/Dockerfile.python" > ${TEST_ROOT} > -test_http_example "${TEST_JOB_DOCKER_ARGS[@]}" > +# sudo docker build -t http_example -f "${TEST_ROOT}/Dockerfile.python" > ${TEST_ROOT} > +# test_http_example "${TEST_JOB_DOCKER_ARGS[@]}" > > setup_image_stores > test_appc_unified > -test_docker_unified > +# test_docker_unified > > -test_admin "${TEST_ADMIN_ARGS[@]}" > -test_basic_auth_unauthenticated "${TEST_JOB_ARGS[@]}" > +# test_admin "${TEST_ADMIN_ARGS[@]}" > +# test_basic_auth_unauthenticated "${TEST_JOB_ARGS[@]}" > > -test_ephemeral_daemon_with_final > "${TEST_JOB_EPHEMERAL_DAEMON_WITH_FINAL_ARGS[@]}" > +# test_ephemeral_daemon_with_final > "${TEST_JOB_EPHEMERAL_DAEMON_WITH_FINAL_ARGS[@]}" > 
> -/vagrant/src/test/sh/org/apache/aurora/e2e/test_kerberos_end_to_end.sh > -/vagrant/src/test/sh/org/apache/aurora/e2e/test_bypass_leader_redirect_end_to_end.sh > +# /vagrant/src/test/sh/org/apache/aurora/e2e/test_kerberos_end_to_end.sh > +# > /vagrant/src/test/sh/org/apache/aurora/e2e/test_bypass_leader_redirect_end_to_end.sh > RETCODE=0 > -- > 2.10.0 > {noformat} > You can apply the patch by copying the content to a {{.patch}} file and > running {{git am < file.patch}} > Run the e2e tests. > Observe that the tests fail because the tasks fail. The tasks fail because > the mountpoint in their sandbox does not exist. > I observe the correct
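For reference, the flag value in the patch above uses a host_path:container_path:mode triple. A hedged parser sketch (the default mode below is an assumption for illustration, not the scheduler's documented behavior):

```python
def parse_mount(spec):
    """Split a host_path:container_path[:mode] mount spec; mode defaults to 'ro'."""
    parts = spec.split(":")
    if len(parts) == 2:
        host, container = parts
        mode = "ro"  # assumed default, for illustration only
    elif len(parts) == 3:
        host, container, mode = parts
    else:
        raise ValueError("expected host_path:container_path[:mode], got %r" % spec)
    if mode not in ("ro", "rw"):
        raise ValueError("mode must be 'ro' or 'rw'")
    return host, container, mode
```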
[jira] [Updated] (AURORA-894) Server updater should watch healthy instances
[ https://issues.apache.org/jira/browse/AURORA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Erb updated AURORA-894: --- Fix Version/s: 0.17.0 > Server updater should watch healthy instances > - > > Key: AURORA-894 > URL: https://issues.apache.org/jira/browse/AURORA-894 > Project: Aurora > Issue Type: Epic > Components: Scheduler >Reporter: Maxim Khutornenko > Labels: 2015-Q2 > Fix For: 0.17.0 > > > Instead of starting the {{minWaitInInstanceRunningMs}} (aka {{watch_secs}}) > countdown when an instance reaches RUNNING state, the updater should rely on > the first successful health check instead. This will potentially speed up > updates as the {{minWaitInInstanceRunningMs}} will no longer have to be > chosen based on the worst observed instance startup/warmup delay but rather > as a desired health check duration according to the following formula: > {noformat} > minWaitInInstanceRunningMs = interval_secs x num_desired_healthchecks x 1000 > {noformat} > where: > {{interval_secs}} - > https://github.com/apache/incubator-aurora/blob/master/docs/configuration-reference.md#healthcheckconfig-objects > {{num_desired_healthchecks}} - the desired number of OK health checks to > observe before declaring an instance updated successfully > > The above would allow each instance's watch interval to depend on its own startup performance, potentially letting the updater exit earlier. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
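The formula is plain arithmetic; a one-line helper makes the units explicit (the example values are illustrative, not a recommendation):

```python
def min_wait_in_instance_running_ms(interval_secs, num_desired_healthchecks):
    """Derive the watch time (ms) from the desired number of OK health checks."""
    return interval_secs * num_desired_healthchecks * 1000

# e.g. a 10s health check interval with 3 consecutive OK checks desired
watch_ms = min_wait_in_instance_running_ms(10, 3)
```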
[jira] [Updated] (AURORA-343) HTTP thrift service is not over SSL
[ https://issues.apache.org/jira/browse/AURORA-343?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Erb updated AURORA-343: --- Fix Version/s: 0.17.0 > HTTP thrift service is not over SSL > --- > > Key: AURORA-343 > URL: https://issues.apache.org/jira/browse/AURORA-343 > Project: Aurora > Issue Type: Bug > Components: Scheduler >Reporter: Bill Farner >Assignee: Stephan Erb >Priority: Minor > Labels: newbie > Fix For: 0.17.0 > > > {{SchedulerAPIServlet}} is bound against the default debug HTTP server, which > is non-encrypted. This leaves the door open to snooping. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-133) write_lock_wait_nanos stat is misleading and of little use
[ https://issues.apache.org/jira/browse/AURORA-133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Erb updated AURORA-133: --- Fix Version/s: 0.17.0 > write_lock_wait_nanos stat is misleading and of little use > -- > > Key: AURORA-133 > URL: https://issues.apache.org/jira/browse/AURORA-133 > Project: Aurora > Issue Type: Bug > Components: Scheduler >Reporter: Bill Farner >Priority: Minor > Fix For: 0.17.0 > > > {{write_lock_wait_nanos}} is not useful since intrinsic lock on > {{LogStorage}} will be contended for and held by the time the read/write lock > is acquired -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Reopened] (AURORA-1712) Debian Jessie packages are embedding the mesos egg built for Ubuntu trusty
[ https://issues.apache.org/jira/browse/AURORA-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Erb reopened AURORA-1712: - This bug is only fixed once we adapt the packaging scripts. > Debian Jessie packages are embedding the mesos egg built for Ubuntu trusty > --- > > Key: AURORA-1712 > URL: https://issues.apache.org/jira/browse/AURORA-1712 > Project: Aurora > Issue Type: Bug >Reporter: Stephan Erb >Assignee: Renan DelValle > > The Debian packaging scripts for Trusty and Jessie are sharing the same > override mechanism for the pants third_party repository. We therefore end up > using egg files built for Ubuntu also on Debian > (https://github.com/apache/aurora-packaging/blob/master/specs/debian/aurora-pants.ini) > It seems like this is kind of working, but is clearly not optimal. > We should extend > https://github.com/apache/aurora/blob/master/build-support/python/make-mesos-native-egg > to support Debian and then make use of it in our packaging infrastructure > https://github.com/apache/aurora-packaging. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (AURORA-894) Server updater should watch healthy instances
[ https://issues.apache.org/jira/browse/AURORA-894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephan Erb resolved AURORA-894. Resolution: Fixed > Server updater should watch healthy instances > - > > Key: AURORA-894 > URL: https://issues.apache.org/jira/browse/AURORA-894 > Project: Aurora > Issue Type: Epic > Components: Scheduler >Reporter: Maxim Khutornenko > Labels: 2015-Q2 > > Instead of starting the {{minWaitInInstanceRunningMs}} (aka {{watch_secs}}) > countdown when an instance reaches RUNNING state, the updater should rely on > the first successful health check instead. This will potentially speed up > updates as the {{minWaitInInstanceRunningMs}} will no longer have to be > chosen based on the worst observed instance startup/warmup delay but rather > as a desired health check duration according to the following formula: > {noformat} > minWaitInInstanceRunningMs = interval_secs x num_desired_healthchecks x 1000 > {noformat} > where: > {{interval_secs}} - > https://github.com/apache/incubator-aurora/blob/master/docs/configuration-reference.md#healthcheckconfig-objects > {{num_desired_healthchecks}} - the desired number of OK health checks to > observe before declaring an instance updated successfully > > The above would allow each instance's watch interval to depend on its own startup performance, potentially letting the updater exit earlier. -- This message was sent by Atlassian JIRA (v6.3.15#6346)