[jira] [Commented] (AURORA-1897) Remove task length restrictions.
[ https://issues.apache.org/jira/browse/AURORA-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047016#comment-16047016 ] Zameer Manji commented on AURORA-1897: -- {noformat} commit 40d9d4dbec86cb4a17e281dc10ede25e83613eff Author: Zameer Manji Date: Mon Jun 12 13:14:18 2017 -0700 Remove restriction on task id length. To work around an old Mesos bug (MESOS-691) we would reject jobs that resulted in Mesos task ids longer than 255 characters. This is because Mesos used to use the task id to generate the cgroup path. Now that Mesos uses its own id, we no longer need to work around this bug. This removes the restriction in the API layer. This is useful because some users may have very long role and service names that caused task ids to go over this limit. Bugs closed: AURORA-1897 Reviewed at https://reviews.apache.org/r/59957/ .../scheduler/thrift/SchedulerThriftInterface.java | 22 - .../thrift/SchedulerThriftInterfaceTest.java | 99 -- 2 files changed, 121 deletions(-) {noformat} > Remove task length restrictions. > > > Key: AURORA-1897 > URL: https://issues.apache.org/jira/browse/AURORA-1897 > Project: Aurora > Issue Type: Task >Reporter: Zameer Manji >Assignee: Zameer Manji >Priority: Minor > > Currently we restrict the total length of a task's name because of a Mesos bug: > {noformat} > // This number is derived from the maximum file name length limit on most > UNIX systems, less > // the number of characters we've observed being added by mesos for the > executor ID, prefix, and > // delimiters. 
> @VisibleForTesting > static final int MAX_TASK_ID_LENGTH = 255 - 90; > > // TODO(maximk): This is a short-term hack to stop the bleeding from > // https://issues.apache.org/jira/browse/MESOS-691 > if (taskIdGenerator.generate(task, totalInstances).length() > > MAX_TASK_ID_LENGTH) { > throw new TaskValidationException( > "Task ID is too long, please shorten your role or job name."); > } > {noformat} > However [~codyg] recently > [asked|https://lists.apache.org/thread.html/ca92420fe6394d6467f70989e1ffadac23775e84cf7356ff8c9efdd5@%3Cdev.mesos.apache.org%3E] > on the mesos mailing list about MESOS-691 and learned that it is no longer > valid. > We should remove this restriction. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AURORA-1933) Scheduler can process rescind before offer
Zameer Manji created AURORA-1933: Summary: Scheduler can process rescind before offer Key: AURORA-1933 URL: https://issues.apache.org/jira/browse/AURORA-1933 Project: Aurora Issue Type: Bug Reporter: Zameer Manji Assignee: Zameer Manji I observed the following in production: {noformat} Jun 6 00:31:32 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:32.510 [Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 Jun 6 00:31:32 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:32.903 [SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 Jun 6 00:31:34 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:34.815 [TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH] {noformat} Notice the rescind was processed before the offer was given. This means the offer is in the offer storage, but using it is invalid. It will cause whatever task launched with it to fail with {{Task launched with invalid offers: Offer 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 is no longer valid}} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
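One way to close the race described above is to remember rescinds that arrive before their offers, so a late-arriving offer is dropped instead of being stored and later failing with "no longer valid". A minimal sketch; the class and method names are hypothetical and do not reflect Aurora's actual offer storage:

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: track rescinds seen before their offers so the
// out-of-order offer is never placed into offer storage.
public class OfferTracker {
    private final Set<String> live = new HashSet<>();
    private final Set<String> rescindedBeforeArrival = new HashSet<>();

    // Returns true if the offer was accepted into storage.
    public synchronized boolean addOffer(String offerId) {
        if (rescindedBeforeArrival.remove(offerId)) {
            return false; // rescind already processed; drop this offer
        }
        live.add(offerId);
        return true;
    }

    public synchronized void rescind(String offerId) {
        if (!live.remove(offerId)) {
            // Rescind arrived first; remember it so the offer is ignored later.
            rescindedBeforeArrival.add(offerId);
        }
    }

    public synchronized boolean isLive(String offerId) {
        return live.contains(offerId);
    }
}
```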
[jira] [Created] (AURORA-1914) Unable to specify multiple volumes per task.
Zameer Manji created AURORA-1914: Summary: Unable to specify multiple volumes per task. Key: AURORA-1914 URL: https://issues.apache.org/jira/browse/AURORA-1914 Project: Aurora Issue Type: Bug Reporter: Zameer Manji There is an artificial constraint in the schema which prevents multiple volumes per task. This was not caught before in testing. Removing the constraint should solve the problem. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (AURORA-1914) Unable to specify multiple volumes per task.
[ https://issues.apache.org/jira/browse/AURORA-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji reassigned AURORA-1914: Assignee: Zameer Manji > Unable to specify multiple volumes per task. > > > Key: AURORA-1914 > URL: https://issues.apache.org/jira/browse/AURORA-1914 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji > > There is an artificial constraint in the schema which prevents multiple > volumes per task. This was not caught before in testing. Removing the > constraint should solve the problem. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (AURORA-1911) HTTP Scheduler Driver does not reliably re-subscribe
[ https://issues.apache.org/jira/browse/AURORA-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948114#comment-15948114 ] Zameer Manji commented on AURORA-1911: -- First part here: https://reviews.apache.org/r/58053/ > HTTP Scheduler Driver does not reliably re-subscribe > > > Key: AURORA-1911 > URL: https://issues.apache.org/jira/browse/AURORA-1911 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji > > I observed this issue in a large production cluster during a period of Mesos > Master instability: > 1. Mesos master crashes or restarts. > 2. {{V1Mesos}} driver detects this and reconnects. > 3. Aurora does the {{SUBSCRIBE}} call again. > 4. The {{SUBSCRIBE}} call fails silently in the driver. > 5. All future calls are silently dropped by the driver. > 6. Aurora has no offers because it is not subscribed. > Logs: > {noformat} > I0328 19:40:55.473546 101404 scheduler.cpp:353] Connected with the master at > http://10.162.14.30:5050/master/api/v1/scheduler > W0328 19:40:55.475898 101410 scheduler.cpp:583] Received '503 Service > Unavailable' () for SUBSCRIBE > > W0328 19:40:58.862393 101398 scheduler.cpp:508] Dropping KILL: Scheduler is > in state CONNECTED > > W0328 19:41:14.588474 101394 scheduler.cpp:508] Dropping KILL: Scheduler is > in state CONNECTED > > W0328 19:41:37.763464 101402 scheduler.cpp:508] Dropping KILL: Scheduler is > in state CONNECTED > ... > {noformat} > To fix this, the {{VersionedSchedulerDriver}} needs to do two things: > 1. Block calls when unsubscribed, not just disconnected. > 2. Retry the {{SUBSCRIBE}} call repeatedly with exponential backoff. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
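The second fix, retrying {{SUBSCRIBE}} with exponential backoff, boils down to a capped doubling delay between attempts. A sketch under assumed names and constants, not the actual driver code:

```java
// Hypothetical sketch of the backoff policy described above: the delay before
// the nth retry doubles from an initial value and is capped at a maximum.
public class SubscribeBackoff {
    // attempt is 0-indexed: attempt 0 waits initialMs, attempt 1 waits
    // 2 * initialMs, and so on, never exceeding maxMs.
    public static long delayMs(int attempt, long initialMs, long maxMs) {
        double delay = initialMs * Math.pow(2, attempt);
        return (long) Math.min(delay, (double) maxMs);
    }
}
```

The retry loop itself would sleep for `delayMs(attempt, ...)` after each failed SUBSCRIBE and reset `attempt` once subscribed.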
[jira] [Assigned] (AURORA-1912) DbSnapShot may remove enum values
[ https://issues.apache.org/jira/browse/AURORA-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji reassigned AURORA-1912: Assignee: Zameer Manji > DbSnapShot may remove enum values > - > > Key: AURORA-1912 > URL: https://issues.apache.org/jira/browse/AURORA-1912 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji > > The DbSnapshot restore may truncate enum tables and cause referential > integrity issues. From the code, it restores from the SQL dump by first > dropping all tables: > {noformat} > try (Connection c = ((DataSource) > store.getUnsafeStoreAccess()).getConnection()) { > LOG.info("Dropping all tables"); > try (PreparedStatement drop = c.prepareStatement("DROP ALL > OBJECTS")) { > drop.executeUpdate(); > } > {noformat} > However, a freshly started leader will have some data in there from preparing > the storage: > {noformat} > @Override > @Transactional > protected void startUp() throws IOException { > Configuration configuration = sessionFactory.getConfiguration(); > String createStatementName = "create_tables"; > configuration.setMapUnderscoreToCamelCase(true); > // The ReuseExecutor will cache jdbc Statements with equivalent SQL, > improving performance > // slightly when redundant queries are made. 
> configuration.setDefaultExecutorType(ExecutorType.REUSE); > addMappedStatement( > configuration, > createStatementName, > CharStreams.toString(new InputStreamReader( > DbStorage.class.getResourceAsStream("schema.sql"), > StandardCharsets.UTF_8))); > try (SqlSession session = sessionFactory.openSession()) { > session.update(createStatementName); > } > for (CronCollisionPolicy policy : CronCollisionPolicy.values()) { > enumValueMapper.addEnumValue("cron_policies", policy.getValue(), > policy.name()); > } > for (MaintenanceMode mode : MaintenanceMode.values()) { > enumValueMapper.addEnumValue("maintenance_modes", mode.getValue(), > mode.name()); > } > for (JobUpdateStatus status : JobUpdateStatus.values()) { > enumValueMapper.addEnumValue("job_update_statuses", status.getValue(), > status.name()); > } > for (JobUpdateAction action : JobUpdateAction.values()) { > enumValueMapper.addEnumValue("job_instance_update_actions", > action.getValue(), action.name()); > } > for (ScheduleStatus status : ScheduleStatus.values()) { > enumValueMapper.addEnumValue("task_states", status.getValue(), > status.name()); > } > for (ResourceType resourceType : ResourceType.values()) { > enumValueMapper.addEnumValue("resource_types", resourceType.getValue(), > resourceType.name()); > } > for (Mode mode : Mode.values()) { > enumValueMapper.addEnumValue("volume_modes", mode.getValue(), > mode.name()); > } > createPoolMetrics(); > } > {noformat} > Consider the case where we add a new value to an existing enum. This means > restoring from a snapshot will not allow us to have that value in the enum > table. > To fix this we should have a migration for every enum value we add. However > to me it seems that the better idea would be to update the enum tables after > we restore from a snapshot. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1912) DbSnapShot may remove enum values
Zameer Manji created AURORA-1912: Summary: DbSnapShot may remove enum values Key: AURORA-1912 URL: https://issues.apache.org/jira/browse/AURORA-1912 Project: Aurora Issue Type: Bug Reporter: Zameer Manji The DbSnapshot restore may truncate enum tables and cause referential integrity issues. From the code, it restores from the SQL dump by first dropping all tables: {noformat} try (Connection c = ((DataSource) store.getUnsafeStoreAccess()).getConnection()) { LOG.info("Dropping all tables"); try (PreparedStatement drop = c.prepareStatement("DROP ALL OBJECTS")) { drop.executeUpdate(); } {noformat} However, a freshly started leader will have some data in there from preparing the storage: {noformat} @Override @Transactional protected void startUp() throws IOException { Configuration configuration = sessionFactory.getConfiguration(); String createStatementName = "create_tables"; configuration.setMapUnderscoreToCamelCase(true); // The ReuseExecutor will cache jdbc Statements with equivalent SQL, improving performance // slightly when redundant queries are made. 
configuration.setDefaultExecutorType(ExecutorType.REUSE); addMappedStatement( configuration, createStatementName, CharStreams.toString(new InputStreamReader( DbStorage.class.getResourceAsStream("schema.sql"), StandardCharsets.UTF_8))); try (SqlSession session = sessionFactory.openSession()) { session.update(createStatementName); } for (CronCollisionPolicy policy : CronCollisionPolicy.values()) { enumValueMapper.addEnumValue("cron_policies", policy.getValue(), policy.name()); } for (MaintenanceMode mode : MaintenanceMode.values()) { enumValueMapper.addEnumValue("maintenance_modes", mode.getValue(), mode.name()); } for (JobUpdateStatus status : JobUpdateStatus.values()) { enumValueMapper.addEnumValue("job_update_statuses", status.getValue(), status.name()); } for (JobUpdateAction action : JobUpdateAction.values()) { enumValueMapper.addEnumValue("job_instance_update_actions", action.getValue(), action.name()); } for (ScheduleStatus status : ScheduleStatus.values()) { enumValueMapper.addEnumValue("task_states", status.getValue(), status.name()); } for (ResourceType resourceType : ResourceType.values()) { enumValueMapper.addEnumValue("resource_types", resourceType.getValue(), resourceType.name()); } for (Mode mode : Mode.values()) { enumValueMapper.addEnumValue("volume_modes", mode.getValue(), mode.name()); } createPoolMetrics(); } {noformat} Consider the case where we add a new value to an existing enum. This means restoring from a snapshot will not allow us to have that value in the enum table. To fix this we should have a migration for every enum value we add. However to me it seems that the better idea would be to update the enum tables after we restore from a snapshot. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
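The suggested fix, updating the enum tables after restoring from a snapshot, can be modeled in miniature with a plain map standing in for an enum table. Everything here is illustrative; {{EnumRepopulator}} and its {{Mode}} enum are stand-ins, not Aurora classes:

```java
import java.util.Map;

// Sketch of the proposed fix: after "DROP ALL OBJECTS" and replaying the
// snapshot's SQL dump, re-insert every known enum value so values added to
// the code after the snapshot was taken are present again.
public class EnumRepopulator {
    // Stand-in for one of the real enums iterated in startUp() above.
    enum Mode { RO, RW }

    // volumeModes models the volume_modes table as id -> name.
    static void repopulate(Map<Integer, String> volumeModes) {
        for (Mode m : Mode.values()) {
            // putIfAbsent keeps rows the snapshot already restored and adds
            // any value the old snapshot predates.
            volumeModes.putIfAbsent(m.ordinal(), m.name());
        }
    }
}
```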
[jira] [Updated] (AURORA-1910) framework_registered metric isn't reset when scheduler disconnects
[ https://issues.apache.org/jira/browse/AURORA-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji updated AURORA-1910: - Summary: framework_registered metric isn't reset when scheduler disconnects (was: framework_registered metric doesn't reset when scheduler disconnects) > framework_registered metric isn't reset when scheduler disconnects > -- > > Key: AURORA-1910 > URL: https://issues.apache.org/jira/browse/AURORA-1910 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji > > Right now the {{framework_registered}} metric transitions from 0 -> 1 when > the scheduler registers successfully the first time. It never transitions > from 1 -> 0 when it loses a connection. > This metric is already a gauge of an {{AtomicBoolean}}. We should adjust the > gauge as the scheduler loses registration and re-registers. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1911) HTTP Scheduler Driver does not reliably re-subscribe
Zameer Manji created AURORA-1911: Summary: HTTP Scheduler Driver does not reliably re-subscribe Key: AURORA-1911 URL: https://issues.apache.org/jira/browse/AURORA-1911 Project: Aurora Issue Type: Bug Reporter: Zameer Manji Assignee: Zameer Manji I observed this issue in a large production cluster during a period of Mesos Master instability: 1. Mesos master crashes or restarts. 2. {{V1Mesos}} driver detects this and reconnects. 3. Aurora does the {{SUBSCRIBE}} call again. 4. The {{SUBSCRIBE}} call fails silently in the driver. 5. All future calls are silently dropped by the driver. 6. Aurora has no offers because it is not subscribed. Logs: {noformat} I0328 19:40:55.473546 101404 scheduler.cpp:353] Connected with the master at http://10.162.14.30:5050/master/api/v1/scheduler W0328 19:40:55.475898 101410 scheduler.cpp:583] Received '503 Service Unavailable' () for SUBSCRIBE W0328 19:40:58.862393 101398 scheduler.cpp:508] Dropping KILL: Scheduler is in state CONNECTED W0328 19:41:14.588474 101394 scheduler.cpp:508] Dropping KILL: Scheduler is in state CONNECTED W0328 19:41:37.763464 101402 scheduler.cpp:508] Dropping KILL: Scheduler is in state CONNECTED ... {noformat} To fix this, the {{VersionedSchedulerDriver}} needs to do two things: 1. Block calls when unsubscribed, not just disconnected. 2. Retry the {{SUBSCRIBE}} call repeatedly with exponential backoff. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1910) framework_registered metric doesn't reset when scheduler disconnects
Zameer Manji created AURORA-1910: Summary: framework_registered metric doesn't reset when scheduler disconnects Key: AURORA-1910 URL: https://issues.apache.org/jira/browse/AURORA-1910 Project: Aurora Issue Type: Bug Reporter: Zameer Manji Right now the {{framework_registered}} metric transitions from 0 -> 1 when the scheduler registers successfully the first time. It never transitions from 1 -> 0 when it loses a connection. This metric is already a gauge of an {{AtomicBoolean}}. We should adjust the gauge as the scheduler loses registration and re-registers. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
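The proposed behavior is simply that the {{AtomicBoolean}} behind the gauge flips both ways as registration is gained and lost. A small sketch with hypothetical handler names; the real Aurora callback signatures differ:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch of the desired framework_registered behavior: the gauge transitions
// 0 -> 1 on registration AND 1 -> 0 on disconnect, not just the former.
public class RegistrationGauge {
    private final AtomicBoolean registered = new AtomicBoolean(false);

    public void onRegistered()   { registered.set(true);  } // 0 -> 1
    public void onDisconnected() { registered.set(false); } // 1 -> 0 (the missing transition)

    // What the stats exporter would read.
    public int value() { return registered.get() ? 1 : 0; }
}
```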
[jira] [Assigned] (AURORA-1910) framework_registered metric doesn't reset when scheduler disconnects
[ https://issues.apache.org/jira/browse/AURORA-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji reassigned AURORA-1910: Assignee: Zameer Manji > framework_registered metric doesn't reset when scheduler disconnects > > > Key: AURORA-1910 > URL: https://issues.apache.org/jira/browse/AURORA-1910 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji > > Right now the {{framework_registered}} metric transitions from 0 -> 1 when > the scheduler registers successfully the first time. It never transitions > from 1 -> 0 when it loses a connection. > This metric is already a gauge of an {{AtomicBoolean}}. We should adjust the > gauge as the scheduler loses registration and re-registers. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (AURORA-1908) Short-circuit preemption filtering when a Veto applies to entire host
[ https://issues.apache.org/jira/browse/AURORA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937282#comment-15937282 ] Zameer Manji commented on AURORA-1908: -- We label {{Vetos}} with a {{VetoType}} which is {{STATIC}} or {{DYNAMIC}}. To me this can be generalized to short-circuit if all of the vetoes are {{STATIC}} > Short-circuit preemption filtering when a Veto applies to entire host > - > > Key: AURORA-1908 > URL: https://issues.apache.org/jira/browse/AURORA-1908 > Project: Aurora > Issue Type: Task >Reporter: Santhosh Kumar Shanmugham >Priority: Minor > > When matching a {{ResourceRequest}} against a {{UnusedResource}} in > {{PreemptionVictimFilter.filterPreemptionVictims}} there are 4 kinds of > {{Veto}}es that can be returned. 3 out of the 4 {{Veto}}es apply to the > entire host (namely {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, > {{LIMIT_NOT_SATISFIED}} or {{CONSTRAINT_MISMATCH}}). In this case we can > short-circuit, return early, and move on to the next host to consider. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
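The suggested generalization, skipping the rest of a host once every veto is {{STATIC}}, reduces to a single predicate. A sketch; {{VetoFilter.shouldSkipHost}} is a hypothetical helper, and only the {{VetoType}} enum mirrors the comment above:

```java
import java.util.Collection;

// Sketch of the proposed short-circuit: if every veto returned for a victim
// is STATIC (i.e. applies to the whole host rather than the candidate task),
// no other task on this host can be a viable preemption victim either.
public class VetoFilter {
    enum VetoType { STATIC, DYNAMIC }

    static boolean shouldSkipHost(Collection<VetoType> vetoes) {
        return !vetoes.isEmpty()
            && vetoes.stream().allMatch(v -> v == VetoType.STATIC);
    }
}
```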
[jira] [Assigned] (AURORA-1905) Set "webui_url" field of FrameworkInfo
[ https://issues.apache.org/jira/browse/AURORA-1905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji reassigned AURORA-1905: Assignee: Zameer Manji https://reviews.apache.org/r/57708/ > Set "webui_url" field of FrameworkInfo > -- > > Key: AURORA-1905 > URL: https://issues.apache.org/jira/browse/AURORA-1905 > Project: Aurora > Issue Type: Task >Reporter: Zameer Manji >Assignee: Zameer Manji > > Aurora should set the {{webui_url}} field of FrameworkInfo so the Mesos UI > can link to the current leader. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1906) aurora update info command should print out update metadata
Zameer Manji created AURORA-1906: Summary: aurora update info command should print out update metadata Key: AURORA-1906 URL: https://issues.apache.org/jira/browse/AURORA-1906 Project: Aurora Issue Type: Bug Reporter: Zameer Manji AURORA-1711 added metadata fields to the update request. The CLI should allow users to inspect that metadata. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1905) Set "webui_url" field of FrameworkInfo
Zameer Manji created AURORA-1905: Summary: Set "webui_url" field of FrameworkInfo Key: AURORA-1905 URL: https://issues.apache.org/jira/browse/AURORA-1905 Project: Aurora Issue Type: Task Reporter: Zameer Manji Aurora should set the {{webui_url}} field of FrameworkInfo so the Mesos UI can link to the current leader. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1904) Support Mesos Maintenance
Zameer Manji created AURORA-1904: Summary: Support Mesos Maintenance Key: AURORA-1904 URL: https://issues.apache.org/jira/browse/AURORA-1904 Project: Aurora Issue Type: Task Reporter: Zameer Manji Priority: Minor Support Mesos Maintenance primitives in Aurora per the design [doc|https://docs.google.com/document/d/1Z7dFAm6I1nrBE9S5WHw0D0LApBumkIbHrk0-ceoD2YI/edit#heading=h.n5tvzjaj9llx]. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (AURORA-1904) Support Mesos Maintenance
[ https://issues.apache.org/jira/browse/AURORA-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji reassigned AURORA-1904: Assignee: Zameer Manji > Support Mesos Maintenance > - > > Key: AURORA-1904 > URL: https://issues.apache.org/jira/browse/AURORA-1904 > Project: Aurora > Issue Type: Task >Reporter: Zameer Manji >Assignee: Zameer Manji >Priority: Minor > > Support Mesos Maintenance primitives in Aurora per the design > [doc|https://docs.google.com/document/d/1Z7dFAm6I1nrBE9S5WHw0D0LApBumkIbHrk0-ceoD2YI/edit#heading=h.n5tvzjaj9llx]. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (AURORA-1903) Allow for RootFs to be set for mesos filesystem tasks
[ https://issues.apache.org/jira/browse/AURORA-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905912#comment-15905912 ] Zameer Manji commented on AURORA-1903: -- https://reviews.apache.org/r/57524/ > Allow for RootFs to be set for mesos filesystem tasks > - > > Key: AURORA-1903 > URL: https://issues.apache.org/jira/browse/AURORA-1903 > Project: Aurora > Issue Type: Task >Reporter: Zameer Manji >Assignee: Zameer Manji > Attachments: table.png > > > Currently, when a TaskConfig is for a Mesos container task and has an > image, we place the image as a volume mounted at {{taskfs}} in the > sandbox. Thermos, or other executors, are launched outside the image and then > are expected to chroot into the {{taskfs}} directory. > However, I think it would be a fine addition to allow executors to set the > {{image}} property of the Mesos container instead of putting the image as a > volume. This enables some tasks to get around a limitation of the > MesosContainerizer where certain container paths must already exist in the > image and the host. > See the > [documentation|http://mesos.apache.org/documentation/latest/docker-volume/] > for the table that describes this. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1903) Allow for RootFs to be set for mesos filesystem tasks
[ https://issues.apache.org/jira/browse/AURORA-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji updated AURORA-1903: - Attachment: table.png > Allow for RootFs to be set for mesos filesystem tasks > - > > Key: AURORA-1903 > URL: https://issues.apache.org/jira/browse/AURORA-1903 > Project: Aurora > Issue Type: Task >Reporter: Zameer Manji >Assignee: Zameer Manji > Attachments: table.png > > > Currently, when a TaskConfig is for a Mesos container task and has an > image, we place the image as a volume mounted at {{taskfs}} in the > sandbox. Thermos, or other executors, are launched outside the image and then > are expected to chroot into the {{taskfs}} directory. > However, I think it would be a fine addition to allow executors to set the > {{image}} property of the Mesos container instead of putting the image as a > volume. This enables some tasks to get around a limitation of the > MesosContainerizer where certain container paths must already exist in the > image and the host. > See the > [documentation|http://mesos.apache.org/documentation/latest/docker-volume/] > for the table that describes this. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1903) Allow for RootFs to be set for mesos filesystem tasks
Zameer Manji created AURORA-1903: Summary: Allow for RootFs to be set for mesos filesystem tasks Key: AURORA-1903 URL: https://issues.apache.org/jira/browse/AURORA-1903 Project: Aurora Issue Type: Task Reporter: Zameer Manji Assignee: Zameer Manji Currently, when a TaskConfig is for a Mesos container task and has an image, we place the image as a volume mounted at {{taskfs}} in the sandbox. Thermos, or other executors, are launched outside the image and are then expected to chroot into the {{taskfs}} directory. However, I think it would be a fine addition to allow executors to set the {{image}} property of the Mesos container instead of putting the image as a volume. This enables some tasks to get around a limitation of the MesosContainerizer where certain container paths must already exist in the image and the host. See the [documentation|http://mesos.apache.org/documentation/latest/docker-volume/] for the table that describes this. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (AURORA-1902) Docker containers with an older OS fail to run
[ https://issues.apache.org/jira/browse/AURORA-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901918#comment-15901918 ] Zameer Manji commented on AURORA-1902: -- This is a known flaw/limitation with Mesos and the DockerContainerizer. Mesos will copy/mount the executor into the docker filesystem, meaning that the filesystem needs to be capable of launching the executor. In our case it needs to have Python 2.7 and the dependencies for libmesos. Tasks launched with the MesosContainerizer do not suffer from this limitation. > Docker containers with an older OS fail to run > - > > Key: AURORA-1902 > URL: https://issues.apache.org/jira/browse/AURORA-1902 > Project: Aurora > Issue Type: Bug > Components: Docker, Executor >Affects Versions: 0.17.0 > Environment: Ubuntu: 16.04 > Mesos: 1.1.0 > Aurora: 0.17.0 > Dockerengine: 1.13.1 >Reporter: Mikhail Lesyk > > When trying to launch Docker containers, I got an error: > {code} > I0308 21:47:56.695737 3888 fetcher.cpp:498] Fetcher Info: > {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/7cbc133f-24ac-4937-aa28-09e8c81b647b-S7","items":[{"action":"BYPASS_CACHE","uri":{"executable":true,"extract":true,"value":"\/usr\/share\/aurora\/bin\/thermos_executor.pex"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/7cbc133f-24ac-4937-aa28-09e8c81b647b-S7\/frameworks\/47934424-623f-4fcb-9326-bf668149fc77-\/executors\/thermos-root-prod-test-0-a2c19f58-aa6c-45d8-a47f-8cf57dc0c261\/runs\/788d2f72-a6eb-4f3e-999c-17158e473661"} > I0308 21:47:56.701079 3888 fetcher.cpp:409] Fetching URI > '/usr/share/aurora/bin/thermos_executor.pex' > I0308 21:47:56.701162 3888 fetcher.cpp:250] Fetching directly into the > sandbox directory > I0308 21:47:56.701225 3888 fetcher.cpp:187] Fetching URI > '/usr/share/aurora/bin/thermos_executor.pex' > I0308 21:47:56.701282 3888 fetcher.cpp:167] Copying resource with command:cp > '/usr/share/aurora/bin/thermos_executor.pex' > 
'/var/lib/mesos/slaves/7cbc133f-24ac-4937-aa28-09e8c81b647b-S7/frameworks/47934424-623f-4fcb-9326-bf668149fc77-/executors/thermos-root-prod-test-0-a2c19f58-aa6c-45d8-a47f-8cf57dc0c261/runs/788d2f72-a6eb-4f3e-999c-17158e473661/thermos_executor.pex' > I0308 21:47:56.730024 3888 fetcher.cpp:547] Fetched > '/usr/share/aurora/bin/thermos_executor.pex' to > '/var/lib/mesos/slaves/7cbc133f-24ac-4937-aa28-09e8c81b647b-S7/frameworks/47934424-623f-4fcb-9326-bf668149fc77-/executors/thermos-root-prod-test-0-a2c19f58-aa6c-45d8-a47f-8cf57dc0c261/runs/788d2f72-a6eb-4f3e-999c-17158e473661/thermos_executor.pex' > WARNING: Your kernel does not support swap limit capabilities or the cgroup > is not mounted. Memory limited without swap. > Traceback (most recent call last): > File "apache/aurora/executor/bin/thermos_executor_main.py", line 45, in > > from mesos.executor import MesosExecutorDriver > File > "/root/.pex/install/mesos.executor-1.1.0-py2.7-linux-x86_64.egg.47fa022c99c11c7faddf379cbfc46a25c5f215be/mesos.executor-1.1.0-py2.7-linux-x86_64.egg/mesos/executor/__init__.py", > line 17, in > from ._executor import MesosExecutorDriverImpl as MesosExecutorDriver > ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version > `GLIBCXX_3.4.20' not found (required by > /root/.pex/install/mesos.executor-1.1.0-py2.7-linux-x86_64.egg.47fa022c99c11c7faddf379cbfc46a25c5f215be/mesos.executor-1.1.0-py2.7-linux-x86_64.egg/mesos/executor/_executor.so) > twitter.common.app debug: Initializing: twitter.common.log (Logging > subsystem.) > Writing log files to disk in /mnt/mesos/sandbox > thermos_executor.pex: error: Could not load MesosExecutorDriver! > twitter.common.app debug: main sys.exited > twitter.common.app debug: Shutting application down. > twitter.common.app debug: Running exit function for twitter.common.log > (Logging subsystem.) > twitter.common.app debug: Finishing up module teardown. 
> twitter.common.app debug: Active thread: <_MainThread(MainThread, started > 140218447816512)> > twitter.common.app debug: Exiting cleanly. > {code} > Affected systems tested (missing GLIBCXX_3.4.20 and GLIBCXX_3.4.21): > Debian 8 > Ubuntu 14.04 > How to reproduce: > 1) Prepare a Docker image with Python 2.7. Example Dockerfile: > {code} > FROM ubuntu:14.04 > RUN apt-get -y update && apt-get -y install python2.7 > {code} > 2) Build and push the image to some repo, for example: > {code} > docker build -t mlesyk/ubuntu:14.04 . && docker push mlesyk/ubuntu:14.04 > {code} > 3) Create a job that uses a Docker container with any command to run, for > example, > {code} > sleep 60 > {code} > and an appropriate container parameter, for example: > {code} > container = Docker(image='mlesyk/ubuntu:14.04') > {code} > 4) Run this job in Aurora and observe the error from the beginning of the ticket -- This message was sent by Atlassian
[jira] [Commented] (AURORA-1899) Expose per role metrics around Thrift activity
[ https://issues.apache.org/jira/browse/AURORA-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894777#comment-15894777 ] Zameer Manji commented on AURORA-1899: -- I support this idea, and we can put it behind a flag like what we do for various kinds of SLA metrics. [~StephanErb]: Consider the case where a single role/user launches 30k 10k non-prod tasks at the same time. You can observe the aggregate change in the current metrics, but only the logs will tell you who did it. > Expose per role metrics around Thrift activity > -- > > Key: AURORA-1899 > URL: https://issues.apache.org/jira/browse/AURORA-1899 > Project: Aurora > Issue Type: Task >Reporter: David McLaughlin > > It's currently pretty easy for a single client to cause havoc on an Aurora > cluster. We triage most of these issues by grepping the Scheduler logs for > Thrift API calls and finding patterns around role names. > Figuring out what changed would be a lot easier if we could take the current > Thrift API metrics and export an additional metric for each one that is > scoped by the role. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
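The proposed per-role export could hang off the existing Thrift metrics by keying a lazily created counter on the (RPC, role) pair. A sketch with an assumed stat-name scheme, not Aurora's actual stats API:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of per-role Thrift stats: one counter per (rpc, role), created
// lazily on first use. The "thrift.<rpc>.<role>" naming is an assumption.
public class PerRoleStats {
    private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

    // Bump the counter for this RPC and role, returning the new count.
    public long increment(String rpc, String role) {
        String name = "thrift." + rpc + "." + role;
        return counters.computeIfAbsent(name, k -> new AtomicLong())
                       .incrementAndGet();
    }
}
```

Guarding this behind a flag, as suggested above, matters because the counter set grows with the number of distinct roles.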
[jira] [Commented] (AURORA-1887) Create Driver implementation around V0Mesos.
[ https://issues.apache.org/jira/browse/AURORA-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893209#comment-15893209 ] Zameer Manji commented on AURORA-1887: -- {noformat} commit 705dbc7cd7c3ff477bcf766cdafe49a68ab47dee Author: Zameer Manji Date: Thu Mar 2 15:07:11 2017 -0800 Enable Mesos HTTP API. This patch completes the design doc[1] and enables operators to choose between two V1 Mesos API implementations. The first is `V0Mesos` which offers the V1 API backed by the scheduler driver and the second is `V1Mesos` which offers the V1 API backed by a new HTTP API implementation. There are three sets of changes in this patch. First, the V1 Mesos code requires a Scheduler callback with a different API. To maximize code reuse, event handling logic was extracted into a `MesosCallbackHandler` class. `VersionedMesosSchedulerImpl` was created to implement the new callback interface. Both callbacks now use the handler class for logic. Second, a new driver implementation using the new API was created. All of the logic for the new driver is encapsulated in the `VersionedSchedulerDriverService` class. Third, some wiring changes were done to allow Guice to do its work and allow operators to select between the different driver implementations. [1] https://docs.google.com/document/d/1bWK8ldaQSsRXvdKwTh8tyR_0qMxAlnMW70eOKoU3myo Testing Done: The e2e test has been run three times, each time with a different driver option. 
Bugs closed: AURORA-1887, AURORA-1888 Reviewed at https://reviews.apache.org/r/57061/ RELEASE-NOTES.md | 7 + examples/vagrant/upstart/aurora-scheduler.conf | 5 +- .../aurora/benchmark/StatusUpdateBenchmark.java| 6 +- .../org/apache/aurora/scheduler/app/AppModule.java | 12 +- .../apache/aurora/scheduler/app/SchedulerMain.java | 22 +- .../scheduler/mesos/LibMesosLoadingModule.java | 29 +- .../scheduler/mesos/MesosCallbackHandler.java | 288 ++ .../aurora/scheduler/mesos/MesosSchedulerImpl.java | 212 +- .../aurora/scheduler/mesos/ProtosConversion.java | 28 ++ .../scheduler/mesos/SchedulerDriverModule.java | 50 ++- ...dingModule.java => VersionedDriverFactory.java} | 20 +- .../mesos/VersionedMesosSchedulerImpl.java | 198 ++ .../mesos/VersionedSchedulerDriverService.java | 254 .../apache/aurora/scheduler/app/SchedulerIT.java | 7 +- .../scheduler/mesos/MesosCallbackHandlerTest.java | 430 + .../scheduler/mesos/MesosSchedulerImplTest.java| 424 .../mesos/VersionedMesosSchedulerImplTest.java | 275 + .../mesos/VersionedSchedulerDriverServiceTest.java | 194 ++ .../apache/aurora/scheduler/thrift/ThriftIT.java | 3 +- 19 files changed, 1888 insertions(+), 576 deletions(-) {noformat} > Create Driver implementation around V0Mesos. > > > Key: AURORA-1887 > URL: https://issues.apache.org/jira/browse/AURORA-1887 > Project: Aurora > Issue Type: Task >Reporter: Zameer Manji >Assignee: Zameer Manji > > Create an implementation of the {{org.apache.aurora.scheduler.mesos.Driver}} > interface which uses the {{V0Mesos}} shim under the hood. Provide a flag to > switch between the two to show there is no regression. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1897) Remove task length restrictions.
Zameer Manji created AURORA-1897: Summary: Remove task length restrictions. Key: AURORA-1897 URL: https://issues.apache.org/jira/browse/AURORA-1897 Project: Aurora Issue Type: Task Reporter: Zameer Manji Priority: Minor Currently we restrict the total length of a task's name because of a Mesos bug: {noformat} // This number is derived from the maximum file name length limit on most UNIX systems, less // the number of characters we've observed being added by mesos for the executor ID, prefix, and // delimiters. @VisibleForTesting static final int MAX_TASK_ID_LENGTH = 255 - 90; // TODO(maximk): This is a short-term hack to stop the bleeding from // https://issues.apache.org/jira/browse/MESOS-691 if (taskIdGenerator.generate(task, totalInstances).length() > MAX_TASK_ID_LENGTH) { throw new TaskValidationException( "Task ID is too long, please shorten your role or job name."); } {noformat} However, [~codyg] recently [asked|https://lists.apache.org/thread.html/ca92420fe6394d6467f70989e1ffadac23775e84cf7356ff8c9efdd5@%3Cdev.mesos.apache.org%3E] on the Mesos mailing list about MESOS-691 and learned that it is no longer valid. We should remove this restriction. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
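The guard being removed amounts to a length check against a derived constant; the sketch below restates it in runnable form. `generateTaskId` is a simplified stand-in for Aurora's `TaskIdGenerator` (the real id embeds more components and delimiters), shown only to illustrate why long role or job names tripped the limit:

```java
// Sketch of the validation this ticket removes. MAX_TASK_ID_LENGTH and the
// error message come from the quoted scheduler code; the id layout below is
// illustrative, not Aurora's real one.
class TaskIdLengthCheck {
  // 255 = common UNIX filename length limit; ~90 chars were observed being
  // added by Mesos for the executor ID, prefix, and delimiters.
  static final int MAX_TASK_ID_LENGTH = 255 - 90;

  static String generateTaskId(String role, String env, String job, int instance) {
    return role + "-" + env + "-" + job + "-" + instance;
  }

  static void validate(String taskId) {
    if (taskId.length() > MAX_TASK_ID_LENGTH) {
      throw new IllegalArgumentException(
          "Task ID is too long, please shorten your role or job name.");
    }
  }
}
```

Since Mesos no longer derives cgroup paths from the task id, a task id longer than 165 characters is harmless and the check can simply be deleted.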
[jira] [Resolved] (AURORA-1860) Fix bug in scheduler driver disconnect stats
[ https://issues.apache.org/jira/browse/AURORA-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji resolved AURORA-1860. -- Resolution: Fixed {noformat} commit 2652fe02a2255992e187fede2bae8ff6aef2862c Author: Ilya Pronin Date: Mon Feb 27 11:04:54 2017 -0800 Fix scheduler_framework_disconnects stat. Refactoring in r/31550 has disabled incrementing the scheduler_framework_disconnects stat. This change brings it back. Testing Done: Added a check to `MesosSchedulerImplTest.testDisconnected()`. Manually verified in Vagrant by starting/stopping mesos-master and querying `/vars` endpoint. Bugs closed: AURORA-1860 Reviewed at https://reviews.apache.org/r/57074/ .../java/org/apache/aurora/scheduler/mesos/MesosSchedulerImpl.java | 2 +- .../java/org/apache/aurora/scheduler/mesos/MesosSchedulerImplTest.java | 3 ++- 2 files changed, 3 insertions(+), 2 deletions(-) {noformat} > Fix bug in scheduler driver disconnect stats > > > Key: AURORA-1860 > URL: https://issues.apache.org/jira/browse/AURORA-1860 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Mehrdad Nurolahzade >Assignee: Ilya Pronin >Priority: Minor > Labels: newbie > > Correct the refactoring mistake introduced in > [https://reviews.apache.org/r/31550/] that has disabled > {{scheduler_framework_disconnects}} stats: > {code:title=MesosSchedulerImpl.disconnected()} > counters.get("scheduler_framework_disconnects").get(); > {code} -- This message was sent by Atlassian JIRA (v6.3.15#6346)
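The one-character nature of the bug is easy to see with `AtomicLong`: the refactored handler read the counter where it should have incremented it. A minimal sketch (the shape of the `counters` map is an assumption for illustration):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch of the bug fixed above: get() only reads the AtomicLong,
// so the stat never moved; incrementAndGet() actually bumps it.
class DisconnectStats {
  private final Map<String, AtomicLong> counters = new ConcurrentHashMap<>();

  DisconnectStats() {
    counters.put("scheduler_framework_disconnects", new AtomicLong());
  }

  // Buggy version: a no-op read, the stat stays flat across disconnects.
  long disconnectedBuggy() {
    return counters.get("scheduler_framework_disconnects").get();
  }

  // Fixed version: each disconnect event increments the counter.
  long disconnectedFixed() {
    return counters.get("scheduler_framework_disconnects").incrementAndGet();
  }
}
```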
[jira] [Updated] (AURORA-1890) Job Update Pulse History is initialized to no pulses on scheduler recovery
[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji updated AURORA-1890: - Description: I have experienced the following problem with pulse updates. To reproduce: 1. Create an update with a pulse timeout of 1h 2. Send a pulse to get the update going. 3. Failover the scheduler immediately after. 4. Observe that the update is awaiting another pulse right after the failover. This is because the {{JobUpdateControllerImpl}} stores pulse history and state in memory in {{PulseHandler}}. On scheduler startup, the pulse state is reset to no pulse received. We can solve this by inferring the timestamp of the last pulse by inspecting the job update events. was: I have experienced the following problem with pulse updates. To reproduce: 1. Create an update with a pulse timeout of 1h 2. Send a pulse to get the update going. 3. Failover the scheduler immediately after. 4. Observe that the update is awaiting another pulse right after the failover. This is because the {{JobUpdateControllerImpl}} stores pulse history and state in memory in {{PulseHandler}}. On scheduler startup, the pulse state is reset to no pulse received. We can solve this by durably storing the timestamp of the last pulse received in storage. > Job Update Pulse History is initialized to no pulses on scheduler recovery > -- > > Key: AURORA-1890 > URL: https://issues.apache.org/jira/browse/AURORA-1890 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji > > I have experienced the following problem with pulse updates. To reproduce: > 1. Create an update with a pulse timeout of 1h > 2. Send a pulse to get the update going. > 3. Failover the scheduler immediately after. > 4. Observe that the update is awaiting another pulse right after the failover. > This is because the {{JobUpdateControllerImpl}} stores pulse history and > state in memory in {{PulseHandler}}. On scheduler startup, the pulse state is > reset to no pulse received. 
> We can solve this by inferring the timestamp of the last pulse by inspecting > the job update events. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (AURORA-1890) Job Update Pulse History is initialized to no pulses on scheduler recovery
[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji reassigned AURORA-1890: Assignee: Zameer Manji > Job Update Pulse History is initialized to no pulses on scheduler recovery > -- > > Key: AURORA-1890 > URL: https://issues.apache.org/jira/browse/AURORA-1890 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji > > I have experienced the following problem with pulse updates. To reproduce: > 1. Create an update with a pulse timeout of 1h > 2. Send a pulse to get the update going. > 3. Failover the scheduler immediately after. > 4. Observe that the update is awaiting another pulse right after the failover. > This is because the {{JobUpdateControllerImpl}} stores pulse history and > state in memory in {{PulseHandler}}. On scheduler startup, the pulse state is > reset to no pulse received. > We can solve this by inferring the timestamp of the last pulse by inspecting > the job update events. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AURORA-1890) Job Update Pulse History is initialized to no pulses on scheduler recovery
[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji updated AURORA-1890: - Summary: Job Update Pulse History is initialized to no pulses on scheduler recovery (was: Job Update Pulse History is not durably stored) > Job Update Pulse History is initialized to no pulses on scheduler recovery > -- > > Key: AURORA-1890 > URL: https://issues.apache.org/jira/browse/AURORA-1890 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji > > I have experienced the following problem with pulse updates. To reproduce: > 1. Create an update with a pulse timeout of 1h > 2. Send a pulse to get the update going. > 3. Failover the scheduler immediately after. > 4. Observe that the update is awaiting another pulse right after the failover. > This is because the {{JobUpdateControllerImpl}} stores pulse history and > state in memory in {{PulseHandler}}. On scheduler startup, the pulse state is > reset to no pulse received. > We can solve this by durably storing the timestamp of the last pulse received > in storage. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1891) Unable to upgrade Guava
Zameer Manji created AURORA-1891: Summary: Unable to upgrade Guava Key: AURORA-1891 URL: https://issues.apache.org/jira/browse/AURORA-1891 Project: Aurora Issue Type: Bug Reporter: Zameer Manji Priority: Minor Guava 21 is out with better Java 8 integration, but I cannot upgrade us to it. Bumping the dependency results in: {noformat} /Users/zmanji/code/aurora/src/main/java/org/apache/aurora/scheduler/storage/log/WriteAheadStorage.java:82: error: cannot find symbol class WriteAheadStorage extends WriteAheadStorageForwarder implements ^ symbol: class WriteAheadStorageForwarder /Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith': class file for com.google.errorprone.annotations.CompatibleWith not found /Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith' /Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith' /Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith' /Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith' 
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith' /Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith' /Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multiset.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith' /Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multiset.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith' /Users/zmanji/code/aurora/src/main/java/org/apache/aurora/scheduler/storage/log/WriteAheadStorage.java:74: Note: Wrote forwarder org.apache.aurora.scheduler.storage.log.WriteAheadStorageForwarder @Forward({ ^ /Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith': class file for com.google.errorprone.annotations.CompatibleWith not found /Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith' /Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 
'value()' in type 'CompatibleWith' /Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith' /Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith' /Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class): warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
[jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored
[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864598#comment-15864598 ] Zameer Manji commented on AURORA-1890: -- I would be content with initializing the {{PulseState}} timestamp with the timestamp of the most recent event that transitioned from a {{BLOCKED_AWAITING_PULSE}}. I feel this is more correct than what we do now, avoids hashing out some storage changes, and is suitable for my current usecase. If you confirm that you agree, I can rephrase this ticket to better capture what the fix would be. > Job Update Pulse History is not durably stored > -- > > Key: AURORA-1890 > URL: https://issues.apache.org/jira/browse/AURORA-1890 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji > > I have experienced the following problem with pulse updates. To reproduce: > 1. Create an update with a pulse timeout of 1h > 2. Send a pulse to get the update going. > 3. Failover the scheduler immediately after. > 4. Observe that the update is awaiting another pulse right after the failover. > This is because the {{JobUpdateControllerImpl}} stores pulse history and > state in memory in {{PulseHandler}}. On scheduler startup, the pulse state is > reset to no pulse received. > We can solve this by durably storing the timestamp of the last pulse received > in storage. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored
[ https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864537#comment-15864537 ] Zameer Manji commented on AURORA-1890: -- The scheduler does the right thing on first pulse. However, on failover, any coordinated updates are immediately sent to BLOCKED_AWAITING_PULSE. This is because on scheduler startup the pulse state is reset to no pulse received. The code sets the timestamp of the last pulse received to 0L: {noformat} synchronized void initializePulseState(IJobUpdate update, JobUpdateStatus status) { pulseStates.put(update.getSummary().getKey(), new PulseState( status, update.getInstructions().getSettings().getBlockIfNoPulsesAfterMs(), 0L)); } {noformat} Would it be ok to set the timestamp to the first event after the most recent {{BLOCKED_AWAITING_PULSE}}? We know for sure at that point in time that a pulse was received because of the state transition from {{BLOCKED_AWAITING_PULSE}} to some other event. Also, could you describe "significant" write volume? I can imagine that if the pulse interval were in the seconds and there were thousands of updates, it would be too much. However, we could prevent excessively small pulse intervals. > Job Update Pulse History is not durably stored > -- > > Key: AURORA-1890 > URL: https://issues.apache.org/jira/browse/AURORA-1890 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji > > I have experienced the following problem with pulse updates. To reproduce: > 1. Create an update with a pulse timeout of 1h > 2. Send a pulse to get the update going. > 3. Failover the scheduler immediately after. > 4. Observe that the update is awaiting another pulse right after the failover. > This is because the {{JobUpdateControllerImpl}} stores pulse history and > state in memory in {{PulseHandler}}. On scheduler startup, the pulse state is > reset to no pulse received. > We can solve this by durably storing the timestamp of the last pulse received > in storage. 
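The inference proposed above can be sketched as a scan over the persisted job update events: the first event after the most recent {{BLOCKED_AWAITING_PULSE}} marks a received pulse. The `Event` and `Status` types below are simplified stand-ins for Aurora's job update event types, not the real interfaces:

```java
import java.util.List;

// Sketch of the proposed fix: on recovery, seed the in-memory pulse state
// from persisted job update events instead of resetting it to 0L. A
// transition *out of* BLOCKED_AWAITING_PULSE implies a pulse arrived at
// that event's timestamp.
class PulseInference {
  enum Status { ROLLING_FORWARD, BLOCKED_AWAITING_PULSE, ROLLED_FORWARD }

  static final class Event {
    final Status status;
    final long timestampMs;

    Event(Status status, long timestampMs) {
      this.status = status;
      this.timestampMs = timestampMs;
    }
  }

  // Events are assumed ordered oldest-first; returns 0L (the current
  // behavior) when no pulse can be inferred from the history.
  static long inferLastPulseMs(List<Event> events) {
    long lastPulseMs = 0L;
    for (int i = 1; i < events.size(); i++) {
      if (events.get(i - 1).status == Status.BLOCKED_AWAITING_PULSE
          && events.get(i).status != Status.BLOCKED_AWAITING_PULSE) {
        lastPulseMs = events.get(i).timestampMs;
      }
    }
    return lastPulseMs;
  }
}
```

This keeps the write path unchanged (no extra storage writes per pulse), trading a one-time scan at recovery for slightly stale pulse timestamps.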
-- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1890) Job Update Pulse History is not durably stored
Zameer Manji created AURORA-1890: Summary: Job Update Pulse History is not durably stored Key: AURORA-1890 URL: https://issues.apache.org/jira/browse/AURORA-1890 Project: Aurora Issue Type: Bug Reporter: Zameer Manji I have experienced the following problem with pulse updates. To reproduce: 1. Create an update with a pulse timeout of 1h 2. Send a pulse to get the update going. 3. Failover the scheduler immediately after. 4. Observe that the update is awaiting another pulse right after the failover. This is because the {{JobUpdateControllerImpl}} stores pulse history and state in memory in {{PulseHandler}}. On scheduler startup, the pulse state is reset to no pulse received. We can solve this by durably storing the timestamp of the last pulse received in storage. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (AURORA-1846) Add message parameter to killTasks RPC
[ https://issues.apache.org/jira/browse/AURORA-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji resolved AURORA-1846. -- Resolution: Fixed This is fixed on master: {noformat} commit f88b7f3bf5b7a7db6e422e38cbf22cf809f8ff87 Author: Cody Gibb Date: Mon Feb 6 10:43:01 2017 -0800 Add message parameter to killTasks. RPCs such as pauseJobUpdate include a parameter for "a user-specified message to include with the induced job update state change." This diff provides a similar optional parameter for the killTasks RPC, which allows users to indicate the reason why a task was killed, and later inspect that reason when consuming task events. Example usage from Aurora CLI: `$ aurora job killall devcluster/www-data/prod/hello --message "Some message"` In the task event, the supplied message (if provided) is appended to the existing template "Killed by ", separated by a newline. For the above example, this looks like: "Killed by aurora\nSome message". Testing Done: Added a unit test in the scheduler, and a test in the client. Also manually tested using the Vagrant environment. 
Bugs closed: AURORA-1846 Reviewed at https://reviews.apache.org/r/54459/ RELEASE-NOTES.md | 7 +++ .../main/thrift/org/apache/aurora/gen/api.thrift | 2 +- .../aurora/scheduler/thrift/AuditMessages.java | 6 ++- .../scheduler/thrift/SchedulerThriftInterface.java | 8 +++- .../scheduler/thrift/aop/AnnotatedAuroraAdmin.java | 3 +- .../python/apache/aurora/client/api/__init__.py| 4 +- src/main/python/apache/aurora/client/cli/jobs.py | 10 +++-- .../apache/aurora/client/hooks/hooked_api.py | 9 ++-- .../http/api/security/HttpSecurityIT.java | 21 - .../ShiroAuthorizingParamInterceptorTest.java | 4 +- .../aurora/scheduler/thrift/AuditMessagesTest.java | 26 ++- .../thrift/SchedulerThriftInterfaceTest.java | 27 +--- src/test/python/apache/aurora/api_util.py | 2 +- .../aurora/client/api/test_scheduler_client.py | 10 ++--- .../python/apache/aurora/client/cli/test_kill.py | 50 -- .../apache/aurora/client/hooks/test_hooked_api.py | 2 +- .../aurora/client/hooks/test_non_hooked_api.py | 6 +-- .../sh/org/apache/aurora/e2e/test_end_to_end.sh| 10 - 18 files changed, 146 insertions(+), 61 deletions(-) {noformat} > Add message parameter to killTasks RPC > -- > > Key: AURORA-1846 > URL: https://issues.apache.org/jira/browse/AURORA-1846 > Project: Aurora > Issue Type: Task > Components: Client, Scheduler >Affects Versions: 0.16.0 >Reporter: Cody Gibb >Assignee: Cody Gibb >Priority: Minor > > RPC's such as pauseJobUpdate include a parameter for "a user-specified > message to include with the induced job update state change." Having a > similar parameter for killTasks would allow us to indicate the reason why a > task was killed, and later inspect that reason when querying > getTasksWithoutConfigs. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
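The message handling described above can be sketched as follows. The method shape is an assumption for illustration; in the diff this logic lives in `AuditMessages`:

```java
import java.util.Optional;

// Sketch: append the optional user-supplied kill message to the templated
// audit line, separated by a newline, e.g. "Killed by aurora\nSome message".
class KillAuditMessage {
  static String killedBy(String user, Optional<String> message) {
    String base = "Killed by " + user;
    return message.map(m -> base + "\n" + m).orElse(base);
  }
}
```

An absent message leaves the existing audit text unchanged, so callers that never pass `--message` see no behavior difference.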
[jira] [Updated] (AURORA-1886) Migrate Aurora to use V1 protobufs
[ https://issues.apache.org/jira/browse/AURORA-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji updated AURORA-1886: - Issue Type: Task (was: Story) > Migrate Aurora to use V1 protobufs > -- > > Key: AURORA-1886 > URL: https://issues.apache.org/jira/browse/AURORA-1886 > Project: Aurora > Issue Type: Task >Reporter: Zameer Manji > > To migrate to the V1 API, Aurora needs to start using the V1 protobufs. > The Driver interface and Scheduler callback from mesos will accept > unversioned protobufs and convert them when required. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1886) Migrate Aurora to use V1 protobufs
Zameer Manji created AURORA-1886: Summary: Migrate Aurora to use V1 protobufs Key: AURORA-1886 URL: https://issues.apache.org/jira/browse/AURORA-1886 Project: Aurora Issue Type: Story Reporter: Zameer Manji To migrate to the V1 API, Aurora needs to start using the V1 protobufs. The Driver interface and Scheduler callback from mesos will accept unversioned protobufs and convert them when required. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (AURORA-1885) Support the Mesos V1 API
[ https://issues.apache.org/jira/browse/AURORA-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji reassigned AURORA-1885: Assignee: Zameer Manji > Support the Mesos V1 API > > > Key: AURORA-1885 > URL: https://issues.apache.org/jira/browse/AURORA-1885 > Project: Aurora > Issue Type: Epic >Reporter: Zameer Manji >Assignee: Zameer Manji > > This ticket tracks the work outlined in the design doc: > https://docs.google.com/document/d/1bWK8ldaQSsRXvdKwTh8tyR_0qMxAlnMW70eOKoU3myo/edit#heading=h.itk6ht9i1yha -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AURORA-1885) Support the Mesos V1 API
Zameer Manji created AURORA-1885: Summary: Support the Mesos V1 API Key: AURORA-1885 URL: https://issues.apache.org/jira/browse/AURORA-1885 Project: Aurora Issue Type: Epic Reporter: Zameer Manji This ticket tracks the work outlined in the design doc: https://docs.google.com/document/d/1bWK8ldaQSsRXvdKwTh8tyR_0qMxAlnMW70eOKoU3myo/edit#heading=h.itk6ht9i1yha -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (AURORA-1669) Kill twitter/commons ZK libs when Curator replacements are vetted
[ https://issues.apache.org/jira/browse/AURORA-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828697#comment-15828697 ] Zameer Manji commented on AURORA-1669: -- I'm unable to complete the diff; I'm hoping [~jsirois] can guide it to completion. > Kill twitter/commons ZK libs when Curator replacements are vetted > - > > Key: AURORA-1669 > URL: https://issues.apache.org/jira/browse/AURORA-1669 > Project: Aurora > Issue Type: Task >Reporter: John Sirois >Assignee: John Sirois > Fix For: 0.17.0 > > > Once we have reports from production users that the Curator zk plumbing > introduced in AURORA-1468 is working well, the {{-zk_use_curator}} flag > should be deprecated and then the flag and commons code killed. If the > vetting happens before the next release ({{0.14.0}}), we can dispense with a > deprecation cycle. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1799) Thermos does not handle low memory scenarios gracefully
[ https://issues.apache.org/jira/browse/AURORA-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827017#comment-15827017 ] Zameer Manji commented on AURORA-1799: -- Today [~benley] reported something similar in Slack: {noformat} ERROR] Failed to stop health checkers: ERROR] Traceback (most recent call last): File "apache/aurora/executor/aurora_executor.py", line 192, in _shutdown propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT) File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline return deadline(*args, daemon=True, propagate=True, **kw) File "/root/.pex/install/twitter.common.concurrent-0.3.3-py2-none-any.whl.33d9c24da69d7478b4aa6d76f474f3773a61f6f9/twitter.common.concurrent-0.3.3-py2-none-any.whl/twitter/common/concurrent/dead line.py", line 61, in deadline AnonymousThread().start() File "/usr/lib/python2.7/threading.py", line 745, in start _start_new_thread(self.__bootstrap, ()) error: can't start new thread ERROR] Failed to stop runner: ERROR] Traceback (most recent call last): File "apache/aurora/executor/aurora_executor.py", line 200, in _shutdown propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT) File "apache/aurora/executor/aurora_executor.py", line 35, in propagate_deadline return deadline(*args, daemon=True, propagate=True, **kw) File "/root/.pex/install/twitter.common.concurrent-0.3.3-py2-none-any.whl.33d9c24da69d7478b4aa6d76f474f3773a61f6f9/twitter.common.concurrent-0.3.3-py2-none-any.whl/twitter/common/concurrent/dead line.py", line 61, in deadline AnonymousThread().start() File "/usr/lib/python2.7/threading.py", line 745, in start _start_new_thread(self.__bootstrap, ()) error: can't start new thread Traceback (most recent call last): File "/root/.pex/install/twitter.common.exceptions-0.3.3-py2-none-any.whl.57572b1f0a301c36c91adf2c704d0e8dd4d48429/twitter.common.exceptions-0.3.3-py2-none-any.whl/twitter/common/exceptions/__in it__.py", line 126, in 
_excepting_run self.__real_run(*args, **kw) File "apache/aurora/executor/status_manager.py", line 50, in run File "apache/aurora/executor/aurora_executor.py", line 218, in _shutdown File "/root/.pex/install/twitter.common.concurrent-0.3.3-py2-none-any.whl.33d9c24da69d7478b4aa6d76f474f3773a61f6f9/twitter.common.concurrent-0.3.3-py2-none-any.whl/twitter/common/concurrent/defe rred.py", line 56, in defer deferred.start() File "/usr/lib/python2.7/threading.py", line 745, in start _start_new_thread(self.__bootstrap, ()) thread.error: can't start new thread Traceback (most recent call last): File "/root/.pex/install/twitter.common.exceptions-0.3.3-py2-none-any.whl.57572b1f0a301c36c91adf2c704d0e8dd4d48429/twitter.common.exceptions-0.3.3-py2-none-any.whl/twitter/common/exceptions/__in it__.py", line 126, in _excepting_run self.__real_run(*args, **kw) File "apache/thermos/monitoring/resource.py", line 239, in run File "/root/.pex/install/twitter.common.concurrent-0.3.3-py2-none-any.whl.33d9c24da69d7478b4aa6d76f474f3773a61f6f9/twitter.common.concurrent-0.3.3-py2-none-any.whl/twitter/common/concurrent/even t_muxer.py", line 79, in wait thread.start() File "/usr/lib/python2.7/threading.py", line 745, in start _start_new_thread(self.__bootstrap, ()) thread.error: can't start new thread E0116 20:46:46.56877534 socket.hpp:174] Shutdown failed on fd=13: Transport endpoint is not connected [107] E0116 20:46:51.78901634 socket.hpp:174] Shutdown failed on fd=14: Transport endpoint is not connected [107] E0116 20:50:47.90499934 socket.hpp:174] Shutdown failed on fd=13: Transport endpoint is not connected [107] E0116 20:50:48.09745734 socket.hpp:174] Shutdown failed on fd=13: Transport endpoint is not connected [107] E0116 20:50:50.27705334 socket.hpp:174] Shutdown failed on fd=13: Transport endpoint is not connected [107] E0116 20:50:51.00681634 socket.hpp:174] Shutdown failed on fd=13: Transport endpoint is not connected [107] E0116 20:50:51.02212334 socket.hpp:174] Shutdown failed 
on fd=13: Transport endpoint is not connected [107] E0116 20:50:51.24417934 socket.hpp:174] Shutdown failed on fd=13: Transport endpoint is not connected [107] E0116 20:50:55.40700634 socket.hpp:174] Shutdown failed on fd=14: Transport endpoint is not connected [107] E0116 20:50:55.41075934 socket.hpp:174] Shutdown failed on fd=15: Transport endpoint is not connected [107] E0116 20:50:56.70334834 socket.hpp:174] Shutdown failed on fd=14: Transport endpoint is not connected [107] E0116 20:50:56.70747134 socket.hpp:174] Shutdown failed on fd=15: Transport endpoint is not connected [107] E0116 20:50:56.71240634 socket.hpp:174] Shutdown failed on fd=16: Transport endpoint is not connected [107] E0116 20:50:57.05304534 socket.hpp:174] Shutdown failed on fd=14: Transport endpoint is not
[jira] [Commented] (AURORA-1858) Expose stats on offers known to scheduler
[ https://issues.apache.org/jira/browse/AURORA-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749644#comment-15749644 ] Zameer Manji commented on AURORA-1858: -- Isn't this what the "outstanding_offers" metric is? > Expose stats on offers known to scheduler > - > > Key: AURORA-1858 > URL: https://issues.apache.org/jira/browse/AURORA-1858 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Mehrdad Nurolahzade >Priority: Minor > Labels: newbie > > Expose stats on the number of offers tracked by {{OfferManager}}. This can > simply be defined as a collection size gauge on {{offers}} set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
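Whether or not the existing {{outstanding_offers}} stat already covers this, the suggested implementation is just a collection-size gauge: a callback that reads the live set's size at scrape time. A sketch, with `Supplier` standing in for whatever gauge type the stats library exposes:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Sketch of a collection-size gauge over the tracked offers: the gauge is
// evaluated lazily at scrape time, so it always reflects the current set.
class OfferStats {
  private final Set<String> offers = ConcurrentHashMap.newKeySet();

  Supplier<Integer> outstandingOffersGauge() {
    return offers::size;
  }

  void addOffer(String offerId) { offers.add(offerId); }

  void removeOffer(String offerId) { offers.remove(offerId); }
}
```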
[jira] [Updated] (AURORA-1806) Enhance Aurora KILLED message for tasks killed for update.
[ https://issues.apache.org/jira/browse/AURORA-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji updated AURORA-1806: - Assignee: Abhishek Jain > Enhance Aurora KILLED message for tasks killed for update. > -- > > Key: AURORA-1806 > URL: https://issues.apache.org/jira/browse/AURORA-1806 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Zameer Manji >Assignee: Abhishek Jain >Priority: Trivial > Labels: newbie > > Right now if a task is killed for an update the message in the UI and task > storage is "Killed for job update.". > This should be enhanced to include the update id. > Currently, I see the timestamp of the kill and then look at the update > history to see which update caused it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1806) Enhance Aurora KILLED message for tasks killed for update.
[ https://issues.apache.org/jira/browse/AURORA-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734040#comment-15734040 ] Zameer Manji commented on AURORA-1806: -- Done. > Enhance Aurora KILLED message for tasks killed for update. > -- > > Key: AURORA-1806 > URL: https://issues.apache.org/jira/browse/AURORA-1806 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Zameer Manji >Assignee: Abhishek Jain >Priority: Trivial > Labels: newbie > > Right now if a task is killed for an update the message in the UI and task > storage is "Killed for job update.". > This should be enhanced to include the update id. > Currently, I see the timestamp of the kill and then look at the update > history to see which update caused it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1847) Eliminate sequential scan in MemTaskStore.getJobKeys()
[ https://issues.apache.org/jira/browse/AURORA-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15726960#comment-15726960 ] Zameer Manji commented on AURORA-1847: -- Could this be resolved by moving to {{DBTaskStore}}, or does that have too many drawbacks? > Eliminate sequential scan in MemTaskStore.getJobKeys() > -- > > Key: AURORA-1847 > URL: https://issues.apache.org/jira/browse/AURORA-1847 > Project: Aurora > Issue Type: Story > Components: Efficiency, UI >Reporter: Mehrdad Nurolahzade >Priority: Minor > Labels: newbie > > The existing {{TaskStoreBenchmarks}} shows {{DBTaskStore}} is almost two > orders of magnitude faster than {{MemTaskStore}} when it comes to > {{getJobKeys()}}: > {code} > Benchmark (numTasks) Mode Cnt > Score Error Units > TaskStoreBenchmarks.DBFetchTasksBenchmark.run 1 thrpt 5 > 78430.531 ± 3255.027 ops/s > TaskStoreBenchmarks.DBFetchTasksBenchmark.run 5 thrpt 5 > 50774.988 ± 8986.951 ops/s > TaskStoreBenchmarks.DBFetchTasksBenchmark.run 10 thrpt 5 > 2480.074 ± 9833.122 ops/s > TaskStoreBenchmarks.MemFetchTasksBenchmark.run 1 thrpt 5 > 1189.568 ± 108.146 ops/s > TaskStoreBenchmarks.MemFetchTasksBenchmark.run 5 thrpt 5 > 124.990 ± 27.605 ops/s > TaskStoreBenchmarks.MemFetchTasksBenchmark.run 10 thrpt 5 > 35.724 ± 15.101 ops/s > {code} > If the scheduler is configured to run with the {{MemTaskStore}}, every hit on > the scheduler page ({{/scheduler}}) causes a call to > {{MemTaskStore.getJobKeys()}}. > The implementation of this method is currently very inefficient, as it performs > a sequential scan of the task store and then maps each task to its respective > job key. The sequential scan and mapping to job keys can be eliminated by > simply returning the key set of the existing secondary index {{job}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
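The fix suggested in the ticket is simple enough to sketch. Below is a minimal, self-contained model (hypothetical names, not Aurora's actual {{MemTaskStore}} internals) of maintaining a secondary job-key index alongside the task map, so that getJobKeys() becomes a key-set lookup rather than a full scan:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model (hypothetical names): a task store that maintains a secondary
// index from job key to task ids, so getJobKeys() can return the index's
// key set instead of scanning every stored task.
public class JobKeyIndexSketch {
    // Primary store: task id -> job key (stand-in for full task objects).
    final Map<String, String> tasks = new HashMap<>();
    // Secondary index, maintained on every save (mirrors the {{job}} index).
    final Map<String, Set<String>> tasksByJobKey = new HashMap<>();

    void save(String taskId, String jobKey) {
        tasks.put(taskId, jobKey);
        tasksByJobKey.computeIfAbsent(jobKey, k -> new HashSet<>()).add(taskId);
    }

    // The inefficient form described in the ticket: O(n) over all tasks.
    Set<String> jobKeysByScan() {
        return new HashSet<>(tasks.values());
    }

    // The proposed form: just expose the index's key set.
    Set<String> jobKeysFromIndex() {
        return Collections.unmodifiableSet(tasksByJobKey.keySet());
    }

    public static void main(String[] args) {
        JobKeyIndexSketch store = new JobKeyIndexSketch();
        store.save("t1", "role/env/jobA");
        store.save("t2", "role/env/jobA");
        store.save("t3", "role/env/jobB");
        // Both implementations agree on the distinct job keys.
        System.out.println(store.jobKeysFromIndex().equals(store.jobKeysByScan()));
    }
}
```

The real store would also have to remove ids from the index on task deletion; the point is only that the key set is already maintained, so no per-request scan is needed.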
[jira] [Commented] (AURORA-1823) `createJob` API uses single thread to move all tasks to PENDING
[ https://issues.apache.org/jira/browse/AURORA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723217#comment-15723217 ] Zameer Manji commented on AURORA-1823: -- Although I think that the {{createJob}} API should use multiple threads to move a job's tasks into PENDING, benchmarking shows logging is still the slowest part. There was a good performance improvement in https://github.com/apache/aurora/commit/4bc5246149f296b14dc520bedd71747fdb2578fb, so I think I'm just going to close this for now. > `createJob` API uses single thread to move all tasks to PENDING > > > Key: AURORA-1823 > URL: https://issues.apache.org/jira/browse/AURORA-1823 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji >Priority: Minor > > If you create a single job with many tasks (let's say 10k+) the `createJob` > API will take a long time. This is because the `createJob` API only returns > when all of the tasks have moved to PENDING and it uses a single thread to do > so. Here is a snippet of the logs: > {noformat} > ... 
> I1116 17:11:53.964 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a > state machine transition INIT -> PENDING > I1116 17:11:53.965 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a > I1116 17:11:54.094 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80 > state machine transition INIT -> PENDING > I1116 17:11:54.094 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80 > I1116 17:11:54.223 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03 > state machine transition INIT -> PENDING > I1116 17:11:54.224 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03 > I1116 17:11:54.353 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570 > state machine transition INIT -> PENDING > I1116 17:11:54.353 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570 > I1116 17:11:54.482 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67 > state machine transition INIT -> PENDING > I1116 17:11:54.482 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67 > I1116 17:11:54.611 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153 > state machine transition INIT -> PENDING > I1116 17:11:54.612 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153 > I1116 17:11:54.741 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9 > state machine transition INIT -> PENDING > I1116 17:11:54.742 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9 > ... > {noformat} > Observe that a single jetty thread is doing this. > We should leverage {{BatchWorker}} to have concurrent mutations here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
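The {{BatchWorker}} suggestion above amounts to computing state-machine side effects first and persisting them together rather than one at a time. A minimal sketch of that batching shape (illustrative names only, not Aurora's actual {{BatchWorker}} or {{StateManagerImpl}} API):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the batching idea: evaluate each task's INIT -> PENDING
// transition first, queue the resulting writes, and persist them with one
// bulk call instead of one storage mutation per task.
public class BatchSaveSketch {
    // Counts calls to the stand-in bulk storage API (e.g. a saveTasks call).
    static int bulkSaveCalls = 0;

    static List<String> saveAll(List<String> taskWrites) {
        bulkSaveCalls++;
        return new ArrayList<>(taskWrites);
    }

    static List<String> moveToPending(int numTasks) {
        List<String> pendingWrites = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            // The state-machine side effects (the SAVE_STATE work commands in
            // the log above) are computed here but not persisted yet.
            pendingWrites.add("task-" + i + ":INIT->PENDING");
        }
        // One storage operation for the whole job, not one per task.
        return saveAll(pendingWrites);
    }

    public static void main(String[] args) {
        List<String> saved = moveToPending(5);
        System.out.println(saved.size() + " writes in " + bulkSaveCalls + " bulk save(s)");
    }
}
```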
[jira] [Commented] (AURORA-1844) Force a snapshot at the end of Scheduler startup.
[ https://issues.apache.org/jira/browse/AURORA-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15716903#comment-15716903 ] Zameer Manji commented on AURORA-1844: -- This might be a dupe of AURORA-1812. > Force a snapshot at the end of Scheduler startup. > - > > Key: AURORA-1844 > URL: https://issues.apache.org/jira/browse/AURORA-1844 > Project: Aurora > Issue Type: Task >Reporter: Santhosh Kumar Shanmugham >Priority: Minor > > When the scheduler starts up, it replays the logs from the replicated log to > catch up with the current state, before announcing itself as the leader to > the outside world. If, for any reason, the scheduler dies after this replay, > having added more log entries, the next startup will have to redo the work > again. This becomes a problem when the amount of additional work is not > trivial, and can send the scheduler down the path of a death spiral. One > example of this is when the TaskHistoryPruner cleans up the DB but adds > log entries. In order to avoid the repeated work, the scheduler should > force a snapshot after the initial replay. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (AURORA-1831) Tweak logging pattern to improve performance
[ https://issues.apache.org/jira/browse/AURORA-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji reassigned AURORA-1831: Assignee: Zameer Manji > Tweak logging pattern to improve performance > > > Key: AURORA-1831 > URL: https://issues.apache.org/jira/browse/AURORA-1831 > Project: Aurora > Issue Type: Task > Components: Efficiency >Reporter: Mehrdad Nurolahzade >Assignee: Zameer Manji >Priority: Minor > Labels: newbie > > The choice of logging pattern can have an impact on system performance. > Using expensive patterns like class name or line number is discouraged for > performance-critical systems like Aurora. > A recent experiment with the task state machine benchmark revealed a ~2x > performance improvement when class name and line number patterns were > removed. Tweak Aurora's default logging pattern to improve logging > performance. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1669) Kill twitter/commons ZK libs when Curator replacements are vetted
[ https://issues.apache.org/jira/browse/AURORA-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712330#comment-15712330 ] Zameer Manji commented on AURORA-1669: -- Here is my assessment of how to fix AURORA-1840: * We cannot upgrade to Curator 3.x because it only works with ZK 3.5.x, which has not been released yet. * We can move to the {{LeaderSelector}} recipe (per [~StephanErb]'s suggestion) and figure out how to make it backwards compatible for leader discovery. * We can figure out how to override the error-handling behavior of {{LeaderLatch}} so it does not lose leadership on session suspension, only on session loss. > Kill twitter/commons ZK libs when Curator replacements are vetted > - > > Key: AURORA-1669 > URL: https://issues.apache.org/jira/browse/AURORA-1669 > Project: Aurora > Issue Type: Task >Reporter: John Sirois >Assignee: John Sirois > > Once we have reports from production users that the Curator zk plumbing > introduced in AURORA-1468 is working well, the {{-zk_use_curator}} flag > should be deprecated and then the flag and commons code killed. If the > vetting happens before the next release ({{0.14.0}}), we can dispense with a > deprecation cycle. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
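The third option boils down to a different leadership policy on connection-state changes. A toy sketch of that decision follows (names are illustrative; Curator's actual extension points for {{LeaderLatch}} would need the investigation described above):

```java
// Sketch of the "only abdicate on session loss" policy. By default a
// Curator LeaderLatch treats a SUSPENDED connection as loss of leadership;
// the proposal is to ride out suspensions (e.g. a long GC pause within the
// session timeout) and abdicate only when the ZK session is actually LOST.
public class LeadershipPolicySketch {
    enum ConnState { CONNECTED, SUSPENDED, LOST }

    static boolean shouldAbdicate(ConnState state) {
        // Suspension means "we might have lost the session"; loss means
        // "ZooKeeper has expired the session". Only the latter is fatal
        // under the proposed policy.
        return state == ConnState.LOST;
    }

    public static void main(String[] args) {
        System.out.println(shouldAbdicate(ConnState.SUSPENDED)); // keep leadership
        System.out.println(shouldAbdicate(ConnState.LOST));      // abdicate
    }
}
```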
[jira] [Reopened] (AURORA-1669) Kill twitter/commons ZK libs when Curator replacements are vetted
[ https://issues.apache.org/jira/browse/AURORA-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji reopened AURORA-1669: -- Re-opening this because of AURORA-1840 Per [~jsirois]'s suggestion we may need to upgrade [Curator|https://issues.apache.org/jira/browse/AURORA-1840?focusedCommentId=15712226=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15712226]. > Kill twitter/commons ZK libs when Curator replacements are vetted > - > > Key: AURORA-1669 > URL: https://issues.apache.org/jira/browse/AURORA-1669 > Project: Aurora > Issue Type: Task >Reporter: John Sirois >Assignee: John Sirois > > Once we have reports from production users that the Curator zk plumbing > introduced in AURORA-1468 is working well, the {{-zk_use_curator}} flag > should be deprecated and then the flag and commons code killed. If the > vetting happens before the next release ({{0.14.0}}), we can dispense with a > deprecation cycle. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1840) Issue with Curator-backed discovery under heavy load
[ https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712240#comment-15712240 ] Zameer Manji commented on AURORA-1840: -- +1 This seems identical to the behaviour of the previous implementation. > Issue with Curator-backed discovery under heavy load > > > Key: AURORA-1840 > URL: https://issues.apache.org/jira/browse/AURORA-1840 > Project: Aurora > Issue Type: Bug > Components: Scheduler >Reporter: David McLaughlin >Assignee: David McLaughlin >Priority: Blocker > Fix For: 0.17.0 > > > We've been having some performance issues recently with our production > clusters at Twitter. A side effect of these is occasional stop-the-world GC > pauses for up to 15 seconds. This has been happening at our scale for quite > some time, but previous versions of the Scheduler were resilient to this and > no leadership change would occur. > Since we moved to Curator, we are no longer resilient to these GC pauses. The > Scheduler is now failing over any time we see a GC pause, even though these > pauses are within the session timeout. Here is an example pause in the > scheduler logs with the associated ZK session timeout that leads to a > failover: > {code} > I1118 19:40:16.871801 51800 sched.cpp:1025] Scheduler::statusUpdate took > 586236ns > I1118 19:40:16.902 [TaskGroupBatchWorker, StateMachine$Builder:389] > redacted-9f565b4-067e-422f-b641-c6000f9ae2c8 state machine transition PENDING > -> ASSIGNED > I1118 19:40:16.903 [TaskGroupBatchWorker, TaskStateMachine:474] Adding work > command SAVE_STATE for redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8 > I1118 19:40:16.903 [TaskGroupBatchWorker, TaskAssigner$TaskAssignerImpl:130] > Offer on agent redacted (id 566ae347-c1b6-44ce-8551-b7a6cda72989-S7579) is > being assigned task redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8. 
> W1118 19:40:31.744 [main-SendThread(redacted:2181), > ClientCnxn$SendThread:1108] Client session timed out, have not heard from > server in 20743ms for sessionid 0x6584fd2b34ede86 > {code} > As you can see from the timestamps, there was a 15s GC pause (confirmed in > our GC logs - a CMS promotion failure caused the pause) and this triggers a > session timeout of 20s to fire. Note: we have seen GC pauses as little as 7s > cause the same behavior. Removed: my ZK was rusty. 20s is 2/3 of our 30s ZK > timeout, so our session timeout is being wired through fine. > We have confirmed that the Scheduler no longer fails over when deploying from > HEAD with these two commits reverted and setting zk_use_curator to false: > https://github.com/apache/aurora/commit/b417be38fe1fcae6b85f7e91cea961ab272adf3f > https://github.com/apache/aurora/commit/69cba786efc2628eab566201dfea46836a1d9af5 > This is a pretty big blocker for us given how expensive Scheduler failovers > are (currently several minutes for us). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1840) Issue with Curator-backed discovery under heavy load
[ https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712224#comment-15712224 ] Zameer Manji commented on AURORA-1840: -- I don't object to reverting this until some analysis can be done. > Issue with Curator-backed discovery under heavy load > > > Key: AURORA-1840 > URL: https://issues.apache.org/jira/browse/AURORA-1840 > Project: Aurora > Issue Type: Bug > Components: Scheduler >Reporter: David McLaughlin >Assignee: David McLaughlin >Priority: Blocker > Fix For: 0.17.0 > > > We've been having some performance issues recently with our production > clusters at Twitter. A side effect of these is occasional stop-the-world GC > pauses for up to 15 seconds. This has been happening at our scale for quite > some time, but previous versions of the Scheduler were resilient to this and > no leadership change would occur. > Since we moved to Curator, we are no longer resilient to these GC pauses. The > Scheduler is now failing over any time we see a GC pause, even though these > pauses are within the session timeout. Here is an example pause in the > scheduler logs with the associated ZK session timeout that leads to a > failover: > {code} > I1118 19:40:16.871801 51800 sched.cpp:1025] Scheduler::statusUpdate took > 586236ns > I1118 19:40:16.902 [TaskGroupBatchWorker, StateMachine$Builder:389] > redacted-9f565b4-067e-422f-b641-c6000f9ae2c8 state machine transition PENDING > -> ASSIGNED > I1118 19:40:16.903 [TaskGroupBatchWorker, TaskStateMachine:474] Adding work > command SAVE_STATE for redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8 > I1118 19:40:16.903 [TaskGroupBatchWorker, TaskAssigner$TaskAssignerImpl:130] > Offer on agent redacted (id 566ae347-c1b6-44ce-8551-b7a6cda72989-S7579) is > being assigned task redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8. 
> W1118 19:40:31.744 [main-SendThread(redacted:2181), > ClientCnxn$SendThread:1108] Client session timed out, have not heard from > server in 20743ms for sessionid 0x6584fd2b34ede86 > {code} > As you can see from the timestamps, there was a 15s GC pause (confirmed in > our GC logs - a CMS promotion failure caused the pause) and this triggers a > session timeout of 20s to fire. Note: we have seen GC pauses as little as 7s > cause the same behavior. Removed: my ZK was rusty. 20s is 2/3 of our 30s ZK > timeout, so our session timeout is being wired through fine. > We have confirmed that the Scheduler no longer fails over when deploying from > HEAD with these two commits reverted and setting zk_use_curator to false: > https://github.com/apache/aurora/commit/b417be38fe1fcae6b85f7e91cea961ab272adf3f > https://github.com/apache/aurora/commit/69cba786efc2628eab566201dfea46836a1d9af5 > This is a pretty big blocker for us given how expensive Scheduler failovers > are (currently several minutes for us). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1834) Expose stats on undelivered event bus events
[ https://issues.apache.org/jira/browse/AURORA-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706300#comment-15706300 ] Zameer Manji commented on AURORA-1834: -- This is a good idea; we should count these events much like we count uncaught exceptions in the scheduling loop. Such a stat would be good to alert on and would help track regressions. > Expose stats on undelivered event bus events > > > Key: AURORA-1834 > URL: https://issues.apache.org/jira/browse/AURORA-1834 > Project: Aurora > Issue Type: Task > Components: Scheduler >Reporter: Mehrdad Nurolahzade >Priority: Minor > Labels: newbie > > {{DeadEvent}} is a wrapper for an event that was posted, but which had no > subscribers and thus could not be delivered. {{PubSubEventModule}} is > currently utilizing a {{DeadEventHandler}} for logging such events but it > should additionally expose stats. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1825) Enable async logging by default
[ https://issues.apache.org/jira/browse/AURORA-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15691758#comment-15691758 ] Zameer Manji commented on AURORA-1825: -- Locally I removed the expensive parts of our logback config with: {noformat} diff --git c/src/main/resources/logback.xml w/src/main/resources/logback.xml index 84c175c..6206806 100644 --- c/src/main/resources/logback.xml +++ w/src/main/resources/logback.xml @@ -23,7 +23,7 @@ limitations under the License. System.err -%.-1level%date{MMdd HH:mm:ss.SSS} [%thread, %class{0}:%line] %message %xThrowable%n +%.-1level%date{MMdd HH:mm:ss.SSS} [%thread] %message %xThrowable%n {noformat} Before: {noformat} Benchmark (numPendingTasks) (numTasksToDelete) Mode Cnt Score Error Units StateManagerBenchmarks.DeleteTasksBenchmark.run N/A 1000 thrpt 10 2.510 ± 0.557 ops/s StateManagerBenchmarks.DeleteTasksBenchmark.run N/A 1 thrpt 10 0.272 ± 0.030 ops/s StateManagerBenchmarks.DeleteTasksBenchmark.run N/A 5 thrpt 10 0.053 ± 0.011 ops/s StateManagerBenchmarks.InsertPendingTasksBenchmark.run 1000 N/A thrpt 10 2.446 ± 0.698 ops/s StateManagerBenchmarks.InsertPendingTasksBenchmark.run 1 N/A thrpt 10 0.246 ± 0.018 ops/s StateManagerBenchmarks.InsertPendingTasksBenchmark.run 5 N/A thrpt 10 0.041 ± 0.006 ops/s {noformat} After: {noformat} Benchmark (numPendingTasks) (numTasksToDelete) Mode Cnt Score Error Units StateManagerBenchmarks.DeleteTasksBenchmark.run N/A 1000 thrpt 10 8.640 ± 1.431 ops/s StateManagerBenchmarks.DeleteTasksBenchmark.run N/A 1 thrpt 10 0.892 ± 0.066 ops/s StateManagerBenchmarks.DeleteTasksBenchmark.run N/A 5 thrpt 10 0.172 ± 0.010 ops/s StateManagerBenchmarks.InsertPendingTasksBenchmark.run 1000 N/A thrpt 10 4.837 ± 1.511 ops/s StateManagerBenchmarks.InsertPendingTasksBenchmark.run 1 N/A thrpt 10 0.510 ± 0.315 ops/s StateManagerBenchmarks.InsertPendingTasksBenchmark.run 5 N/A thrpt 10 0.079 ± 0.052 ops/s {noformat} I picked this benchmark because it logs a lot in the critical path. 
We could probably fix this problem by removing the line number and replacing the class name with the logger name. The net result would be no line numbers but much faster logging. > Enable async logging by default > --- > > Key: AURORA-1825 > URL: https://issues.apache.org/jira/browse/AURORA-1825 > Project: Aurora > Issue Type: Task >Reporter: Zameer Manji >Assignee: Jing Chen >Priority: Minor > > Based on my experience while working on AURORA-1823 and [~StephanErb]'s work > on logging recently, I think it would be best if we enabled async logging. > For example if one attempts to parallelize the work inside > {{StateManagerImpl}} there isn't much benefit because all of the state > transitions are logged and all of the threads would contend for the lock. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1827) Fix SLA percentile calculation
[ https://issues.apache.org/jira/browse/AURORA-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15691040#comment-15691040 ] Zameer Manji commented on AURORA-1827: -- I upgraded us to Guava 20. It has a [Quantiles|http://google.github.io/guava/releases/20.0/api/docs/com/google/common/math/Quantiles.html] class and a [Stats|http://google.github.io/guava/releases/20.0/api/docs/com/google/common/math/Stats.html] class that could be very helpful here. > Fix SLA percentile calculation > --- > > Key: AURORA-1827 > URL: https://issues.apache.org/jira/browse/AURORA-1827 > Project: Aurora > Issue Type: Story >Reporter: Reza Motamedi >Priority: Trivial > Labels: newbie, sla > > The calculation of mttX (median-time-to-X) depends on the computation of > percentile values. The current implementation does not behave nicely with a > small sample size. For instance, for a given sample set of {50, 150}, the > 50th percentile is reported to be 50, although 100 seems a more appropriate > return value. > One solution is to modify `SlaUtil` to perform an extrapolation when the > sample size is small or when the index corresponding to a percentile value is > not an integer. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
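For reference, the interpolating behavior the ticket asks for can be sketched in a few lines (illustrative names, not the actual {{SlaUtil}} API; this mirrors the linear interpolation that Guava 20's Quantiles class performs):

```java
import java.util.Arrays;

// Sketch of a linear-interpolation percentile: when the fractional index
// into the sorted sample is not an integer, interpolate between the two
// neighboring values instead of truncating to the lower one.
public class PercentileSketch {
    static double percentile(double[] sample, double p) {
        double[] values = sample.clone();
        Arrays.sort(values);
        // Fractional position of the p-th percentile in the sorted sample.
        double idx = (p / 100.0) * (values.length - 1);
        int lo = (int) Math.floor(idx);
        int hi = (int) Math.ceil(idx);
        double frac = idx - lo;
        return values[lo] + frac * (values[hi] - values[lo]);
    }

    public static void main(String[] args) {
        // The example from the ticket: {50, 150} at the 50th percentile
        // interpolates to 100 rather than truncating to 50.
        System.out.println(percentile(new double[] {50, 150}, 50));
    }
}
```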
[jira] [Commented] (AURORA-1823) `createJob` API uses single thread to move all tasks to PENDING
[ https://issues.apache.org/jira/browse/AURORA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15688685#comment-15688685 ] Zameer Manji commented on AURORA-1823: -- Benchmarks for {{StateManagerImpl}} to validate any changes: https://reviews.apache.org/r/54011/ > `createJob` API uses single thread to move all tasks to PENDING > > > Key: AURORA-1823 > URL: https://issues.apache.org/jira/browse/AURORA-1823 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji >Priority: Minor > > If you create a single job with many tasks (lets say 10k+) the `createJob` > API will take a long time. This is because the `createJob` API only returns > when all of the tasks have moved to PENDING and it uses a single thread to do > so. Here is a snippet of the logs: > {noformat} > ... > I1116 17:11:53.964 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a > state machine transition INIT -> PENDING > I1116 17:11:53.965 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a > I1116 17:11:54.094 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80 > state machine transition INIT -> PENDING > I1116 17:11:54.094 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80 > I1116 17:11:54.223 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03 > state machine transition INIT -> PENDING > I1116 17:11:54.224 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03 > I1116 17:11:54.353 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570 > state machine transition INIT -> PENDING > I1116 17:11:54.353 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570 > I1116 17:11:54.482 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67 > state machine transition INIT -> PENDING > I1116 17:11:54.482 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67 > I1116 17:11:54.611 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153 > state machine transition INIT -> PENDING > I1116 17:11:54.612 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153 > I1116 17:11:54.741 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9 > state machine transition INIT -> PENDING > I1116 17:11:54.742 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9 > ... > {noformat} > Observe that a single jetty thread is doing this. > We should leverage {{BatchWorker}} to have concurrent mutations here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1825) Enable async logging by default
[ https://issues.apache.org/jira/browse/AURORA-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15688506#comment-15688506 ] Zameer Manji commented on AURORA-1825: -- We could achieve this by changing {{logback.xml}} to use this: http://logback.qos.ch/manual/appenders.html#AsyncAppender > Enable async logging by default > --- > > Key: AURORA-1825 > URL: https://issues.apache.org/jira/browse/AURORA-1825 > Project: Aurora > Issue Type: Task >Reporter: Zameer Manji >Priority: Minor > > Based on my experience while working on AURORA-1823 and [~StephanErb]'s work > on logging recently, I think it would be best if we enabled async logging. > For example if one attempts to parallelize the work inside > {{StateManagerImpl}} there isn't much benefit because all of the state > transitions are logged and all of the threads would contend for the lock. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
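A minimal AsyncAppender wrapper in logback.xml might look like the following (queue settings are illustrative defaults, not a vetted production config, and it assumes an existing console appender named STDOUT; the actual appender name in Aurora's logback.xml may differ):

```xml
<!-- Sketch: route the root logger through an AsyncAppender so logging
     calls enqueue events instead of writing synchronously. -->
<appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
  <appender-ref ref="STDOUT"/>
  <!-- Illustrative tuning values; see the logback manual for defaults. -->
  <queueSize>1024</queueSize>
  <discardingThreshold>0</discardingThreshold>
</appender>
<root level="INFO">
  <appender-ref ref="ASYNC"/>
</root>
```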
[jira] [Created] (AURORA-1825) Enable async logging by default
Zameer Manji created AURORA-1825: Summary: Enable async logging by default Key: AURORA-1825 URL: https://issues.apache.org/jira/browse/AURORA-1825 Project: Aurora Issue Type: Task Reporter: Zameer Manji Priority: Minor Based on my experience while working on AURORA-1823 and [~StephanErb]'s work on logging recently, I think it would be best if we enabled async logging. For example if one attempts to parallelize the work inside {{StateManagerImpl}} there isn't much benefit because all of the state transitions are logged and all of the threads would contend for the lock. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (AURORA-1823) `createJob` API uses single thread to move all tasks to PENDING
[ https://issues.apache.org/jira/browse/AURORA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji reassigned AURORA-1823: Assignee: Zameer Manji > `createJob` API uses single thread to move all tasks to PENDING > > > Key: AURORA-1823 > URL: https://issues.apache.org/jira/browse/AURORA-1823 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji >Priority: Minor > > If you create a single job with many tasks (lets say 10k+) the `createJob` > API will take a long time. This is because the `createJob` API only returns > when all of the tasks have moved to PENDING and it uses a single thread to do > so. Here is a snippet of the logs: > {noformat} > ... > I1116 17:11:53.964 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a > state machine transition INIT -> PENDING > I1116 17:11:53.965 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a > I1116 17:11:54.094 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80 > state machine transition INIT -> PENDING > I1116 17:11:54.094 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80 > I1116 17:11:54.223 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03 > state machine transition INIT -> PENDING > I1116 17:11:54.224 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03 > I1116 17:11:54.353 [qtp1219612889-50, 
StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570 > state machine transition INIT -> PENDING > I1116 17:11:54.353 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570 > I1116 17:11:54.482 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67 > state machine transition INIT -> PENDING > I1116 17:11:54.482 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67 > I1116 17:11:54.611 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153 > state machine transition INIT -> PENDING > I1116 17:11:54.612 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153 > I1116 17:11:54.741 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9 > state machine transition INIT -> PENDING > I1116 17:11:54.742 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9 > ... > {noformat} > Observe that a single jetty thread is doing this. > We should leverage {{BatchWorker}} to have concurrent mutations here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1014) Client binding_helper to resolve docker label to a stable ID at create
[ https://issues.apache.org/jira/browse/AURORA-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15684360#comment-15684360 ] Zameer Manji commented on AURORA-1014: -- [~StephanErb] [~santhk] Can we close this ticket and make a new one for Mesos images? > Client binding_helper to resolve docker label to a stable ID at create > -- > > Key: AURORA-1014 > URL: https://issues.apache.org/jira/browse/AURORA-1014 > Project: Aurora > Issue Type: Story > Components: Client, Packaging >Reporter: Kevin Sweeney >Assignee: Santhosh Kumar Shanmugham > Fix For: 0.17.0 > > > Follow-up from discussion on IRC: > Some docker labels are mutable, meaning the image a task runs in could change > from restart to restart even if the rest of the task config doesn't change. > This breaks assumptions that make rolling updates the safe and preferred way > to deploy a new Aurora job > Add a binding helper that resolves a docker label to an immutable image > identifier at create time and make it the default for the Docker helper > introduced in https://reviews.apache.org/r/28920/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1823) `createJob` API uses single thread to move all tasks to PENDING
[ https://issues.apache.org/jira/browse/AURORA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679664#comment-15679664 ] Zameer Manji commented on AURORA-1823: -- Upon further analysis, {{BatchWorker}} might not help us here. After some JMH benchmarking and profiling, the biggest problem with {{insertPendingTasks}} is that it doesn't use the bulk storage API {{saveTasks}}. Instead it calls {{mutateTask}} for every task that is moving to {{PENDING}}. I can get a 10x+ improvement in throughput by simply queueing up the mutations and side effects that result from the state machine and then calling {{saveTasks}} once all of the mutations have been computed. I'm going to look into refactoring {{StateManagerImpl}} to support evaluating multiple task state machines concurrently and then merging all of the side effects from those state machines into a single operation. > `createJob` API uses single thread to move all tasks to PENDING > > > Key: AURORA-1823 > URL: https://issues.apache.org/jira/browse/AURORA-1823 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Priority: Minor > > If you create a single job with many tasks (let's say 10k+) the `createJob` > API will take a long time. This is because the `createJob` API only returns > when all of the tasks have moved to PENDING and it uses a single thread to do > so. Here is a snippet of the logs: > {noformat} > ...
> I1116 17:11:53.964 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a > state machine transition INIT -> PENDING > I1116 17:11:53.965 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a > I1116 17:11:54.094 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80 > state machine transition INIT -> PENDING > I1116 17:11:54.094 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80 > I1116 17:11:54.223 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03 > state machine transition INIT -> PENDING > I1116 17:11:54.224 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03 > I1116 17:11:54.353 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570 > state machine transition INIT -> PENDING > I1116 17:11:54.353 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570 > I1116 17:11:54.482 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67 > state machine transition INIT -> PENDING > I1116 17:11:54.482 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67 > I1116 17:11:54.611 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153 > state machine transition INIT -> PENDING > I1116 17:11:54.612 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153 > I1116 17:11:54.741 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9 > state machine transition INIT -> PENDING > I1116 17:11:54.742 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9 > ... > {noformat} > Observe that a single jetty thread is doing this. > We should leverage {{BatchWorker}} to have concurrent mutations here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
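The comment at the top of this message describes the win concretely: compute all state-machine mutations first, then persist once through the bulk API instead of issuing one {{mutateTask}} write per task. A minimal Python sketch of that queue-then-flush shape (Aurora's actual code is Java; the `PendingBatch` class and the dict standing in for the task store are illustrative):

```python
# Illustrative queue-then-flush pattern: buffer per-task mutations and
# side effects, then persist everything with one bulk write instead of
# one storage write per task.
class PendingBatch(object):
    def __init__(self, store):
        self._store = store          # stand-in for the task store
        self._mutations = []         # queued (task_id, new_state) pairs
        self._side_effects = []      # queued follow-up work commands

    def transition(self, task_id, new_state):
        # Evaluate the state machine result, but do NOT write yet.
        self._mutations.append((task_id, new_state))
        self._side_effects.append("SAVE_STATE:%s" % task_id)

    def flush(self):
        # Single bulk write, analogous to one saveTasks() call.
        self._store.update(self._mutations)
        effects, self._side_effects = self._side_effects, []
        self._mutations = []
        return effects
```

The point of the shape is that the expensive persistence step is amortized: N transitions produce one write, which is where the reported 10x+ throughput improvement comes from.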
[jira] [Commented] (AURORA-1823) `createJob` API uses single thread to move all tasks to PENDING
[ https://issues.apache.org/jira/browse/AURORA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677409#comment-15677409 ] Zameer Manji commented on AURORA-1823: -- Agreed that our API should do this. Simple profiling indicates this is slow because a single thread is iterating over every task and doing a single write for each one. If we did batching we could have a single thread moving many to PENDING at a time, and if we used {{BatchWorker}} we could have a pool of threads doing this. I'm not going to change the semantics of the API with BatchWorker. BatchWorker provides a future, and the caller of BatchWorker can block until the future resolves. Instead I think it would be best to move multiple tasks from INIT to PENDING at a time and have multiple threads doing that concurrently, since there is no data dependency between the tasks. > `createJob` API uses single thread to move all tasks to PENDING > > > Key: AURORA-1823 > URL: https://issues.apache.org/jira/browse/AURORA-1823 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Priority: Minor > > If you create a single job with many tasks (let's say 10k+) the `createJob` > API will take a long time. This is because the `createJob` API only returns > when all of the tasks have moved to PENDING and it uses a single thread to do > so. Here is a snippet of the logs: > {noformat} > ...
> I1116 17:11:53.964 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a > state machine transition INIT -> PENDING > I1116 17:11:53.965 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a > I1116 17:11:54.094 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80 > state machine transition INIT -> PENDING > I1116 17:11:54.094 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80 > I1116 17:11:54.223 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03 > state machine transition INIT -> PENDING > I1116 17:11:54.224 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03 > I1116 17:11:54.353 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570 > state machine transition INIT -> PENDING > I1116 17:11:54.353 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570 > I1116 17:11:54.482 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67 > state machine transition INIT -> PENDING > I1116 17:11:54.482 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67 > I1116 17:11:54.611 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153 > state machine transition INIT -> PENDING > I1116 17:11:54.612 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153 > I1116 17:11:54.741 [qtp1219612889-50, StateMachine$Builder:389] > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9 > state machine transition INIT -> PENDING > I1116 17:11:54.742 [qtp1219612889-50, TaskStateMachine:474] Adding work > command SAVE_STATE for > sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9 > ... > {noformat} > Observe that a single jetty thread is doing this. > We should leverage {{BatchWorker}} to have concurrent mutations here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
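Because there is no data dependency between the tasks, the INIT to PENDING transitions can be fanned out across a worker pool and the results collected afterwards. A small Python sketch of that fan-out with `concurrent.futures` (illustrative only; Aurora's scheduler is Java, and `to_pending` is a stand-in for evaluating one task's state machine):

```python
# Illustrative fan-out: the INIT -> PENDING transitions carry no data
# dependency between tasks, so they can be evaluated concurrently by a
# pool of workers and merged into one result afterwards.
from concurrent.futures import ThreadPoolExecutor

def to_pending(task_id):
    # Stand-in for evaluating one task's state machine transition.
    return (task_id, "PENDING")

def move_all_to_pending(task_ids, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(to_pending, task_ids))
```

In the real scheduler the interesting part is what happens after the fan-out: the per-task side effects still have to be merged into a single storage operation, which is what the later comment on this ticket proposes.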
[jira] [Created] (AURORA-1823) `createJob` API uses single thread to move all tasks to PENDING
Zameer Manji created AURORA-1823: Summary: `createJob` API uses single thread to move all tasks to PENDING Key: AURORA-1823 URL: https://issues.apache.org/jira/browse/AURORA-1823 Project: Aurora Issue Type: Bug Reporter: Zameer Manji Priority: Minor If you create a single job with many tasks (lets say 10k+) the `createJob` API will take a long time. This is because the `createJob` API only returns when all of the tasks have moved to PENDING and it uses a single thread to do so. Here is a snippet of the logs: {noformat} ... I1116 17:11:53.964 [qtp1219612889-50, StateMachine$Builder:389] sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a state machine transition INIT -> PENDING I1116 17:11:53.965 [qtp1219612889-50, TaskStateMachine:474] Adding work command SAVE_STATE for sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a I1116 17:11:54.094 [qtp1219612889-50, StateMachine$Builder:389] sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80 state machine transition INIT -> PENDING I1116 17:11:54.094 [qtp1219612889-50, TaskStateMachine:474] Adding work command SAVE_STATE for sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80 I1116 17:11:54.223 [qtp1219612889-50, StateMachine$Builder:389] sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03 state machine transition INIT -> PENDING I1116 17:11:54.224 [qtp1219612889-50, TaskStateMachine:474] Adding work command SAVE_STATE for sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03 I1116 17:11:54.353 [qtp1219612889-50, StateMachine$Builder:389] sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570 state machine transition INIT -> PENDING I1116 17:11:54.353 [qtp1219612889-50, TaskStateMachine:474] 
Adding work command SAVE_STATE for sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570 I1116 17:11:54.482 [qtp1219612889-50, StateMachine$Builder:389] sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67 state machine transition INIT -> PENDING I1116 17:11:54.482 [qtp1219612889-50, TaskStateMachine:474] Adding work command SAVE_STATE for sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67 I1116 17:11:54.611 [qtp1219612889-50, StateMachine$Builder:389] sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153 state machine transition INIT -> PENDING I1116 17:11:54.612 [qtp1219612889-50, TaskStateMachine:474] Adding work command SAVE_STATE for sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153 I1116 17:11:54.741 [qtp1219612889-50, StateMachine$Builder:389] sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9 state machine transition INIT -> PENDING I1116 17:11:54.742 [qtp1219612889-50, TaskStateMachine:474] Adding work command SAVE_STATE for sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9 ... {noformat} Observe that a single jetty thread is doing this. We should leverage {{BatchWorker}} to have concurrent mutations here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (AURORA-1821) Bump Guava to 20
[ https://issues.apache.org/jira/browse/AURORA-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji reassigned AURORA-1821: Assignee: Zameer Manji > Bump Guava to 20 > > > Key: AURORA-1821 > URL: https://issues.apache.org/jira/browse/AURORA-1821 > Project: Aurora > Issue Type: Task >Reporter: Zameer Manji >Assignee: Zameer Manji > > Guava 20 is now out with a bunch of improvements. We should take in the > upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AURORA-1821) Bump Guava to 20
Zameer Manji created AURORA-1821: Summary: Bump Guava to 20 Key: AURORA-1821 URL: https://issues.apache.org/jira/browse/AURORA-1821 Project: Aurora Issue Type: Task Reporter: Zameer Manji Guava 20 is now out with a bunch of improvements. We should take in the upgrade. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1820) Reduce storage write lock contention by adopting Double-Checked Locking pattern in TimedOutTaskHandler
[ https://issues.apache.org/jira/browse/AURORA-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15667969#comment-15667969 ] Zameer Manji commented on AURORA-1820: -- Good find [~mnurolahzade]! Do we measure throughput of {{TimedOutTaskHandler}} in benchmarks already? > Reduce storage write lock contention by adopting Double-Checked Locking > pattern in TimedOutTaskHandler > -- > > Key: AURORA-1820 > URL: https://issues.apache.org/jira/browse/AURORA-1820 > Project: Aurora > Issue Type: Task > Components: Efficiency, Scheduler >Reporter: Mehrdad Nurolahzade >Assignee: Mehrdad Nurolahzade >Priority: Critical > > {{TimedOutTaskHandler}} acquires storage write lock for every task every time > they transition to a transient state. It then verifies after a default > time-out period of 5 minutes if the task has transitioned out of the > transient state. > The verification step takes place while holding the storage write lock. In > over 99% of cases the logic short-circuits and returns from > {{StateManagerImpl.updateTaskAndExternalState()}} once it learns task has > transitioned out of the transient state. > Reduce storage write lock contention by adopting [Double-Checked > Locking|https://en.wikipedia.org/wiki/Double-checked_locking] pattern in > {{TimedOutTaskHandler.run()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
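The double-checked locking pattern proposed above reads the task's state once without the lock, and only takes the write lock (and re-checks) when the first read suggests work is needed; in the ~99% short-circuit case the lock is never touched. A Python sketch of the pattern with `threading.Lock` (illustrative of the shape of {{TimedOutTaskHandler.run()}}, not the Java code itself):

```python
# Double-checked locking sketch: check the transient state without the
# lock first; in the common case the task already left the transient
# state and we never contend on the write lock. Re-check under the lock
# because the state may change between the two reads.
import threading

class TimedOutHandler(object):
    def __init__(self):
        self._lock = threading.Lock()  # stand-in for the storage write lock
        self.states = {}               # task_id -> state
        self.timed_out = []            # tasks actually transitioned

    def handle_timeout(self, task_id):
        if self.states.get(task_id) != "TRANSIENT":  # first, lock-free check
            return False
        with self._lock:                             # second check, under lock
            if self.states.get(task_id) != "TRANSIENT":
                return False
            self.timed_out.append(task_id)
            self.states[task_id] = "LOST"
            return True
```

The second check under the lock is what keeps the pattern safe: without it, two racing handlers could both observe the transient state and both apply the timeout.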
[jira] [Updated] (AURORA-1815) Fix checksums for packages on bintray
[ https://issues.apache.org/jira/browse/AURORA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji updated AURORA-1815: - Fix Version/s: 0.17.0 > Fix checksums for packages on bintray > - > > Key: AURORA-1815 > URL: https://issues.apache.org/jira/browse/AURORA-1815 > Project: Aurora > Issue Type: Story > Components: Packaging >Affects Versions: 0.16.0 >Reporter: Thomas Bach >Priority: Minor > Fix For: 0.17.0 > > > The checksum files on bintray are wrong. Take for example the content of > {{aurora-scheduler_0.16.0_amd64.deb.sha}}: > {quote} > b6203f169df44d9a91df3dfe4670950c3ab49eb4 > /Users/jcohen/workspace/external/aurora-packaging/artifacts/aurora-ubuntu-trusty/dist/aurora-scheduler_0.16.0_amd64.deb > {quote} > This should actually be: > {quote} > b6203f169df44d9a91df3dfe4670950c3ab49eb4 aurora-scheduler_0.16.0_amd64.deb > {quote} > NOTE: The checksum themselves seem to be correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1815) Fix checksums for packages on bintray
[ https://issues.apache.org/jira/browse/AURORA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15655788#comment-15655788 ] Zameer Manji commented on AURORA-1815: -- Seems like a problem with the script/tooling. We should fix this before 0.17 and figure out how to fix the old shas. > Fix checksums for packages on bintray > - > > Key: AURORA-1815 > URL: https://issues.apache.org/jira/browse/AURORA-1815 > Project: Aurora > Issue Type: Story > Components: Packaging >Affects Versions: 0.16.0 >Reporter: Thomas Bach >Priority: Minor > Fix For: 0.17.0 > > > The checksum files on bintray are wrong. Take for example the content of > {{aurora-scheduler_0.16.0_amd64.deb.sha}}: > {quote} > b6203f169df44d9a91df3dfe4670950c3ab49eb4 > /Users/jcohen/workspace/external/aurora-packaging/artifacts/aurora-ubuntu-trusty/dist/aurora-scheduler_0.16.0_amd64.deb > {quote} > This should actually be: > {quote} > b6203f169df44d9a91df3dfe4670950c3ab49eb4 aurora-scheduler_0.16.0_amd64.deb > {quote} > NOTE: The checksums themselves seem to be correct. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
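The problem in the .sha files above is that the second column embeds the absolute path from the build machine; tools like `sha1sum -c` echo that path back verbatim, so verification only works if the artifact sits at that exact path. Emitting the basename fixes this. A small Python sketch of producing the corrected line (the helper name `sha1_line` is illustrative, not part of the packaging tooling):

```python
# Sketch: emit a checksum line of the form "<sha1>  <basename>" so that
# `sha1sum -c` can verify the artifact from the directory that contains
# it, instead of embedding the absolute path from the build machine.
import hashlib
import os

def sha1_line(path):
    h = hashlib.sha1()
    with open(path, "rb") as fp:
        for chunk in iter(lambda: fp.read(65536), b""):
            h.update(chunk)
    return "%s  %s" % (h.hexdigest(), os.path.basename(path))
```

Since the digests themselves are correct, the old files could in principle be repaired by rewriting just the filename column.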
[jira] [Updated] (AURORA-1812) Upgrading scheduler multiple times in succession can lead to incompatible snapshot restore
[ https://issues.apache.org/jira/browse/AURORA-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji updated AURORA-1812: - Fix Version/s: 0.17.0 > Upgrading scheduler multiple times in succession can lead to incompatible > snapshot restore > --- > > Key: AURORA-1812 > URL: https://issues.apache.org/jira/browse/AURORA-1812 > Project: Aurora > Issue Type: Bug > Components: Scheduler >Affects Versions: 0.14.0 > Environment: Mesos-0.27.2 aurora-scheduler-0.14.0 >Reporter: Patrick Veasey >Priority: Minor > Fix For: 0.17.0 > > > When upgrading scheduler multiple times in a row there can be a situation > where the snapshot is restored is from an incompatible version. Which will > cause scheduler to fail to start, with SQL exceptions. Workaround is to > ensure the most current snapshot was taken by the current version of aurora, > either by manually starting snapshot or setting dlog_snapshot_interval to a > low timeframe. > Log of failure can be found here: > https://gist.github.com/Pveasey/4ca1ad4d3ded21cd6e1674f20a8a4af3 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1812) Upgrading scheduler multiple times in succession can lead to incompatible snapshot restore
[ https://issues.apache.org/jira/browse/AURORA-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652305#comment-15652305 ] Zameer Manji commented on AURORA-1812: -- I've put it on the list for 0.17. The fix could be changing our docs to say upgrading from old versions requires the operator to trigger a snapshot manually from `aurora_admin`, and from 0.17+ they don't need to do that. > Upgrading scheduler multiple times in succession can lead to incompatible > snapshot restore > --- > > Key: AURORA-1812 > URL: https://issues.apache.org/jira/browse/AURORA-1812 > Project: Aurora > Issue Type: Bug > Components: Scheduler >Affects Versions: 0.14.0 > Environment: Mesos-0.27.2 aurora-scheduler-0.14.0 >Reporter: Patrick Veasey >Priority: Minor > > When upgrading the scheduler multiple times in a row there can be a situation > where the snapshot that is restored is from an incompatible version, which will > cause the scheduler to fail to start with SQL exceptions. The workaround is to > ensure the most current snapshot was taken by the current version of Aurora, > either by manually starting a snapshot or setting dlog_snapshot_interval to a > low timeframe. > Log of failure can be found here: > https://gist.github.com/Pveasey/4ca1ad4d3ded21cd6e1674f20a8a4af3 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AURORA-1814) Consider supporting PARTITION_AWARE capability
Zameer Manji created AURORA-1814: Summary: Consider supporting PARTITION_AWARE capability Key: AURORA-1814 URL: https://issues.apache.org/jira/browse/AURORA-1814 Project: Aurora Issue Type: Task Reporter: Zameer Manji Mesos 1.1.0 comes with a new capability called {{PARTITION_AWARE}}. If we opt in the following states would replace {{TASK_LOST}} {noformat} TASK_DROPPED TASK_UNREACHABLE TASK_GONE TASK_GONE_BY_OPERATOR TASK_UNKNOWN {noformat} We should consider adopting this. Even if the initial cut is just mapping all of those new states to {{TASK_LOST}} internally. These new states might simplify our reconciliation code. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
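The "initial cut" proposed in the issue above is just a mapping: collapse each of the new fine-grained partition-aware states back to {{TASK_LOST}} so the rest of the scheduler is untouched. A sketch of that mapping (the state names come from the issue; the function name `normalize_state` and string representation are illustrative, not Aurora or Mesos API):

```python
# Initial-cut sketch: collapse the PARTITION_AWARE states introduced in
# Mesos 1.1.0 back to TASK_LOST internally, so existing reconciliation
# and state-machine code keeps working unchanged.
PARTITION_AWARE_TO_LOST = {
    "TASK_DROPPED": "TASK_LOST",
    "TASK_UNREACHABLE": "TASK_LOST",
    "TASK_GONE": "TASK_LOST",
    "TASK_GONE_BY_OPERATOR": "TASK_LOST",
    "TASK_UNKNOWN": "TASK_LOST",
}

def normalize_state(mesos_state):
    # States not in the map (TASK_RUNNING, TASK_FINISHED, ...) pass through.
    return PARTITION_AWARE_TO_LOST.get(mesos_state, mesos_state)
```

Later, individual states (for example {{TASK_UNREACHABLE}} vs {{TASK_GONE}}) could be peeled off from this map one at a time to simplify reconciliation, which is the longer-term benefit the issue alludes to.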
[jira] [Commented] (AURORA-1800) Support Mesos Maintenance primitives
[ https://issues.apache.org/jira/browse/AURORA-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652287#comment-15652287 ] Zameer Manji commented on AURORA-1800: -- Mesos 1.1.0 comes with a new HTTP based driver. I think this is blocked on upgrading to that first. > Support Mesos Maintenance primitives > > > Key: AURORA-1800 > URL: https://issues.apache.org/jira/browse/AURORA-1800 > Project: Aurora > Issue Type: Story > Components: Maintenance >Reporter: Ankit Khera > > Support Mesos Maintenance primitives > Mesos 0.25.0 introduced the notion of maintenance primitives, with which > operators can post a maintenance schedule for machines. > More details here: http://mesos.apache.org/documentation/latest/maintenance/ > This request is to have Aurora start using these primitives and drain machines > in an SLA-aware manner. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AURORA-1813) Bump Mesos support to 1.1.0
Zameer Manji created AURORA-1813: Summary: Bump Mesos support to 1.1.0 Key: AURORA-1813 URL: https://issues.apache.org/jira/browse/AURORA-1813 Project: Aurora Issue Type: Task Reporter: Zameer Manji RC3 is out for Mesos 1.1.0 and it looks like it is going to pass; we should bump our support to it in 0.17. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AURORA-1813) Bump Mesos support to 1.1.0
[ https://issues.apache.org/jira/browse/AURORA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji updated AURORA-1813: - Fix Version/s: 0.17.0 > Bump Mesos support to 1.1.0 > --- > > Key: AURORA-1813 > URL: https://issues.apache.org/jira/browse/AURORA-1813 > Project: Aurora > Issue Type: Task >Reporter: Zameer Manji > Fix For: 0.17.0 > > > RC3 is out for Mesos 1.1.0 and it looks like it is going to pass; we should > bump our support to it in 0.17. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1812) Upgrading scheduler multiple times in succession can lead to incompatible snapshot restore
[ https://issues.apache.org/jira/browse/AURORA-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652255#comment-15652255 ] Zameer Manji commented on AURORA-1812: -- [~joshua.cohen] [~StephanErb] Maybe we can fix this by having the scheduler take a (new) snapshot right after recovery if there was schema migrations? > Upgrading scheduler multiple times in succession can lead to incompatible > snapshot restore > --- > > Key: AURORA-1812 > URL: https://issues.apache.org/jira/browse/AURORA-1812 > Project: Aurora > Issue Type: Bug > Components: Scheduler >Affects Versions: 0.14.0 > Environment: Mesos-0.27.2 aurora-scheduler-0.14.0 >Reporter: Patrick Veasey >Priority: Minor > > When upgrading scheduler multiple times in a row there can be a situation > where the snapshot is restored is from an incompatible version. Which will > cause scheduler to fail to start, with SQL exceptions. Workaround is to > ensure the most current snapshot was taken by the current version of aurora, > either by manually starting snapshot or setting dlog_snapshot_interval to a > low timeframe. > Log of failure can be found here: > https://gist.github.com/Pveasey/4ca1ad4d3ded21cd6e1674f20a8a4af3 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AURORA-1809) Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed
[ https://issues.apache.org/jira/browse/AURORA-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji updated AURORA-1809: - Fix Version/s: 0.17.0 > Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed > --- > > Key: AURORA-1809 > URL: https://issues.apache.org/jira/browse/AURORA-1809 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji > Fix For: 0.17.0 > > > If you run it as part of the full test suite it fails like this: > {noformat} > FAILURES > __ TestRunnerKillProcessGroup.test_pg_is_killed __ > > self = object at 0x7f0c79893e10> > > def test_pg_is_killed(self): > runner = self.start_runner() > tm = TaskMonitor(runner.tempdir, > runner.task_id) > self.wait_until_running(tm) > process_state, run_number = > tm.get_active_processes()[0] > assert process_state.process == 'process' > assert run_number == 0 > > child_pidfile = os.path.join(runner.sandbox, > runner.task_id, 'child.txt') > while not os.path.exists(child_pidfile): > time.sleep(0.1) > parent_pidfile = os.path.join(runner.sandbox, > runner.task_id, 'parent.txt') > while not os.path.exists(parent_pidfile): > time.sleep(0.1) > with open(child_pidfile) as fp: > child_pid = int(fp.read().rstrip()) > with open(parent_pidfile) as fp: > parent_pid = int(fp.read().rstrip()) > > ps = ProcessProviderFactory.get() > ps.collect_all() > assert parent_pid in ps.pids() > assert child_pid in ps.pids() > assert child_pid in > ps.children_of(parent_pid) > > with open(os.path.join(runner.sandbox, > runner.task_id, 'exit.txt'), 'w') as fp: > fp.write('go away!') > > while tm.task_state() is not > TaskState.SUCCESS: > time.sleep(0.1) > > state = tm.get_state() > assert state.processes['process'][0].state == > ProcessState.SUCCESS > >
ps.collect_all() > assert parent_pid not in ps.pids() > > assert child_pid not in ps.pids() > E assert 30475 not in set([1, 2, 3, 5, 7, > 8, ...]) > E + where set([1, 2, 3, 5, 7, 8, ...]) = > at 0x7f0c798b1990>>() > E +where ProcessProvider_Procfs.pids of > at 0x7f0c798b1990>> = > at 0x7f0c798b1990>.pids > > > src/test/python/apache/thermos/core/test_staged_kill.py:287: AssertionError > -- Captured stderr call -- > WARNING:root:Could not read from checkpoint > /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner > WARNING:root:Could not read from checkpoint > /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner > WARNING:root:Could not read from checkpoint > /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner > WARNING:root:Could not read from checkpoint > /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner > WARNING:root:Could not read from checkpoint > /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner > WARNING:root:Could not read from checkpoint > /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner > WARNING:root:Could not read from checkpoint > /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner > generated xml file: >
[jira] [Created] (AURORA-1809) Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed
Zameer Manji created AURORA-1809: Summary: Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed Key: AURORA-1809 URL: https://issues.apache.org/jira/browse/AURORA-1809 Project: Aurora Issue Type: Bug Reporter: Zameer Manji If you run it as part of the full test suite it fails like this: {noformat} FAILURES __ TestRunnerKillProcessGroup.test_pg_is_killed __ self = def test_pg_is_killed(self): runner = self.start_runner() tm = TaskMonitor(runner.tempdir, runner.task_id) self.wait_until_running(tm) process_state, run_number = tm.get_active_processes()[0] assert process_state.process == 'process' assert run_number == 0 child_pidfile = os.path.join(runner.sandbox, runner.task_id, 'child.txt') while not os.path.exists(child_pidfile): time.sleep(0.1) parent_pidfile = os.path.join(runner.sandbox, runner.task_id, 'parent.txt') while not os.path.exists(parent_pidfile): time.sleep(0.1) with open(child_pidfile) as fp: child_pid = int(fp.read().rstrip()) with open(parent_pidfile) as fp: parent_pid = int(fp.read().rstrip()) ps = ProcessProviderFactory.get() ps.collect_all() assert parent_pid in ps.pids() assert child_pid in ps.pids() assert child_pid in ps.children_of(parent_pid) with open(os.path.join(runner.sandbox, runner.task_id, 'exit.txt'), 'w') as fp: fp.write('go away!') while tm.task_state() is not TaskState.SUCCESS: time.sleep(0.1) state = tm.get_state() assert state.processes['process'][0].state == ProcessState.SUCCESS ps.collect_all() assert parent_pid not in ps.pids() > assert child_pid not in ps.pids() E assert 30475 not in set([1, 2, 3, 5, 7, 8, ...]) E + where set([1, 2, 3, 5, 7, 8, ...]) = >() E +where > = .pids src/test/python/apache/thermos/core/test_staged_kill.py:287:
AssertionError -- Captured stderr call -- WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner WARNING:root:Could not read from checkpoint /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner generated xml file: /home/jenkins/jenkins-slave/workspace/AuroraBot/dist/test-results/415337499eb72578eab327a6487c1f5c9452b3d6.xml 1 failed, 719 passed, 6 skipped, 1 warnings in 206.00 seconds FAILURE {noformat} If you run the test as a one-off you see this: {noformat} 00:45:32 00:00 [main] (To run a reporting server: ./pants server) 00:45:32 00:00 [setup] 00:45:32 00:00 [parse]fatal: Not a git repository (or any of the parent directories): .git
[jira] [Commented] (AURORA-1808) Thermos executor should send SIGTERM to daemonized processes
[ https://issues.apache.org/jira/browse/AURORA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637620#comment-15637620 ] Zameer Manji commented on AURORA-1808: -- https://github.com/apache/aurora/commit/5410c229f30d6d8e331cdddc5c84b9b2b5313c01 > Thermos executor should send SIGTERM to daemonized processes > - > > Key: AURORA-1808 > URL: https://issues.apache.org/jira/browse/AURORA-1808 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji > > Thermos loses track of double forking processes, meaning on task teardown > the daemonized process will not receive a signal to shut down cleanly. > This can be a serious issue if one is running two processes: > 1. nginx which demonizes and accepts HTTP requests. > 2. A backend processes that receives traffic from nginx over a local socket. > On task shutdown thermos will send SIGTERM to 2 and not 1, causing nginx to > still accept traffic even though the backend is dead. If thermos could also > send SIGTERM to 1, the task would tear down cleanly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
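A common remedy for losing track of a double-forking process, like the nginx case described above, is to start each task's processes in their own session (and hence process group) and signal the whole group at teardown, so daemonized children still receive SIGTERM. This sketch shows the general POSIX mechanism with `os.setsid`/`os.killpg`; it is illustrative, not the actual Thermos change in the commit linked above.

```python
# Sketch: launch a command in its own session so it and anything it
# forks (including a daemonizing child) share one process group, then
# signal the entire group at teardown instead of a single pid.
import os
import signal
import subprocess

def start_in_own_group(argv):
    # setsid in the child makes it a session/process-group leader,
    # so descendants inherit the same process group id.
    return subprocess.Popen(argv, preexec_fn=os.setsid)

def terminate_group(proc):
    # Deliver SIGTERM to every process in the group at once.
    os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
```

With this shape, a backend and a daemonized frontend in the same group are both signaled together, avoiding the half-torn-down state the issue describes.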
[jira] [Commented] (AURORA-1792) Executor does not log full task information.
[ https://issues.apache.org/jira/browse/AURORA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634819#comment-15634819 ] Zameer Manji commented on AURORA-1792: -- https://reviews.apache.org/r/53452/ > Executor does not log full task information. > > > Key: AURORA-1792 > URL: https://issues.apache.org/jira/browse/AURORA-1792 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji > > I launched a task that has an {{initial_interval_secs}} in the health check > config. However the log contains no information about this field: > {noformat} > $ grep "initial_interval_secs" __main__.log > {noformat} > We should log the entire ExecutorInfo blob. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (AURORA-1792) Executor does not log full task information.
[ https://issues.apache.org/jira/browse/AURORA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji reassigned AURORA-1792: Assignee: Zameer Manji > Executor does not log full task information. > > > Key: AURORA-1792 > URL: https://issues.apache.org/jira/browse/AURORA-1792 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji > > I launched a task that has an {{initial_interval_secs}} in the health check > config. However the log contains no information about this field: > {noformat} > $ grep "initial_interval_secs" __main__.log > {noformat} > We should log the entire ExecutorInfo blob. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1780) Offers with unknown resource types to Aurora crash the scheduler
[ https://issues.apache.org/jira/browse/AURORA-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634508#comment-15634508 ] Zameer Manji commented on AURORA-1780: -- Yes, that is the most desirable course of action for now. > Offers with unknown resource types to Aurora crash the scheduler > - > > Key: AURORA-1780 > URL: https://issues.apache.org/jira/browse/AURORA-1780 > Project: Aurora > Issue Type: Bug > Environment: vagrant >Reporter: Renan DelValle > > Taking offers from Agents which have resources that are not known to Aurora > causes the Scheduler to crash. > Steps to reproduce: > {code} > vagrant up > sudo service mesos-slave stop > echo > "cpus(aurora-role):0.5;cpus(*):3.5;mem(aurora-role):1024;disk:2;gpus(*):4;test:200" > | sudo tee /etc/mesos-slave/resources > sudo rm -f /var/lib/mesos/meta/slaves/latest > sudo service mesos-slave start > {code} > Wait a few moments for the offer to be made to Aurora > {code} > I0922 02:41:57.839 [Thread-19, MesosSchedulerImpl:142] Received notification > of lost agent: value: "cadaf569-171d-42fc-a417-fbd608ea5bab-S0" > I0922 02:42:30.585597 2999 log.cpp:577] Attempting to append 109 bytes to > the log > I0922 02:42:30.585654 2999 coordinator.cpp:348] Coordinator attempting to > write APPEND action at position 4 > I0922 02:42:30.585747 2999 replica.cpp:537] Replica received write request > for position 4 from (10)@192.168.33.7:8083 > I0922 02:42:30.586858 2999 leveldb.cpp:341] Persisting action (125 bytes) to > leveldb took 1.086601ms > I0922 02:42:30.586897 2999 replica.cpp:712] Persisted action at 4 > I0922 02:42:30.587020 2999 replica.cpp:691] Replica received learned notice > for position 4 from @0.0.0.0:0 > I0922 02:42:30.587785 2999 leveldb.cpp:341] Persisting action (127 bytes) to > leveldb took 746999ns > I0922 02:42:30.587805 2999 replica.cpp:712] Persisted action at 4 > I0922 02:42:30.587811 2999 replica.cpp:697] Replica learned APPEND action at > position 4 > I0922 02:42:30.601
[SchedulerImpl-0, OfferManager$OfferManagerImpl:185] > Returning offers for cadaf569-171d-42fc-a417-fbd608ea5bab-S1 for compaction. > Sep 22, 2016 2:42:38 AM > com.google.common.util.concurrent.ServiceManager$ServiceListener failed > SEVERE: Service SlotSizeCounterService [FAILED] has failed in the RUNNING > state. > java.lang.NullPointerException: Unknown Mesos resource: name: "test" > type: SCALAR > scalar { > value: 200.0 > } > role: "*" > at java.util.Objects.requireNonNull(Objects.java:228) > at > org.apache.aurora.scheduler.resources.ResourceType.fromResource(ResourceType.java:355) > at > org.apache.aurora.scheduler.resources.ResourceManager.lambda$static$0(ResourceManager.java:52) > at com.google.common.collect.Iterators$7.computeNext(Iterators.java:675) > at > com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143) > at > com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138) > at java.util.Iterator.forEachRemaining(Iterator.java:115) > at > java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) > at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) > at > java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) > at > java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) > at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) > at > java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) > at > org.apache.aurora.scheduler.resources.ResourceManager.bagFromResources(ResourceManager.java:274) > at > org.apache.aurora.scheduler.resources.ResourceManager.bagFromMesosResources(ResourceManager.java:239) > at > org.apache.aurora.scheduler.stats.AsyncStatsModule$OfferAdapter.get(AsyncStatsModule.java:153) > at > org.apache.aurora.scheduler.stats.SlotSizeCounter.run(SlotSizeCounter.java:168) > at > 
org.apache.aurora.scheduler.stats.AsyncStatsModule$SlotSizeCounterService.runOneIteration(AsyncStatsModule.java:130) > at > com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:189) > at com.google.common.util.concurrent.Callables$3.run(Callables.java:100) > at > java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180) > at > java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294) >
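The trace above shows `ResourceType.fromResource` raising on a resource name the scheduler does not model ("test"), which takes down `SlotSizeCounterService`. A tolerant alternative is to skip unrecognized resources and log them instead of failing. This is an illustrative sketch only; the function name, resource set, and input shape are assumptions, not Aurora's actual API:

```python
import logging

# Resource names the scheduler knows how to aggregate (hypothetical subset).
KNOWN_RESOURCES = {"cpus", "mem", "disk", "gpus", "ports"}

def bag_from_resources(resources):
    """Aggregate (name, scalar) offer resources, skipping unknown types
    instead of raising on them -- the strict version crashes the stats
    service when an agent advertises a custom resource."""
    bag = {}
    for name, value in resources:
        if name not in KNOWN_RESOURCES:
            logging.warning("Ignoring unknown Mesos resource: %s", name)
            continue
        bag[name] = bag.get(name, 0) + value
    return bag
```

With the offer from the reproduction steps, the custom `test:200` resource would be logged and dropped rather than crashing the scheduler.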
[jira] [Updated] (AURORA-1808) Thermos executor should send SIGTERM to daemonized processes
[ https://issues.apache.org/jira/browse/AURORA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji updated AURORA-1808: - Description: Thermos loses track of double-forking processes, meaning on task teardown the daemonized process will not receive a signal to shut down cleanly. This can be a serious issue if one is running two processes: 1. nginx, which daemonizes and accepts HTTP requests. 2. A backend process that receives traffic from nginx over a local socket. On task shutdown thermos will send SIGTERM to 2 and not 1, causing nginx to still accept traffic even though the backend is dead. If thermos could also send SIGTERM to 1, the task would tear down cleanly. was: Thermos loses track of double forking processes, meaning on task teardown the daemonized process will not receive a signal to shut down cleanly. This can be a serious issue if one is running two processes: 1. nginx which demonizes and accepts HTTP requests. 2. A back and processes that receives traffic from nginx over a local socket. On task shutdown thermos will send SIGTERM to 2 and not 1, causing nginx to still accept traffic even though the backend is dead. If thermos could also send SIGTERM to 1, the task would tear down cleanly. > Thermos executor should send SIGTERM to daemonized processes > - > > Key: AURORA-1808 > URL: https://issues.apache.org/jira/browse/AURORA-1808 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji > > Thermos loses track of double-forking processes, meaning on task teardown > the daemonized process will not receive a signal to shut down cleanly. > This can be a serious issue if one is running two processes: > 1. nginx, which daemonizes and accepts HTTP requests. > 2. A backend process that receives traffic from nginx over a local socket. > On task shutdown thermos will send SIGTERM to 2 and not 1, causing nginx to > still accept traffic even though the backend is dead.
If thermos could also > send SIGTERM to 1, the task would tear down cleanly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1808) Thermos executor should send SIGTERM to daemonized processes
[ https://issues.apache.org/jira/browse/AURORA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15630367#comment-15630367 ] Zameer Manji commented on AURORA-1808: -- WIP Solution here: https://reviews.apache.org/r/53403/ > Thermos executor should send SIGTERM to daemonized processes > - > > Key: AURORA-1808 > URL: https://issues.apache.org/jira/browse/AURORA-1808 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Zameer Manji > > Thermos loses track of double-forking processes, meaning on task teardown > the daemonized process will not receive a signal to shut down cleanly. > This can be a serious issue if one is running two processes: > 1. nginx, which daemonizes and accepts HTTP requests. > 2. A backend process that receives traffic from nginx over a local socket. > On task shutdown thermos will send SIGTERM to 2 and not 1, causing nginx to > still accept traffic even though the backend is dead. If thermos could also > send SIGTERM to 1, the task would tear down cleanly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AURORA-1808) Thermos executor should send SIGTERM to daemonized processes
Zameer Manji created AURORA-1808: Summary: Thermos executor should send SIGTERM to daemonized processes Key: AURORA-1808 URL: https://issues.apache.org/jira/browse/AURORA-1808 Project: Aurora Issue Type: Bug Reporter: Zameer Manji Assignee: Zameer Manji Thermos loses track of double-forking processes, meaning on task teardown the daemonized process will not receive a signal to shut down cleanly. This can be a serious issue if one is running two processes: 1. nginx, which daemonizes and accepts HTTP requests. 2. A backend process that receives traffic from nginx over a local socket. On task shutdown thermos will send SIGTERM to 2 and not 1, causing nginx to still accept traffic even though the backend is dead. If thermos could also send SIGTERM to 1, the task would tear down cleanly. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1107) Add support for mounting task specified external volumes into containers
[ https://issues.apache.org/jira/browse/AURORA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623909#comment-15623909 ] Zameer Manji commented on AURORA-1107: -- DSL + e2e tests https://reviews.apache.org/r/5/ > Add support for mounting task specified external volumes into containers > > > Key: AURORA-1107 > URL: https://issues.apache.org/jira/browse/AURORA-1107 > Project: Aurora > Issue Type: Task > Components: Docker >Reporter: Steve Niemitz >Assignee: Zameer Manji >Priority: Minor > > The Mesos docker API allows specifying volumes on the host to mount into the > container when it runs. We should expose this. I propose: > - Add a volumes() set to the Docker object in base.py > - Add a similar set to the DockerContainer struct in api.thrift > - Create a way for administrators to restrict the ability to use this. > Because mounts are set up by the docker daemon, they effectively allow > someone who can configure mounts to access anything on the machine. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1805) Enhance `Process` object to allow easier access to environment variables
[ https://issues.apache.org/jira/browse/AURORA-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613357#comment-15613357 ] Zameer Manji commented on AURORA-1805: -- It still suffers the same string interpolation issues as constructing the command line. > Enhance `Process` object to allow easier access to environment variables > > > Key: AURORA-1805 > URL: https://issues.apache.org/jira/browse/AURORA-1805 > Project: Aurora > Issue Type: Task > Components: Thermos >Reporter: Zameer Manji > > The thermos DSL: > {noformat} > class Process(Struct): > cmdline = Required(String) > name= Required(String) > # This is currently unused but reserved for future use by Thermos. > resources = Resources > # optionals > max_failures = Default(Integer, 1) # maximum number of failed process > runs ># before process is failed. > daemon= Default(Boolean, False) > ephemeral = Default(Boolean, False) > min_duration = Default(Integer, 5) # integer seconds > final = Default(Boolean, False) # if this process should be a > finalizing process ># that should always be run after > regular processes > logger= Default(Logger, Empty) > {noformat} > If we can add a new field: > {noformat} > environment = Default(Map(String, String), {}) > {noformat} > It will make it much easier to add environment variables. > Right now the solution is to prefix environment variables to cmdline which > can get janky and frustrating with the string interpolation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
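The interpolation problem with prefixing variables onto `cmdline` can be seen by sketching how a hypothetical `environment = Default(Map(String, String), {})` field might be rendered into a shell command. This is an illustrative sketch, not the Thermos implementation:

```python
import shlex

def cmdline_with_environment(cmdline, environment):
    """Render a command line with environment variables prepended, the
    way a proposed `environment` map might be expanded. shlex.quote
    handles the escaping users currently have to do by hand when they
    prefix variables onto cmdline themselves."""
    prefix = " ".join(
        "%s=%s" % (key, shlex.quote(value))
        for key, value in sorted(environment.items()))
    return ("%s %s" % (prefix, cmdline)) if prefix else cmdline
```

Quoting values centrally is exactly what makes a first-class `environment` field less janky than hand-built `FOO=bar cmdline` prefixes.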
[jira] [Updated] (AURORA-1762) /pendingtasks endpoint should show reason tasks are pending
[ https://issues.apache.org/jira/browse/AURORA-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji updated AURORA-1762: - Assignee: Pradyumna Kaushik > /pendingtasks endpoint should show reason tasks are pending > --- > > Key: AURORA-1762 > URL: https://issues.apache.org/jira/browse/AURORA-1762 > Project: Aurora > Issue Type: Task >Reporter: David Robinson >Assignee: Pradyumna Kaushik >Priority: Minor > Labels: newbie > > the /pendingtasks endpoint is essentially useless as is, it shows that tasks > are pending but doesn't show why. The information is also not easily > discovered via the /scheduler UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AURORA-1805) Enhance `Process` object to allow easier access to environment variables
[ https://issues.apache.org/jira/browse/AURORA-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji updated AURORA-1805: - Summary: Enhance `Process` object to allow easier access to environment variables (was: Enhance `Process` object to allow easier access) > Enhance `Process` object to allow easier access to environment variables > > > Key: AURORA-1805 > URL: https://issues.apache.org/jira/browse/AURORA-1805 > Project: Aurora > Issue Type: Task > Components: Thermos >Reporter: Zameer Manji > > The thermos DSL: > {noformat} > class Process(Struct): > cmdline = Required(String) > name= Required(String) > # This is currently unused but reserved for future use by Thermos. > resources = Resources > # optionals > max_failures = Default(Integer, 1) # maximum number of failed process > runs ># before process is failed. > daemon= Default(Boolean, False) > ephemeral = Default(Boolean, False) > min_duration = Default(Integer, 5) # integer seconds > final = Default(Boolean, False) # if this process should be a > finalizing process ># that should always be run after > regular processes > logger= Default(Logger, Empty) > {noformat} > If we can add a new field: > {noformat} > environment = Default(Map(String, String), {}) > {noformat} > It will make it much easier to add environment variables. > Right now the solution is to prefix environment variables to cmdline which > can get janky and frustrating with the string interpolation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AURORA-1805) Enhance `Process` object to allow easier access
[ https://issues.apache.org/jira/browse/AURORA-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji updated AURORA-1805: - Description: The thermos DSL: {noformat} class Process(Struct): cmdline = Required(String) name= Required(String) # This is currently unused but reserved for future use by Thermos. resources = Resources # optionals max_failures = Default(Integer, 1) # maximum number of failed process runs # before process is failed. daemon= Default(Boolean, False) ephemeral = Default(Boolean, False) min_duration = Default(Integer, 5) # integer seconds final = Default(Boolean, False) # if this process should be a finalizing process # that should always be run after regular processes logger= Default(Logger, Empty) {noformat} If we can add a new field: {noformat} environment = Default(Map(String, String), {}) {noformat} It will make it much easier to add environment variables. Right now the solution is to prefix environment variables to cmdline which can get janky and frustrating with the string interpolation. was: The thermos DSL: {noformat} class Process(Struct): cmdline = Required(String) name= Required(String) # This is currently unused but reserved for future use by Thermos. resources = Resources # optionals max_failures = Default(Integer, 1) # maximum number of failed process runs # before process is failed. daemon= Default(Boolean, False) ephemeral = Default(Boolean, False) min_duration = Default(Integer, 5) # integer seconds final = Default(Boolean, False) # if this process should be a finalizing process # that should always be run after regular processes logger= Default(Logger, Empty) {noformat} If we can add a new field: {noformat} process = Default(Map(String, String), {}) {noformat} It will make it much easier to add environment variables. Right now the solution is to prefix environment variables to cmdline which can get janky and frustrating with the string interpolation. 
> Enhance `Process` object to allow easier access > --- > > Key: AURORA-1805 > URL: https://issues.apache.org/jira/browse/AURORA-1805 > Project: Aurora > Issue Type: Task > Components: Thermos >Reporter: Zameer Manji > > The thermos DSL: > {noformat} > class Process(Struct): > cmdline = Required(String) > name= Required(String) > # This is currently unused but reserved for future use by Thermos. > resources = Resources > # optionals > max_failures = Default(Integer, 1) # maximum number of failed process > runs ># before process is failed. > daemon= Default(Boolean, False) > ephemeral = Default(Boolean, False) > min_duration = Default(Integer, 5) # integer seconds > final = Default(Boolean, False) # if this process should be a > finalizing process ># that should always be run after > regular processes > logger= Default(Logger, Empty) > {noformat} > If we can add a new field: > {noformat} > environment = Default(Map(String, String), {}) > {noformat} > It will make it much easier to add environment variables. > Right now the solution is to prefix environment variables to cmdline which > can get janky and frustrating with the string interpolation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1802) AttributeAggregate slows down scheduling of jobs with many instances
[ https://issues.apache.org/jira/browse/AURORA-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609746#comment-15609746 ] Zameer Manji commented on AURORA-1802: -- Thanks for the analysis [~StephanErb]! I think reducing the number of SQL queries would yield the most benefit but we should implement all three of them. > AttributeAggregate slows down scheduling of jobs with many instances > > > Key: AURORA-1802 > URL: https://issues.apache.org/jira/browse/AURORA-1802 > Project: Aurora > Issue Type: Bug > Components: Scheduler >Reporter: Stephan Erb > > The current implementation of > [{{AttributeAggregate}}|https://github.com/apache/aurora/blob/f559e930659e25b3d7cacb7b845ebda50d18d66a/src/main/java/org/apache/aurora/scheduler/filter/AttributeAggregate.java] > slows down scheduling of jobs with many instances. Interestingly, this is > currently not visible in our job scheduling benchmark results as it only > affects the benchmark setup time but not the measured part. > {{AttributeAggregate}} relies on {{Suppliers.memoize}} to ensure that it is > only computed once and only when necessary. This has probably been done > because the factory > [{{AttributeAggregate.getJobActiveState}}|https://github.com/apache/aurora/blob/f559e930659e25b3d7cacb7b845ebda50d18d66a/src/main/java/org/apache/aurora/scheduler/filter/AttributeAggregate.java#L56-L91] > is slow. > After some recent changes to schedule multiple task instances per scheduling > round the aggregate is computed in each scheduling round via the call > [{{resourceRequest.getJobState().updateAttributeAggregate(...)}} > |https://github.com/apache/aurora/blob/f559e930659e25b3d7cacb7b845ebda50d18d66a/src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java#L173] > in {{TaskAssigner}}. This means the expensive factory is called once per > scheduling round. > h3. 
Potential improvements > * the current factory implementation performs one {{fetchTasks}} query > followed by {{n}} distinct {{getHostAttributes}} queries. This could be > reduced to a single SQL query. > * the aggregate makes heavy use of {{ImmutableMultiset}} even though it is > not immutable any more. There is potential room for improvement here. > * The aggregate uses suppliers to perform a lazy instantiation even though > its current usage is not lazy any more. We can either make the implementation > eager, or ensure that the expensive part is only run when absolutely > necessary. > h3. Proof of concept > * 4 mins 23.407 secs -- total runtime of {{./gradlew jmh > -Pbenchmarks='SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark'}} > * 2 mins 40.308 secs -- total runtime of {{./gradlew jmh > -Pbenchmarks='SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark'}} > with [{{resourceRequest.getJobState().updateAttributeAggregate(...)}} > |https://github.com/apache/aurora/blob/f559e930659e25b3d7cacb7b845ebda50d18d66a/src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java#L173] > commented out. This works as the call is not necessary when only a single > instance is scheduled per scheduling round, as done in the benchmarks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1801) TaskObserver thread stops refreshing after filesystem race condition
[ https://issues.apache.org/jira/browse/AURORA-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609736#comment-15609736 ] Zameer Manji commented on AURORA-1801: -- I am a big fan of making the process fail if the `TaskObserver` thread fails. That matches up with patterns elsewhere in the code. We can prevent the race condition too. > TaskObserver thread stops refreshing after filesystem race condition > > > Key: AURORA-1801 > URL: https://issues.apache.org/jira/browse/AURORA-1801 > Project: Aurora > Issue Type: Bug > Components: Observer >Reporter: Stephan Erb > > It seems that a race condition accessing the Mesos filesystem layout can > bubble up and terminate the {{TaskObserver}} thread responsible for > refreshing the internal data structure of available tasks. Restarting the > observer fixes the problem. > Exception triggering the issue: > {code} > Traceback (most recent call last): > File > "/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.bce9e54ac7cded79a75603fb4e6bcef2c7d1e6bc/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", > line 126, in _excepting_run > self.__real_run(*args, **kw) > File "apache/thermos/observer/task_observer.py", line 135, in run > File "apache/thermos/observer/detector.py", line 74, in refresh > File "apache/thermos/observer/detector.py", line 58, in _refresh_detectors > File "apache/aurora/executor/common/path_detector.py", line 34, in get_paths > File "apache/aurora/executor/common/path_detector.py", line 34, in > File "apache/aurora/executor/common/path_detector.py", line 33, in iterate > File "/usr/lib/python2.7/posixpath.py", line 376, in realpath > resolved = _resolve_link(component) > File "/usr/lib/python2.7/posixpath.py", line 399, in _resolve_link > resolved = os.readlink(path) > OSError: [Errno 2] No such file or directory: >
'/var/lib/mesos/slaves/0768bcb3-205d-4409-a726-3001ad3ef902-S10/frameworks/20151001-085346-58917130-5050-37976-/executors/thermos-role-env-myname-0-f9fe0318-d39f-49d3-bdf8-e954d5879b33/runs/latest' > {code} > Solution space: > * terminate the observer process if the {{TaskOberver}} thread fails > * prevent unknown exceptions from aborting the {{TaskOberver}} run loop > * prevent the observed race condition in {{detector.py}} or > {{path_detector.py}} > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
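Of the options in the solution space, tolerating the race during path resolution can be sketched as follows. `resolve_run_paths` is a hypothetical stand-in for the `path_detector.py` logic, not the actual Aurora code:

```python
import os

def resolve_run_paths(run_links):
    """Resolve each run's 'latest' symlink, tolerating links that
    disappear mid-scan (the race in this ticket: the run directory is
    garbage-collected between listing and resolution)."""
    resolved = []
    for link in run_links:
        try:
            resolved.append(os.path.realpath(link))
        except OSError:
            # The path vanished between listing and resolution; skip it
            # rather than let the exception kill the refresh thread.
            continue
    return resolved
```

Note the traceback above comes from Python 2's `posixpath.realpath`, which called `os.readlink` directly and could raise; catching `OSError` at this boundary keeps the run loop alive either way.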
[jira] [Commented] (AURORA-1380) Upgrade to guice 4.0
[ https://issues.apache.org/jira/browse/AURORA-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15595775#comment-15595775 ] Zameer Manji commented on AURORA-1380: -- The upstream ticket SHIRO-493 has been resolved and an RC/release for shiro 1.4 is expected soon. We will be able to close this ticket then. > Upgrade to guice 4.0 > > > Key: AURORA-1380 > URL: https://issues.apache.org/jira/browse/AURORA-1380 > Project: Aurora > Issue Type: Story > Components: Scheduler >Reporter: Kevin Sweeney >Priority: Critical > > Guice 4.0 has been released. Among the new features, probably the most > significant is Java 8 support - in Guice 3.0 stack traces are obfuscated by > https://github.com/google/guice/issues/757. As our code expands use of > lambdas and method references this will become even more critical. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AURORA-1799) Thermos does not handle low memory scenarios gracefully
Zameer Manji created AURORA-1799: Summary: Thermos does not handle low memory scenarios gracefully Key: AURORA-1799 URL: https://issues.apache.org/jira/browse/AURORA-1799 Project: Aurora Issue Type: Bug Reporter: Zameer Manji Background: In an environment where Aurora is used to launch Docker containers via the DockerContainerizer, it was observed that some tasks would not be killed. What happened is that a task was allocated a low amount of memory but demanded a lot. This caused the Linux OOM killer to be invoked. Unlike the MesosContainerizer, the agent doesn't tear down the container when the OOM killer is invoked. Instead the OOM killer just kills a process in the container, and thermos and mesos are unaware (unless a process directly launched by thermos is killed). I observed in the scheduler logs that the scheduler was trying to kill a container every reconciliation period but it never died. The slave logs indicated it received the killTask RPC and forwarded it to Thermos. The thermos logs had several entries like the following every hour: {noformat} I1018 20:39:18.102894 6 executor_base.py:45] Executor [aaeac4c8-2b2f-4351-874b-a16bea1b36b0-S147]: Activating kill manager. I1018 20:39:18.103034 6 executor_base.py:45] Executor [aaeac4c8-2b2f-4351-874b-a16bea1b36b0-S147]: killTask returned. I1018 21:39:17.859935 6 executor_base.py:45] Executor [aaeac4c8-2b2f-4351-874b-a16bea1b36b0-S147]: killTask got task_id: value: "" {noformat} However, the task was never killed.
Looking at the stderr of thermos I saw the following entries: {noformat} Logged from file resource.py, line 155 Traceback (most recent call last): File "/usr/lib/python2.7/logging/__init__.py", line 883, in emit self.flush() File "/usr/lib/python2.7/logging/__init__.py", line 843, in flush self.stream.flush() IOError: [Errno 12] Cannot allocate memory {noformat} and {noformat} Logged from file thermos_task_runner.py, line 171 Traceback (most recent call last): File "/root/.pex/install/twitter.common.exceptions-0.3.3-py2-none-any.whl.2a67b833b1517d179ef1c8dc6f2dac1023d51e3c/twitter.common.exceptions-0.3.3-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run File "apache/aurora/executor/status_manager.py", line 47, in run File "apache/aurora/executor/common/status_checker.py", line 97, in status File "apache/aurora/executor/thermos_task_runner.py", line 358, in status File "apache/aurora/executor/thermos_task_runner.py", line 186, in compute_status File "apache/aurora/executor/thermos_task_runner.py", line 136, in task_state File "apache/thermos/monitoring/monitor.py", line 118, in task_state File "apache/thermos/monitoring/monitor.py", line 114, in get_state File "apache/thermos/monitoring/monitor.py", line 77, in _apply_states File "/root/.pex/install/twitter.common.recordio-0.3.3-py2-none-any.whl.9f1e9394eca1bc33ad7d10ae3025301866824139/twitter.common.recordio-0.3.3-py2-none-any.whl/twitter/common/recordio/recordio.py", line 182, in try_read class InvalidTypeException(Error): pass File "/root/.pex/install/twitter.common.recordio-0.3.3-py2-none-any.whl.9f1e9394eca1bc33ad7d10ae3025301866824139/twitter.common.recordio-0.3.3-py2-none-any.whl/twitter/common/recordio/recordio.py", line 168, in read return RecordIO.Reader.do_read(self._fp, self._codec) File 
"/root/.pex/install/twitter.common.recordio-0.3.3-py2-none-any.whl.9f1e9394eca1bc33ad7d10ae3025301866824139/twitter.common.recordio-0.3.3-py2-none-any.whl/twitter/common/recordio/recordio.py", line 135, in do_read header = fp.read(RecordIO.RECORD_HEADER_SIZE) File "/root/.pex/install/twitter.common.recordio-0.3.3-py2-none-any.whl.9f1e9394eca1bc33ad7d10ae3025301866824139/twitter.common.recordio-0.3.3-py2-none-any.whl/twitter/common/recordio/filelike.py", line 81, in read return self._fp.read(length) IOError: [Errno 12] Cannot allocate memory {noformat} It seems that through the regular avenues of reading checkpoints or logging data, thermos would get an IOError. Some part of twitter common installs an excepthook to log the exception, but we don't seem to do anything else. I think we should probably install our own exception hook to send a {{LOST_TASK}} with the exception information instead of failing to kill the task. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
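The proposal to install an exception hook can be sketched as a wrapper around `sys.excepthook`. Here `report_task_lost` is a hypothetical callback standing in for whatever would send the LOST status update; this is not the Thermos implementation:

```python
import sys

def install_failure_hook(report_task_lost):
    """Install a sys.excepthook that forwards uncaught exceptions to a
    status callback (e.g. one that emits a LOST update with the
    exception information) before falling through to the previous
    handler, so the failure is reported instead of only logged."""
    previous_hook = sys.excepthook

    def hook(exc_type, exc_value, exc_traceback):
        try:
            report_task_lost("Uncaught exception: %s" % (exc_value,))
        finally:
            # Preserve the existing logging behavior.
            previous_hook(exc_type, exc_value, exc_traceback)

    sys.excepthook = hook
    return hook
```

One caveat under the ENOMEM scenario described above: the callback itself must avoid allocating much, or it can fail for the same reason the original I/O did.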
[jira] [Assigned] (AURORA-1795) Internal server error in scheduler Thrift API on missing Content-Type
[ https://issues.apache.org/jira/browse/AURORA-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji reassigned AURORA-1795: Assignee: Zameer Manji > Internal server error in scheduler Thrift API on missing Content-Type > - > > Key: AURORA-1795 > URL: https://issues.apache.org/jira/browse/AURORA-1795 > Project: Aurora > Issue Type: Bug > Components: Scheduler >Affects Versions: 0.16.0 >Reporter: Stephan Erb >Assignee: Zameer Manji > > This happens if a user has a very old browser, i.e. Firefox 41. > {code} > I1017 09:38:15.618 [qtp1426166274-44336, Slf4jRequestLog:60] 10.x.x.x - - > [17/Oct/2016:09:38:15 +] "POST //foobar.example.org/api HTTP/1.1" 200 794 > W1017 09:38:15.627 [qtp1426166274-44066, ServletHandler:631] /api > java.lang.NullPointerException: null > at java.util.Objects.requireNonNull(Objects.java:203) > ~[na:1.8.0-internal] > at java.util.Optional.(Optional.java:96) ~[na:1.8.0-internal] > at java.util.Optional.of(Optional.java:108) ~[na:1.8.0-internal] > at > org.apache.aurora.scheduler.http.api.TContentAwareServlet.doPost(TContentAwareServlet.java:123) > ~[aurora-0.16.0.jar:na] > at > org.apache.aurora.scheduler.http.api.TContentAwareServlet.doGet(TContentAwareServlet.java:164) > ~[aurora-0.16.0.jar:na] > at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > ~[javax.servlet-api-3.1.0.jar:3.1.0] > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > ~[javax.servlet-api-3.1.0.jar:3.1.0] > at > com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263) > ~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178) > ~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91) > ~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62) > ~[guice-servlet-3.0.jar:na] > at > 
com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168) > ~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > ~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168) > ~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > ~[guice-servlet-3.0.jar:na] > at > org.apache.aurora.scheduler.http.LeaderRedirectFilter.doFilter(LeaderRedirectFilter.java:72) > ~[aurora-0.16.0.jar:na] > at > org.apache.aurora.scheduler.http.AbstractFilter.doFilter(AbstractFilter.java:44) > ~[aurora-0.16.0.jar:na] > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > ~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > ~[guice-servlet-3.0.jar:na] > at > org.apache.aurora.scheduler.http.HttpStatsFilter.doFilter(HttpStatsFilter.java:71) > ~[aurora-0.16.0.jar:na] > at > org.apache.aurora.scheduler.http.AbstractFilter.doFilter(AbstractFilter.java:44) > ~[aurora-0.16.0.jar:na] > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163) > ~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > ~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168) > ~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > ~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168) > ~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > ~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168) > 
~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > ~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168) > ~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58) > ~[guice-servlet-3.0.jar:na] > at > com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118) >
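The trace above shows `Objects.requireNonNull` firing inside the `Optional` constructor reached via `Optional.of`, i.e. TContentAwareServlet wraps a missing Content-Type header with a strict non-null wrapper. A minimal Python sketch of that failure mode and a lenient alternative (function names and the default content type here are illustrative, not Aurora's actual code):

```python
def strict_content_type(headers):
    """Mimics Optional.of: rejects a missing Content-Type header outright."""
    value = headers.get("Content-Type")
    if value is None:
        # In the Java code this is where Objects.requireNonNull throws the NPE.
        raise ValueError("Content-Type header is missing")
    return value

def lenient_content_type(headers, default="application/x-thrift"):
    """Mimics Optional.ofNullable(...).orElse(default): tolerates absence."""
    return headers.get("Content-Type", default)
```

With an old browser that omits Content-Type on POST, the lenient variant keeps the request on a normal handling path instead of surfacing as a 500.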
[jira] [Commented] (AURORA-1796) Several JMH microbenchmarks are failing
[ https://issues.apache.org/jira/browse/AURORA-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583606#comment-15583606 ] Zameer Manji commented on AURORA-1796: -- This is a Guice binding error that is obscured by the fact that we are on JDK 8 but not on Guice 4.0. > Several JMH microbenchmarks are failing > --- > > Key: AURORA-1796 > URL: https://issues.apache.org/jira/browse/AURORA-1796 > Project: Aurora > Issue Type: Bug >Reporter: Stephan Erb > > In the context of https://reviews.apache.org/r/52921/ I tried to run our > micro benchmarks: > * {{UpdateStoreBenchmarks}} seems to work as expected > * {{StatusUpdateBenchmark}} seems to work as expected > * {{TaskStoreBenchmarks}} seems to work as expected. However, the > ops/sec for the h2-based tests seem to be off by a great margin. > * {{SchedulingBenchmarks}} seems to take forever; I aborted it after 4 hours > * {{SnapshotBenchmarks}} fails with the exception below > * {{ThriftApiBenchmarks}} fails with the exception below > This ticket is about the last two failing benchmarks. 
The following > exception is written for each benchmark, indicating a problem in Guice: > {code} > com.google.inject.internal.util.$ComputationException: > java.lang.ArrayIndexOutOfBoundsException: 44204 > at > com.google.inject.internal.util.$MapMaker$StrategyImpl.compute(MapMaker.java:553) > at > com.google.inject.internal.util.$MapMaker$StrategyImpl.compute(MapMaker.java:419) > at > com.google.inject.internal.util.$CustomConcurrentHashMap$ComputingImpl.get(CustomConcurrentHashMap.java:2041) > at > com.google.inject.internal.util.$StackTraceElements.forMember(StackTraceElements.java:53) > at > com.google.inject.internal.Errors.formatInjectionPoint(Errors.java:716) > at > com.google.inject.internal.Errors.formatSource(Errors.java:678) > at com.google.inject.internal.Errors.format(Errors.java:555) > at > com.google.inject.CreationException.getMessage(CreationException.java:48) > at java.lang.Throwable.getLocalizedMessage(Throwable.java:391) > at java.lang.Throwable.toString(Throwable.java:480) > at java.lang.Throwable.<init>(Throwable.java:311) > at java.lang.Exception.<init>(Exception.java:102) > at java.lang.RuntimeException.<init>(RuntimeException.java:96) > at > org.openjdk.jmh.runner.BenchmarkException.<init>(BenchmarkException.java:34) > at > org.openjdk.jmh.runner.BenchmarkHandler.runIteration(BenchmarkHandler.java:438) > at > org.openjdk.jmh.runner.BaseRunner.runBenchmark(BaseRunner.java:263) > at > org.openjdk.jmh.runner.BaseRunner.runBenchmark(BaseRunner.java:235) > at > org.openjdk.jmh.runner.BaseRunner.doSingle(BaseRunner.java:142) > at > org.openjdk.jmh.runner.BaseRunner.runBenchmarksForked(BaseRunner.java:76) > at org.openjdk.jmh.runner.ForkedRunner.run(ForkedRunner.java:72) > at org.openjdk.jmh.runner.ForkedMain.main(ForkedMain.java:84) > Caused by: java.lang.ArrayIndexOutOfBoundsException: 44204 > at com.google.inject.internal.asm.$ClassReader.<init>(Unknown > Source) > at com.google.inject.internal.asm.$ClassReader.<init>(Unknown > Source) > at 
com.google.inject.internal.asm.$ClassReader.<init>(Unknown > Source) > at > com.google.inject.internal.util.$LineNumbers.<init>(LineNumbers.java:62) > at > com.google.inject.internal.util.$StackTraceElements$1.apply(StackTraceElements.java:36) > at > com.google.inject.internal.util.$StackTraceElements$1.apply(StackTraceElements.java:33) > at > com.google.inject.internal.util.$MapMaker$StrategyImpl.compute(MapMaker.java:549) > ... 20 more > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1789) Incorrect --mesos_containerizer_path value results in thermos failure loop
[ https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570354#comment-15570354 ] Zameer Manji commented on AURORA-1789: -- I have updated the title and assignee to reflect reality. Thanks for investigating and resolving this yourself, [~jpinkul]! > Incorrect --mesos_containerizer_path value results in thermos failure loop > -- > > Key: AURORA-1789 > URL: https://issues.apache.org/jira/browse/AURORA-1789 > Project: Aurora > Issue Type: Bug > Components: Executor >Affects Versions: 0.16.0 >Reporter: Justin Pinkul >Assignee: Justin Pinkul > > When using the Mesos containerizer with namespaces/pid isolator and a Docker > image the Thermos executor is unable to launch processes. The executor tries > to fork the process then is unable to locate the process after the fork. > {code:title=thermos_runner.INFO} > I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start) > I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: > ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, > coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, > fork_time=1475789782.842882) > I1006 21:37:22.931456 75 helper.py:153] Coordinator BigBrother start [pid: > 1144] completed. > I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an > abnormal termination > I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start) > I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: > ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, > coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, > fork_time=1475789842.935872) > I1006 21:38:23.025332 75 helper.py:153] Coordinator BigBrother start [pid: > 1157] completed. 
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an > abnormal termination > I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start) > I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: > ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, > coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, > fork_time=1475789903.029694) > I1006 21:39:23.118841 75 helper.py:153] Coordinator BigBrother start [pid: > 1170] completed. > I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an > abnormal termination > I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start) > I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: > ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, > coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, > fork_time=1475789963.123206) > I1006 21:40:23.212711 75 helper.py:153] Coordinator BigBrother start [pid: > 1183] completed. > I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an > abnormal termination > I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start) > I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: > ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, > coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, > fork_time=1475790023.21709) > I1006 21:41:23.307157 75 helper.py:153] Coordinator BigBrother start [pid: > 1196] completed. 
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an > abnormal termination > I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start) > I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: > ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, > coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, > fork_time=1475790083.311512) > I1006 21:42:23.399893 75 helper.py:153] Coordinator BigBrother start [pid: > 1209] completed. > I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an > abnormal termination > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
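Since the retitled root cause is a bad --mesos_containerizer_path value, one way to avoid this fork/LOST loop is to validate the path once at executor startup and fail fast. This is a hypothetical sketch of that idea, not the fix that actually landed:

```python
import os

def validate_containerizer_path(path):
    """Return the path if it points at an executable file; raise otherwise.

    Called once at startup, this turns an endless fork/LOST cycle into a
    single, clearly attributed startup failure.
    """
    if not os.path.isfile(path):
        raise RuntimeError("mesos_containerizer_path does not exist: %s" % path)
    if not os.access(path, os.X_OK):
        raise RuntimeError("mesos_containerizer_path is not executable: %s" % path)
    return path
```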
[jira] [Updated] (AURORA-1789) Incorrect --mesos_containerizer_path value results in thermos failure loop
[ https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji updated AURORA-1789: - Summary: Incorrect --mesos_containerizer_path value results in thermos failure loop (was: namespaces/pid isolator causes lost process) > Incorrect --mesos_containerizer_path value results in thermos failure loop > -- > > Key: AURORA-1789 > URL: https://issues.apache.org/jira/browse/AURORA-1789 > Project: Aurora > Issue Type: Bug > Components: Executor >Affects Versions: 0.16.0 >Reporter: Justin Pinkul >Assignee: Zameer Manji > > When using the Mesos containerizer with namespaces/pid isolator and a Docker > image the Thermos executor is unable to launch processes. The executor tries > to fork the process then is unable to locate the process after the fork. > {code:title=thermos_runner.INFO} > I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start) > I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: > ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, > coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, > fork_time=1475789782.842882) > I1006 21:37:22.931456 75 helper.py:153] Coordinator BigBrother start [pid: > 1144] completed. > I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an > abnormal termination > I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start) > I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: > ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, > coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, > fork_time=1475789842.935872) > I1006 21:38:23.025332 75 helper.py:153] Coordinator BigBrother start [pid: > 1157] completed. 
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an > abnormal termination > I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start) > I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: > ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, > coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, > fork_time=1475789903.029694) > I1006 21:39:23.118841 75 helper.py:153] Coordinator BigBrother start [pid: > 1170] completed. > I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an > abnormal termination > I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start) > I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: > ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, > coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, > fork_time=1475789963.123206) > I1006 21:40:23.212711 75 helper.py:153] Coordinator BigBrother start [pid: > 1183] completed. > I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an > abnormal termination > I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start) > I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: > ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, > coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, > fork_time=1475790023.21709) > I1006 21:41:23.307157 75 helper.py:153] Coordinator BigBrother start [pid: > 1196] completed. 
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an > abnormal termination > I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start) > I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: > ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, > coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, > fork_time=1475790083.311512) > I1006 21:42:23.399893 75 helper.py:153] Coordinator BigBrother start [pid: > 1209] completed. > I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an > abnormal termination > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (AURORA-1785) Populate curator latches with scheduler information
[ https://issues.apache.org/jira/browse/AURORA-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570275#comment-15570275 ] Zameer Manji commented on AURORA-1785: -- I don't think it's "too much"; it is exactly what the leader would advertise. > Populate curator latches with scheduler information > --- > > Key: AURORA-1785 > URL: https://issues.apache.org/jira/browse/AURORA-1785 > Project: Aurora > Issue Type: Task >Reporter: Zameer Manji >Assignee: Jing Chen >Priority: Minor > Labels: newbie > > If you look at the mesos ZK node for leader election you see something like > this: > {noformat} > u'json.info_000104', > u'json.info_000102', > u'json.info_000101', > u'json.info_98', > u'json.info_97' > {noformat} > Each of these nodes contains data about the machine contending for > leadership. It is a JSON serialized {{MasterInfo}} protobuf. This means an > operator can inspect who is contending for leadership by checking the content > of the nodes. > When you check the aurora ZK node you see something like this: > {noformat} > u'_c_2884a0d3-b5b0-4445-b8d6-b271a6df6220-latch-000774', > u'_c_86a21335-c5a2-4bcb-b471-4ce128b67616-latch-000776', > u'_c_a4f8b0f7-d063-4df2-958b-7b3e6f666a95-latch-000775', > u'_c_120cd9da-3bc1-495b-b02f-2142fb22c0a0-latch-000784', > u'_c_46547c31-c5c2-4fb1-8a53-237e3cb0292f-latch-000780', > u'member_000781' > {noformat} > Only the leader node contains information. The curator latches contain no > information. It is not possible to figure out which machines are contending > for leadership purely from ZK. > I think we should attach data to the latches like mesos. > Being able to do this is invaluable to debug issues if an extra master is > added to the cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
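If the latch nodes carried the same JSON payload the leader advertises, an operator could enumerate contenders directly from ZK, just as with Mesos. A sketch under that assumption (the path and payload format are hypothetical; the client is anything with kazoo-style get_children/get methods, e.g. kazoo's KazooClient):

```python
import json

def list_contenders(zk, path="/aurora/scheduler"):
    """Return {latch_node: decoded payload or None} for each child znode.

    Assumes each contender's latch node would store JSON-serialized host
    info; nodes with no data map to None, reproducing today's blind spot.
    """
    contenders = {}
    for child in sorted(zk.get_children(path)):
        data, _stat = zk.get("%s/%s" % (path, child))
        contenders[child] = json.loads(data) if data else None
    return contenders
```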
[jira] [Created] (AURORA-1792) Executor does not log full task information.
Zameer Manji created AURORA-1792: Summary: Executor does not log full task information. Key: AURORA-1792 URL: https://issues.apache.org/jira/browse/AURORA-1792 Project: Aurora Issue Type: Bug Reporter: Zameer Manji I launched a task that has an {{initial_interval_secs}} in the health check config. However the log contains no information about this field: {noformat} $ grep "initial_interval_secs" __main__.log {noformat} We should log the entire ExecutorInfo blob. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
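A minimal sketch of the proposed fix (illustrative only, not the executor's actual logging code): serialize the whole config blob once at launch so any field, including initial_interval_secs, is greppable in __main__.log:

```python
import json
import logging

def log_full_config(config, log=logging.getLogger(__name__)):
    """Log the complete task/executor config instead of selected fields."""
    blob = json.dumps(config, sort_keys=True)
    log.info("Launching with full config: %s", blob)
    return blob
```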
[jira] [Updated] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zameer Manji updated AURORA-1791: - Description: The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] is not backwards compatible. The last section of the commit {quote} 4. Modified the Health Checker and redefined the meaning initial_interval_secs. {quote} has serious, unintended consequences. Consider the following health check config: {noformat} initial_interval_secs: 10 interval_secs: 5 max_consecutive_failures: 1 {noformat} On the 0.16.0 executor, no health checking will occur for the first 10 seconds. Here the earliest a task can cause failure is at the 10th second. On master, health checking starts right away which means the task can fail at the first second since {{max_consecutive_failures}} is set to 1. This is not backwards compatible and needs to be fixed. I think a good solution would be to revert the meaning change to initial_interval_secs and have the task transition into RUNNING when {{max_consecutive_successes}} is met. An investigation shows {{initial_interval_secs}} was set to 5 but the task failed health checks right away: {noformat} D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. Performing health check. D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures counter. D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired. W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum consecutive successes. {noformat} was: The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] is not backwards compatible. The last section of the commit {quote} 4. Modified the Health Checker and redefined the meaning initial_interval_secs. {quote} has serious, unintended consequences. 
Consider the following health check config: {noformat} initial_interval_secs: 10 interval_secs: 5 max_consecutive_failures: 1 {noformat} On the 0.16.0 executor, no health checking will occur for the first 10 seconds. Here the earliest a task can cause failure is at the 10th second. On master, health checking starts right away which means the task can fail at the first second since {{max_consecutive_failures}} is set to 1. This is not backwards compatible and needs to be fixed. I think a good solution would be to revert the meaning change to initial_interval_secs and have the task transition into RUNNING when {{max_consecutive_successes}} is met. > Commit ca683 is not backwards compatible. > - > > Key: AURORA-1791 > URL: https://issues.apache.org/jira/browse/AURORA-1791 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Kai Huang >Priority: Blocker > > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] > is not backwards compatible. The last section of the commit > {quote} > 4. Modified the Health Checker and redefined the meaning > initial_interval_secs. > {quote} > has serious, unintended consequences. > Consider the following health check config: > {noformat} > initial_interval_secs: 10 > interval_secs: 5 > max_consecutive_failures: 1 > {noformat} > On the 0.16.0 executor, no health checking will occur for the first 10 > seconds. Here the earliest a task can cause failure is at the 10th second. > On master, health checking starts right away which means the task can fail at > the first second since {{max_consecutive_failures}} is set to 1. > This is not backwards compatible and needs to be fixed. > I think a good solution would be to revert the meaning change to > initial_interval_secs and have the task transition into RUNNING when > {{max_consecutive_successes}} is met. 
> An investigation shows {{initial_interval_secs}} was set to 5 but the task > failed health checks right away: > {noformat} > D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. > Performing health check. > D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures > counter. > D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired. > W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum > consecutive successes. > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
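The compatibility break described above can be made concrete with a small simulation. This is a hypothetical model of the two behaviors, not the actual health_checker.py logic: under the 0.16.0 semantics no check runs during initial_interval_secs, while under the new semantics checks start immediately, so with max_consecutive_failures set to 1 a slow-starting task can be killed on its very first check.

```python
def earliest_possible_failure(initial_interval_secs, interval_secs,
                              max_consecutive_failures,
                              checks_start_immediately):
    """Earliest second at which a task can be declared failed."""
    first_check = 0 if checks_start_immediately else initial_interval_secs
    # Each additional required failure costs another interval.
    return first_check + (max_consecutive_failures - 1) * interval_secs
```

For the ticket's config (initial_interval_secs: 10, interval_secs: 5, max_consecutive_failures: 1), the old semantics give an earliest failure at second 10, the new semantics at second 0.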
[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.
[ https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1557#comment-1557 ] Zameer Manji commented on AURORA-1791: -- Note, I could be wrong here but this was deployed to a cluster and tasks that were healthy before started to fail. > Commit ca683 is not backwards compatible. > - > > Key: AURORA-1791 > URL: https://issues.apache.org/jira/browse/AURORA-1791 > Project: Aurora > Issue Type: Bug >Reporter: Zameer Manji >Assignee: Kai Huang >Priority: Blocker > > The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | > https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] > is not backwards compatible. The last section of the commit > {quote} > 4. Modified the Health Checker and redefined the meaning > initial_interval_secs. > {quote} > has serious, unintended consequences. > Consider the following health check config: > {noformat} > initial_interval_secs: 10 > interval_secs: 5 > max_consecutive_failures: 1 > {noformat} > On the 0.16.0 executor, no health checking will occur for the first 10 > seconds. Here the earliest a task can cause failure is at the 10th second. > On master, health checking starts right away which means the task can fail at > the first second since {{max_consecutive_failures}} is set to 1. > This is not backwards compatible and needs to be fixed. > I think a good solution would be to revert the meaning change to > initial_interval_secs and have the task transition into RUNNING when > {{max_consecutive_successes}} is met. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AURORA-1791) Commit ca683 is not backwards compatible.
Zameer Manji created AURORA-1791: Summary: Commit ca683 is not backwards compatible. Key: AURORA-1791 URL: https://issues.apache.org/jira/browse/AURORA-1791 Project: Aurora Issue Type: Bug Reporter: Zameer Manji Assignee: Kai Huang Priority: Blocker The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9] is not backwards compatible. The last section of the commit {quote} 4. Modified the Health Checker and redefined the meaning initial_interval_secs. {quote} has serious, unintended consequences. Consider the following health check config: {noformat} initial_interval_secs: 10 interval_secs: 5 max_consecutive_failures: 1 {noformat} On the 0.16.0 executor, no health checking will occur for the first 10 seconds. Here the earliest a task can cause failure is at the 10th second. On master, health checking starts right away which means the task can fail at the first second since {{max_consecutive_failures}} is set to 1. This is not backwards compatible and needs to be fixed. I think a good solution would be to revert the meaning change to initial_interval_secs and have the task transition into RUNNING when {{max_consecutive_successes}} is met. -- This message was sent by Atlassian JIRA (v6.3.4#6332)