[jira] [Commented] (AURORA-1897) Remove task length restrictions.

2017-06-12 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1897?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16047016#comment-16047016
 ] 

Zameer Manji commented on AURORA-1897:
--

{noformat}
commit 40d9d4dbec86cb4a17e281dc10ede25e83613eff
Author: Zameer Manji 
Date:   Mon Jun 12 13:14:18 2017 -0700

Remove restriction on task id length.

To work around an old Mesos bug (MESOS-691) we would reject jobs that
resulted in Mesos task ids longer than 255 characters. This is because
Mesos used to use the task id to generate the cgroup path. Now that Mesos
uses its own id, we no longer need to work around this bug.

This removes the restriction in the API layer. This is useful because some
users may have very long role and service names that caused task ids to go
over this limit.

Bugs closed: AURORA-1897

Reviewed at https://reviews.apache.org/r/59957/

 .../scheduler/thrift/SchedulerThriftInterface.java | 22 -
 .../thrift/SchedulerThriftInterfaceTest.java   | 99 --
 2 files changed, 121 deletions(-)
{noformat}

> Remove task length restrictions.
> 
>
> Key: AURORA-1897
> URL: https://issues.apache.org/jira/browse/AURORA-1897
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>Priority: Minor
>
> Currently we restrict the total length of a task's name because of a Mesos bug:
> {noformat}
>   // This number is derived from the maximum file name length limit on most
>   // UNIX systems, less the number of characters we've observed being added
>   // by mesos for the executor ID, prefix, and delimiters.
>   @VisibleForTesting
>   static final int MAX_TASK_ID_LENGTH = 255 - 90;
> 
>   // TODO(maximk): This is a short-term hack to stop the bleeding from
>   //   https://issues.apache.org/jira/browse/MESOS-691
>   if (taskIdGenerator.generate(task, totalInstances).length() > MAX_TASK_ID_LENGTH) {
>     throw new TaskValidationException(
>         "Task ID is too long, please shorten your role or job name.");
>   }
> {noformat} 
> However [~codyg] recently 
> [asked|https://lists.apache.org/thread.html/ca92420fe6394d6467f70989e1ffadac23775e84cf7356ff8c9efdd5@%3Cdev.mesos.apache.org%3E]
 on the Mesos mailing list about MESOS-691 and learned that it is no longer 
> valid.
> We should remove this restriction.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (AURORA-1933) Scheduler can process rescind before offer

2017-06-05 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1933:


 Summary: Scheduler can process rescind before offer
 Key: AURORA-1933
 URL: https://issues.apache.org/jira/browse/AURORA-1933
 Project: Aurora
  Issue Type: Bug
Reporter: Zameer Manji
Assignee: Zameer Manji


I observed the following in production:
{noformat}
Jun  6 00:31:32 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:32.510 
[Thread-77638, MesosCallbackHandler$MesosCallbackHandlerImpl:229] Offer 
rescinded: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
Jun  6 00:31:32 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:32.903 
[SchedulerImpl-0, MesosCallbackHandler$MesosCallbackHandlerImpl:211] Received 
offer: 81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552
Jun  6 00:31:34 compute1159-dca1 aurora-scheduler[23675]: I0606 00:31:34.815 
[TaskGroupBatchWorker, VersionedSchedulerDriverService:123] Accepting offer 
81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 with ops [LAUNCH]
{noformat}

Notice that the rescind was processed before the offer arrived. This means the 
offer ends up in offer storage, but using it is invalid: any task launched with 
it will fail with {{Task launched with invalid offers: Offer 
81e04cbd-9bce-41cf-bd94-38c911f255e4-O142359552 is no longer valid}}.
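
One way to handle this ordering (a minimal sketch only; the handler and 
{{OfferManager}} method names are assumptions, not the actual fix) is to 
remember rescinded offer IDs so a late-arriving offer can be dropped on receipt:

{noformat}
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import org.apache.mesos.v1.Protos;

// Rescinds that may have arrived before their offers; consulted on every offer.
// A real fix would also expire entries to avoid unbounded growth.
private final Set<Protos.OfferID> recentlyRescinded =
    Collections.newSetFromMap(new ConcurrentHashMap<>());

public void handleRescind(Protos.OfferID offerId) {
  recentlyRescinded.add(offerId);
  offerManager.cancelOffer(offerId);  // no-op if the offer never arrived
}

public void handleOffer(Protos.Offer offer) {
  if (recentlyRescinded.remove(offer.getId())) {
    // The rescind raced ahead of the offer; never admit it to offer storage.
    return;
  }
  offerManager.addOffer(offer);
}
{noformat}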



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1914) Unable to specify multiple volumes per task.

2017-03-29 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1914:


 Summary: Unable to specify multiple volumes per task.
 Key: AURORA-1914
 URL: https://issues.apache.org/jira/browse/AURORA-1914
 Project: Aurora
  Issue Type: Bug
Reporter: Zameer Manji


There is an artificial constraint in the schema which prevents multiple volumes 
per task. This was not caught before in testing. Removing the constraint should 
solve the problem.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (AURORA-1914) Unable to specify multiple volumes per task.

2017-03-29 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1914?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji reassigned AURORA-1914:


Assignee: Zameer Manji

> Unable to specify multiple volumes per task.
> 
>
> Key: AURORA-1914
> URL: https://issues.apache.org/jira/browse/AURORA-1914
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>
> There is an artificial constraint in the schema which prevents multiple 
> volumes per task. This was not caught before in testing. Removing the 
> constraint should solve the problem.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1911) HTTP Scheduler Driver does not reliably re subscribe

2017-03-29 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1911?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15948114#comment-15948114
 ] 

Zameer Manji commented on AURORA-1911:
--

First part here: https://reviews.apache.org/r/58053/

> HTTP Scheduler Driver does not reliably re subscribe
> 
>
> Key: AURORA-1911
> URL: https://issues.apache.org/jira/browse/AURORA-1911
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>
> I observed this issue in a large production cluster during a period of Mesos 
> Master instability:
> 1. Mesos master crashes or restarts.
> 2. {{V1Mesos}} driver detects this and reconnects.
> 3. Aurora does the {{SUBSCRIBE}} call again.
> 4. The {{SUBSCRIBE}} call fails silently in the driver.
> 5. All future calls are silently dropped by the driver.
> 6. Aurora has no offers because it is not subscribed.
> Logs:
> {noformat}
> I0328 19:40:55.473546 101404 scheduler.cpp:353] Connected with the master at 
> http://10.162.14.30:5050/master/api/v1/scheduler
> W0328 19:40:55.475898 101410 scheduler.cpp:583] Received '503 Service 
> Unavailable' () for SUBSCRIBE
> 
> W0328 19:40:58.862393 101398 scheduler.cpp:508] Dropping KILL: Scheduler is 
> in state CONNECTED
> 
> W0328 19:41:14.588474 101394 scheduler.cpp:508] Dropping KILL: Scheduler is 
> in state CONNECTED
> 
> W0328 19:41:37.763464 101402 scheduler.cpp:508] Dropping KILL: Scheduler is 
> in state CONNECTED
> ...
> {noformat}
> To fix this, the {{VersionedSchedulerDriver}} needs to do two things:
> 1. Block calls when unsubscribed, not just disconnected.
> 2. Retry the {{SUBSCRIBE}} call repeatedly with exponential backoff.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (AURORA-1912) DbSnapShot may remove enum values

2017-03-29 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji reassigned AURORA-1912:


Assignee: Zameer Manji

> DbSnapShot may remove enum values
> -
>
> Key: AURORA-1912
> URL: https://issues.apache.org/jira/browse/AURORA-1912
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>
> The DB snapshot restore may truncate enum tables and cause referential 
> integrity issues. From the code, it restores from the SQL dump by first 
> dropping all tables:
> {noformat}
> try (Connection c = ((DataSource) store.getUnsafeStoreAccess()).getConnection()) {
>   LOG.info("Dropping all tables");
>   try (PreparedStatement drop = c.prepareStatement("DROP ALL OBJECTS")) {
>     drop.executeUpdate();
>   }
> {noformat}
> However a freshly started leader will have some data in there from preparing 
> the storage:
> {noformat}
>   @Override
>   @Transactional
>   protected void startUp() throws IOException {
>     Configuration configuration = sessionFactory.getConfiguration();
>     String createStatementName = "create_tables";
>     configuration.setMapUnderscoreToCamelCase(true);
>     // The ReuseExecutor will cache jdbc Statements with equivalent SQL,
>     // improving performance slightly when redundant queries are made.
>     configuration.setDefaultExecutorType(ExecutorType.REUSE);
>     addMappedStatement(
>         configuration,
>         createStatementName,
>         CharStreams.toString(new InputStreamReader(
>             DbStorage.class.getResourceAsStream("schema.sql"),
>             StandardCharsets.UTF_8)));
>     try (SqlSession session = sessionFactory.openSession()) {
>       session.update(createStatementName);
>     }
>     for (CronCollisionPolicy policy : CronCollisionPolicy.values()) {
>       enumValueMapper.addEnumValue("cron_policies", policy.getValue(), policy.name());
>     }
>     for (MaintenanceMode mode : MaintenanceMode.values()) {
>       enumValueMapper.addEnumValue("maintenance_modes", mode.getValue(), mode.name());
>     }
>     for (JobUpdateStatus status : JobUpdateStatus.values()) {
>       enumValueMapper.addEnumValue("job_update_statuses", status.getValue(), status.name());
>     }
>     for (JobUpdateAction action : JobUpdateAction.values()) {
>       enumValueMapper.addEnumValue("job_instance_update_actions", action.getValue(), action.name());
>     }
>     for (ScheduleStatus status : ScheduleStatus.values()) {
>       enumValueMapper.addEnumValue("task_states", status.getValue(), status.name());
>     }
>     for (ResourceType resourceType : ResourceType.values()) {
>       enumValueMapper.addEnumValue("resource_types", resourceType.getValue(), resourceType.name());
>     }
>     for (Mode mode : Mode.values()) {
>       enumValueMapper.addEnumValue("volume_modes", mode.getValue(), mode.name());
>     }
>     createPoolMetrics();
>   }
> {noformat}
> Consider the case where we add a new value to an existing enum. This means 
> restoring from a snapshot will not allow us to have that value in the enum 
> table. 
> To fix this, we could add a migration for every enum value we add. However, 
> it seems better to update the enum tables after we restore from a snapshot.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1912) DbSnapShot may remove enum values

2017-03-29 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1912:


 Summary: DbSnapShot may remove enum values
 Key: AURORA-1912
 URL: https://issues.apache.org/jira/browse/AURORA-1912
 Project: Aurora
  Issue Type: Bug
Reporter: Zameer Manji


The DB snapshot restore may truncate enum tables and cause referential integrity 
issues. From the code, it restores from the SQL dump by first dropping all 
tables:
{noformat}
try (Connection c = ((DataSource) store.getUnsafeStoreAccess()).getConnection()) {
  LOG.info("Dropping all tables");
  try (PreparedStatement drop = c.prepareStatement("DROP ALL OBJECTS")) {
    drop.executeUpdate();
  }
{noformat}

However a freshly started leader will have some data in there from preparing 
the storage:
{noformat}
  @Override
  @Transactional
  protected void startUp() throws IOException {
    Configuration configuration = sessionFactory.getConfiguration();
    String createStatementName = "create_tables";
    configuration.setMapUnderscoreToCamelCase(true);

    // The ReuseExecutor will cache jdbc Statements with equivalent SQL,
    // improving performance slightly when redundant queries are made.
    configuration.setDefaultExecutorType(ExecutorType.REUSE);

    addMappedStatement(
        configuration,
        createStatementName,
        CharStreams.toString(new InputStreamReader(
            DbStorage.class.getResourceAsStream("schema.sql"),
            StandardCharsets.UTF_8)));

    try (SqlSession session = sessionFactory.openSession()) {
      session.update(createStatementName);
    }

    for (CronCollisionPolicy policy : CronCollisionPolicy.values()) {
      enumValueMapper.addEnumValue("cron_policies", policy.getValue(), policy.name());
    }

    for (MaintenanceMode mode : MaintenanceMode.values()) {
      enumValueMapper.addEnumValue("maintenance_modes", mode.getValue(), mode.name());
    }

    for (JobUpdateStatus status : JobUpdateStatus.values()) {
      enumValueMapper.addEnumValue("job_update_statuses", status.getValue(), status.name());
    }

    for (JobUpdateAction action : JobUpdateAction.values()) {
      enumValueMapper.addEnumValue("job_instance_update_actions", action.getValue(), action.name());
    }

    for (ScheduleStatus status : ScheduleStatus.values()) {
      enumValueMapper.addEnumValue("task_states", status.getValue(), status.name());
    }

    for (ResourceType resourceType : ResourceType.values()) {
      enumValueMapper.addEnumValue("resource_types", resourceType.getValue(), resourceType.name());
    }

    for (Mode mode : Mode.values()) {
      enumValueMapper.addEnumValue("volume_modes", mode.getValue(), mode.name());
    }

    createPoolMetrics();
  }
{noformat}

Consider the case where we add a new value to an existing enum. This means 
restoring from a snapshot will not allow us to have that value in the enum 
table. 

To fix this, we could add a migration for every enum value we add. However, it 
seems better to update the enum tables after we restore from a snapshot, as in 
the sketch below.
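
A minimal sketch of that option (the placement and method name are assumed; the 
{{enumValueMapper}} calls are the same ones {{startUp}} already makes):

{noformat}
// After restoring the dump, re-insert every known enum value so values added
// since the snapshot was taken are present again.
private void populateEnumTables() {
  for (CronCollisionPolicy policy : CronCollisionPolicy.values()) {
    enumValueMapper.addEnumValue("cron_policies", policy.getValue(), policy.name());
  }
  for (MaintenanceMode mode : MaintenanceMode.values()) {
    enumValueMapper.addEnumValue("maintenance_modes", mode.getValue(), mode.name());
  }
  // ... likewise for job_update_statuses, job_instance_update_actions,
  // task_states, resource_types, and volume_modes, as in startUp() above ...
}

// In the restore path, after "DROP ALL OBJECTS" and re-importing the dump:
populateEnumTables();
{noformat}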



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1910) framework_registered metric isn't reset when scheduler disconnects

2017-03-28 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1910:
-
Summary: framework_registered metric isn't reset when scheduler disconnects 
 (was: framework_registered metric doesn't reset when scheduler disconnects)

> framework_registered metric isn't reset when scheduler disconnects
> --
>
> Key: AURORA-1910
> URL: https://issues.apache.org/jira/browse/AURORA-1910
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>
> Right now the {{framework_registered}} metric transitions from 0 -> 1 when 
> the scheduler registers successfully the first time. It never transitions 
> from 1 -> 0 when it loses a connection.
> This metric is already a gauge of an {{AtomicBoolean}}. We should adjust the 
> gauge as the scheduler loses registration and re-registers.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1911) HTTP Scheduler Driver does not reliably re subscribe

2017-03-28 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1911:


 Summary: HTTP Scheduler Driver does not reliably re subscribe
 Key: AURORA-1911
 URL: https://issues.apache.org/jira/browse/AURORA-1911
 Project: Aurora
  Issue Type: Bug
Reporter: Zameer Manji
Assignee: Zameer Manji


I observed this issue in a large production cluster during a period of Mesos 
Master instability:
1. Mesos master crashes or restarts.
2. {{V1Mesos}} driver detects this and reconnects.
3. Aurora does the {{SUBSCRIBE}} call again.
4. The {{SUBSCRIBE}} call fails silently in the driver.
5. All future calls are silently dropped by the driver.
6. Aurora has no offers because it is not subscribed.

Logs:

{noformat}
I0328 19:40:55.473546 101404 scheduler.cpp:353] Connected with the master at 
http://10.162.14.30:5050/master/api/v1/scheduler
W0328 19:40:55.475898 101410 scheduler.cpp:583] Received '503 Service 
Unavailable' () for SUBSCRIBE

W0328 19:40:58.862393 101398 scheduler.cpp:508] Dropping KILL: Scheduler is in 
state CONNECTED

W0328 19:41:14.588474 101394 scheduler.cpp:508] Dropping KILL: Scheduler is in 
state CONNECTED

W0328 19:41:37.763464 101402 scheduler.cpp:508] Dropping KILL: Scheduler is in 
state CONNECTED
...
{noformat}

To fix this, the {{VersionedSchedulerDriver}} needs to do two things:
1. Block calls when unsubscribed, not just disconnected.
2. Retry the {{SUBSCRIBE}} call repeatedly with exponential backoff.
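
A minimal sketch of item 2 (field names and wiring assumed, not the actual 
patch): keep resending {{SUBSCRIBE}} with a capped exponential backoff until the 
driver is subscribed again.

{noformat}
private void resubscribeWithBackoff() throws InterruptedException {
  long backoffMs = 1_000;            // initial delay
  final long maxBackoffMs = 60_000;  // cap
  while (!subscribed.get()) {        // assumed AtomicBoolean, set on SUBSCRIBED
    mesos.send(Call.newBuilder()
        .setType(Call.Type.SUBSCRIBE)
        .setSubscribe(Call.Subscribe.newBuilder().setFrameworkInfo(frameworkInfo))
        .build());
    Thread.sleep(backoffMs);
    backoffMs = Math.min(backoffMs * 2, maxBackoffMs);
  }
}
{noformat}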



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1910) framework_registered metric doesn't reset when scheduler disconnects

2017-03-28 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1910:


 Summary: framework_registered metric doesn't reset when scheduler 
disconnects
 Key: AURORA-1910
 URL: https://issues.apache.org/jira/browse/AURORA-1910
 Project: Aurora
  Issue Type: Bug
Reporter: Zameer Manji


Right now the {{framework_registered}} metric transitions from 0 -> 1 when the 
scheduler registers successfully the first time. It never transitions from 1 -> 
0 when it loses a connection.

This metric is already a gauge of an {{AtomicBoolean}}. We should adjust the 
gauge as the scheduler loses registration and re-registers.
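
A minimal sketch (callback names assumed): flip the existing {{AtomicBoolean}} 
behind the gauge on both transitions instead of only on the first registration.

{noformat}
import java.util.concurrent.atomic.AtomicBoolean;

// Exported elsewhere as the framework_registered gauge.
private final AtomicBoolean registered = new AtomicBoolean(false);

public void handleRegistered(Protos.FrameworkID frameworkId) {
  registered.set(true);   // 0 -> 1 on (re-)registration
}

public void handleDisconnected() {
  registered.set(false);  // 1 -> 0 on lost connection; previously missing
}
{noformat}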



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (AURORA-1910) framework_registered metric doesn't reset when scheduler disconnects

2017-03-28 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji reassigned AURORA-1910:


Assignee: Zameer Manji

> framework_registered metric doesn't reset when scheduler disconnects
> 
>
> Key: AURORA-1910
> URL: https://issues.apache.org/jira/browse/AURORA-1910
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>
> Right now the {{framework_registered}} metric transitions from 0 -> 1 when 
> the scheduler registers successfully the first time. It never transitions 
> from 1 -> 0 when it loses a connection.
> This metric is already a gauge of an {{AtomicBoolean}}. We should adjust the 
> gauge as the scheduler loses registration and re-registers.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1908) Short-circuit preemption filtering when a Veto applies to entire host

2017-03-22 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15937282#comment-15937282
 ] 

Zameer Manji commented on AURORA-1908:
--

We label {{Vetos}} with a {{VetoType}}, which is either {{STATIC}} or {{DYNAMIC}}.

To me this can be generalized: short-circuit if all of the vetoes are 
{{STATIC}}.
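
A minimal sketch of that generalization (accessor names assumed, not the actual 
implementation): if every veto for the host is static, no choice of victims on 
that host can help, so we can skip to the next host.

{noformat}
boolean hostCannotHelp = vetoes.stream()
    .allMatch(veto -> veto.getVetoType() == VetoType.STATIC);
if (hostCannotHelp) {
  return Optional.empty();  // move on to the next host to consider
}
{noformat}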

> Short-circuit preemption filtering when a Veto applies to entire host
> -
>
> Key: AURORA-1908
> URL: https://issues.apache.org/jira/browse/AURORA-1908
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Priority: Minor
>
> When matching a {{ResourceRequest}} against an {{UnusedResource}} in 
> {{PreemptionVictimFilter.filterPreemptionVictims}} there are 4 kinds of 
> {{Veto}}es that can be returned. 3 out of the 4 {{Veto}}es apply to the 
> entire host (namely {{DEDICATED_CONSTRAINT_MISMATCH}}, {{MAINTENANCE}}, 
> {{LIMIT_NOT_SATISFIED}} or {{CONSTRAINT_MISMATCH}}). In this case we can 
> short-circuit, return early, and move on to the next host to consider.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (AURORA-1905) Set "webui_url" field of FrameworkInfo

2017-03-16 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji reassigned AURORA-1905:


Assignee: Zameer Manji

https://reviews.apache.org/r/57708/

> Set "webui_url" field of FrameworkInfo
> --
>
> Key: AURORA-1905
> URL: https://issues.apache.org/jira/browse/AURORA-1905
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>
> Aurora should set the {{webui_url}} field of FrameworkInfo so the Mesos UI 
> can link to the current leader.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1906) aurora update info command should print out update metadata

2017-03-15 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1906:


 Summary: aurora update info command should print out update 
metadata
 Key: AURORA-1906
 URL: https://issues.apache.org/jira/browse/AURORA-1906
 Project: Aurora
  Issue Type: Bug
Reporter: Zameer Manji


AURORA-1711 added metadata fields to the update request.

The CLI should allow users to inspect that metadata.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1905) Set "webui_url" field of FrameworkInfo

2017-03-14 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1905:


 Summary: Set "webui_url" field of FrameworkInfo
 Key: AURORA-1905
 URL: https://issues.apache.org/jira/browse/AURORA-1905
 Project: Aurora
  Issue Type: Task
Reporter: Zameer Manji


Aurora should set the {{webui_url}} field of FrameworkInfo so the Mesos UI can 
link to the current leader.
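
A minimal sketch (how the leader URL is obtained is assumed; {{setWebuiUrl}} is 
the generated setter for the {{webui_url}} proto field):

{noformat}
FrameworkInfo frameworkInfo = baseFrameworkInfo.toBuilder()
    .setWebuiUrl("http://" + leaderHost + ":" + leaderHttpPort + "/")
    .build();
{noformat}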



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1904) Support Mesos Maintenance

2017-03-13 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1904:


 Summary: Support Mesos Maintenance
 Key: AURORA-1904
 URL: https://issues.apache.org/jira/browse/AURORA-1904
 Project: Aurora
  Issue Type: Task
Reporter: Zameer Manji
Priority: Minor


Support Mesos Maintenance primitives in Aurora per the design 
[doc|https://docs.google.com/document/d/1Z7dFAm6I1nrBE9S5WHw0D0LApBumkIbHrk0-ceoD2YI/edit#heading=h.n5tvzjaj9llx].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (AURORA-1904) Support Mesos Maintenance

2017-03-13 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1904?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji reassigned AURORA-1904:


Assignee: Zameer Manji

> Support Mesos Maintenance
> -
>
> Key: AURORA-1904
> URL: https://issues.apache.org/jira/browse/AURORA-1904
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>Priority: Minor
>
> Support Mesos Maintenance primitives in Aurora per the design 
> [doc|https://docs.google.com/document/d/1Z7dFAm6I1nrBE9S5WHw0D0LApBumkIbHrk0-ceoD2YI/edit#heading=h.n5tvzjaj9llx].



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1903) Allow for RootFs to be set for mesos filesystem tasks

2017-03-10 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15905912#comment-15905912
 ] 

Zameer Manji commented on AURORA-1903:
--

https://reviews.apache.org/r/57524/

> Allow for RootFs to be set for mesos filesystem tasks
> -
>
> Key: AURORA-1903
> URL: https://issues.apache.org/jira/browse/AURORA-1903
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Attachments: table.png
>
>
> Currently, when a TaskConfig is for a Mesos container task and has an image, 
> we place the image as a volume mounted at {{taskfs}} in the sandbox. Thermos, 
> or other executors, are launched outside the image and are then expected to 
> chroot into the {{taskfs}} directory.
> However I think it would be a fine addition to allow executors to set the 
> {{image}} property of the Mesos container instead of putting the image as a 
> volume. This enables some tasks to get around a limitation of the 
> MesosContainerizer where certain container paths must already exist in the 
> image and the host.
> See the 
> [documentation|http://mesos.apache.org/documentation/latest/docker-volume/] 
> for the table that describes this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1903) Allow for RootFs to be set for mesos filesystem tasks

2017-03-10 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1903:
-
Attachment: table.png

> Allow for RootFs to be set for mesos filesystem tasks
> -
>
> Key: AURORA-1903
> URL: https://issues.apache.org/jira/browse/AURORA-1903
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Zameer Manji
> Attachments: table.png
>
>
> Currently, when a TaskConfig is for a Mesos container task and has an image, 
> we place the image as a volume mounted at {{taskfs}} in the sandbox. Thermos, 
> or other executors, are launched outside the image and are then expected to 
> chroot into the {{taskfs}} directory.
> However I think it would be a fine addition to allow executors to set the 
> {{image}} property of the Mesos container instead of putting the image as a 
> volume. This enables some tasks to get around a limitation of the 
> MesosContainerizer where certain container paths must already exist in the 
> image and the host.
> See the 
> [documentation|http://mesos.apache.org/documentation/latest/docker-volume/] 
> for the table that describes this.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1903) Allow for RootFs to be set for mesos filesystem tasks

2017-03-10 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1903:


 Summary: Allow for RootFs to be set for mesos filesystem tasks
 Key: AURORA-1903
 URL: https://issues.apache.org/jira/browse/AURORA-1903
 Project: Aurora
  Issue Type: Task
Reporter: Zameer Manji
Assignee: Zameer Manji


Currently, when a TaskConfig is for a Mesos container task and has an image, we 
place the image as a volume mounted at {{taskfs}} in the sandbox. Thermos, or 
other executors, are launched outside the image and are then expected to chroot 
into the {{taskfs}} directory.

However I think it would be a fine addition to allow executors to set the 
{{image}} property of the Mesos container instead of putting the image as a 
volume. This enables some tasks to get around a limitation of the 
MesosContainerizer where certain container paths must already exist in the 
image and the host.

See the 
[documentation|http://mesos.apache.org/documentation/latest/docker-volume/] for 
the table that describes this.
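
A sketch of the two layouts (proto field names are from Mesos; the surrounding 
wiring and the {{image}} variable are assumed):

{noformat}
// Today: the image is attached as a read-only volume at taskfs, and the
// executor runs on the host filesystem, chrooting in later.
ContainerInfo asVolume = ContainerInfo.newBuilder()
    .setType(ContainerInfo.Type.MESOS)
    .addVolumes(Volume.newBuilder()
        .setContainerPath("taskfs")
        .setMode(Volume.Mode.RO)
        .setImage(image))
    .build();

// Proposed option: set the container image directly, so Mesos provisions the
// root filesystem and the executor runs inside it.
ContainerInfo asRootFs = ContainerInfo.newBuilder()
    .setType(ContainerInfo.Type.MESOS)
    .setMesos(ContainerInfo.MesosInfo.newBuilder().setImage(image))
    .build();
{noformat}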







--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1902) Docker containers without the newest OS fail to run

2017-03-08 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1902?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15901918#comment-15901918
 ] 

Zameer Manji commented on AURORA-1902:
--

This is a known flaw/limitation of Mesos and the DockerContainerizer. Mesos 
will copy/mount the executor into the Docker filesystem, meaning that the 
filesystem needs to be capable of launching the executor. In our case it needs 
to have Python 2.7 and the dependencies for libmesos.

Tasks launched with the MesosContainerizer do not suffer from this limitation.

> Docker containers without the newest OS fail to run
> -
>
> Key: AURORA-1902
> URL: https://issues.apache.org/jira/browse/AURORA-1902
> Project: Aurora
>  Issue Type: Bug
>  Components: Docker, Executor
>Affects Versions: 0.17.0
> Environment: Ubuntu: 16.04
> Mesos: 1.1.0
> Aurora: 0.17.0
> Dockerengine: 1.13.1
>Reporter: Mikhail Lesyk
>
> When trying to launch Docker containers, I got an error:
> {code}
> I0308 21:47:56.695737  3888 fetcher.cpp:498] Fetcher Info: 
> {"cache_directory":"\/tmp\/mesos\/fetch\/slaves\/7cbc133f-24ac-4937-aa28-09e8c81b647b-S7","items":[{"action":"BYPASS_CACHE","uri":{"executable":true,"extract":true,"value":"\/usr\/share\/aurora\/bin\/thermos_executor.pex"}}],"sandbox_directory":"\/var\/lib\/mesos\/slaves\/7cbc133f-24ac-4937-aa28-09e8c81b647b-S7\/frameworks\/47934424-623f-4fcb-9326-bf668149fc77-\/executors\/thermos-root-prod-test-0-a2c19f58-aa6c-45d8-a47f-8cf57dc0c261\/runs\/788d2f72-a6eb-4f3e-999c-17158e473661"}
> I0308 21:47:56.701079  3888 fetcher.cpp:409] Fetching URI 
> '/usr/share/aurora/bin/thermos_executor.pex'
> I0308 21:47:56.701162  3888 fetcher.cpp:250] Fetching directly into the 
> sandbox directory
> I0308 21:47:56.701225  3888 fetcher.cpp:187] Fetching URI 
> '/usr/share/aurora/bin/thermos_executor.pex'
> I0308 21:47:56.701282  3888 fetcher.cpp:167] Copying resource with command:cp 
> '/usr/share/aurora/bin/thermos_executor.pex' 
> '/var/lib/mesos/slaves/7cbc133f-24ac-4937-aa28-09e8c81b647b-S7/frameworks/47934424-623f-4fcb-9326-bf668149fc77-/executors/thermos-root-prod-test-0-a2c19f58-aa6c-45d8-a47f-8cf57dc0c261/runs/788d2f72-a6eb-4f3e-999c-17158e473661/thermos_executor.pex'
> I0308 21:47:56.730024  3888 fetcher.cpp:547] Fetched 
> '/usr/share/aurora/bin/thermos_executor.pex' to 
> '/var/lib/mesos/slaves/7cbc133f-24ac-4937-aa28-09e8c81b647b-S7/frameworks/47934424-623f-4fcb-9326-bf668149fc77-/executors/thermos-root-prod-test-0-a2c19f58-aa6c-45d8-a47f-8cf57dc0c261/runs/788d2f72-a6eb-4f3e-999c-17158e473661/thermos_executor.pex'
> WARNING: Your kernel does not support swap limit capabilities or the cgroup 
> is not mounted. Memory limited without swap.
> Traceback (most recent call last):
>   File "apache/aurora/executor/bin/thermos_executor_main.py", line 45, in 
> <module>
> from mesos.executor import MesosExecutorDriver
>   File 
> "/root/.pex/install/mesos.executor-1.1.0-py2.7-linux-x86_64.egg.47fa022c99c11c7faddf379cbfc46a25c5f215be/mesos.executor-1.1.0-py2.7-linux-x86_64.egg/mesos/executor/__init__.py",
>  line 17, in <module>
> from ._executor import MesosExecutorDriverImpl as MesosExecutorDriver
> ImportError: /usr/lib/x86_64-linux-gnu/libstdc++.so.6: version 
> `GLIBCXX_3.4.20' not found (required by 
> /root/.pex/install/mesos.executor-1.1.0-py2.7-linux-x86_64.egg.47fa022c99c11c7faddf379cbfc46a25c5f215be/mesos.executor-1.1.0-py2.7-linux-x86_64.egg/mesos/executor/_executor.so)
> twitter.common.app debug: Initializing: twitter.common.log (Logging 
> subsystem.)
> Writing log files to disk in /mnt/mesos/sandbox
> thermos_executor.pex: error: Could not load MesosExecutorDriver!
> twitter.common.app debug: main sys.exited
> twitter.common.app debug: Shutting application down.
> twitter.common.app debug: Running exit function for twitter.common.log 
> (Logging subsystem.)
> twitter.common.app debug: Finishing up module teardown.
> twitter.common.app debug:   Active thread: <_MainThread(MainThread, started 
> 140218447816512)>
> twitter.common.app debug: Exiting cleanly.
> {code}
> Tested affected systems (missing GLIBCXX_3.4.20 and GLIBCXX_3.4.21):
> Debian 8
> Ubuntu 14.04
> How to reproduce:
> 1) Prepare a Docker image with Python 2.7. Example Dockerfile:
> {code}
> FROM ubuntu:14.04
> RUN apt-get -y update && apt-get -y install python2.7
> {code}
> 2) Build and push the image to some repo, for example:
> {code}
> docker build -t mlesyk/ubuntu:14.04 . && docker push mlesyk/ubuntu:14.04
> {code}
> 3) Create a job with a Docker container and any command to run, for 
> example,
> {code}
> sleep 60
> {code}
> and appropriate container parameter, for example:
> {code}
> container = Docker(image='mlesyk/ubuntu:14.04')
> {code}
> 4) Run this job in Aurora and observe the error from the beginning of this ticket.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

[jira] [Commented] (AURORA-1899) Expose per role metrics around Thrift activity

2017-03-03 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15894777#comment-15894777
 ] 

Zameer Manji commented on AURORA-1899:
--

I support this idea, and we can put it behind a flag, like what we do for 
various kinds of SLA metrics.

[~StephanErb]: Consider the case where a single role/user launches tens of 
thousands of non-prod tasks at the same time. You can observe the aggregate 
change in the current metrics, but only the logs will tell you who did it.
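
A minimal sketch of the naming scheme (stat names assumed; a real implementation 
would cache the per-role counters the way the existing per-RPC counters are 
cached):

{noformat}
// Alongside the existing per-RPC counter...
counters.get("thrift_" + rpcName + "_events").incrementAndGet();
// ...also bump one scoped by the caller's role, so spikes are attributable.
counters.get("thrift_" + rpcName + "_events_by_role_" + role).incrementAndGet();
{noformat}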

> Expose per role metrics around Thrift activity
> --
>
> Key: AURORA-1899
> URL: https://issues.apache.org/jira/browse/AURORA-1899
> Project: Aurora
>  Issue Type: Task
>Reporter: David McLaughlin
>
> It's currently pretty easy for a single client to cause havoc on an Aurora 
> cluster. We triage most of these issues by grepping the Scheduler logs for 
> Thrift API calls and finding patterns around role names. 
> Figuring out what changed would be a lot easier if we could take the current 
> Thrift API metrics and export an additional metric for each one that is 
> scoped by the role. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1887) Create Driver implementation around V0Mesos.

2017-03-02 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1887?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15893209#comment-15893209
 ] 

Zameer Manji commented on AURORA-1887:
--

{noformat}
commit 705dbc7cd7c3ff477bcf766cdafe49a68ab47dee
Author: Zameer Manji 
Date:   Thu Mar 2 15:07:11 2017 -0800

Enable Mesos HTTP API.

This patch completes the design doc[1] and enables operators to choose between
two V1 Mesos API implementations. The first is `V0Mesos`, which offers the V1
API backed by the scheduler driver, and the second is `V1Mesos`, which offers
the V1 API backed by a new HTTP API implementation.

There are three sets of changes in this patch.

First, the V1 Mesos code requires a Scheduler callback with a different API. To
maximize code reuse, event handling logic was extracted into a
`MesosCallbackHandler` class. `VersionedMesosSchedulerImpl` was created to
implement the new callback interface. Both callbacks now use the handler class
for logic.

Second, a new driver implementation using the new API was created. All of the
logic for the new driver is encapsulated in the
`VersionedSchedulerDriverService` class.

Third, some wiring changes were done to allow Guice to do its work and to
allow operators to select between the different driver implementations.

[1] 
https://docs.google.com/document/d/1bWK8ldaQSsRXvdKwTh8tyR_0qMxAlnMW70eOKoU3myo

Testing Done:
The e2e test has been run three times, each time with a different driver 
option.

Bugs closed: AURORA-1887, AURORA-1888

Reviewed at https://reviews.apache.org/r/57061/

 RELEASE-NOTES.md   |   7 +
 examples/vagrant/upstart/aurora-scheduler.conf |   5 +-
 .../aurora/benchmark/StatusUpdateBenchmark.java|   6 +-
 .../org/apache/aurora/scheduler/app/AppModule.java |  12 +-
 .../apache/aurora/scheduler/app/SchedulerMain.java |  22 +-
 .../scheduler/mesos/LibMesosLoadingModule.java |  29 +-
 .../scheduler/mesos/MesosCallbackHandler.java  | 288 ++
 .../aurora/scheduler/mesos/MesosSchedulerImpl.java | 212 +-
 .../aurora/scheduler/mesos/ProtosConversion.java   |  28 ++
 .../scheduler/mesos/SchedulerDriverModule.java |  50 ++-
 ...dingModule.java => VersionedDriverFactory.java} |  20 +-
 .../mesos/VersionedMesosSchedulerImpl.java | 198 ++
 .../mesos/VersionedSchedulerDriverService.java | 254 
 .../apache/aurora/scheduler/app/SchedulerIT.java   |   7 +-
 .../scheduler/mesos/MesosCallbackHandlerTest.java  | 430 +
 .../scheduler/mesos/MesosSchedulerImplTest.java| 424 
 .../mesos/VersionedMesosSchedulerImplTest.java | 275 +
 .../mesos/VersionedSchedulerDriverServiceTest.java | 194 ++
 .../apache/aurora/scheduler/thrift/ThriftIT.java   |   3 +-
 19 files changed, 1888 insertions(+), 576 deletions(-)
{noformat}

> Create Driver implementation around V0Mesos.
> 
>
> Key: AURORA-1887
> URL: https://issues.apache.org/jira/browse/AURORA-1887
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>
> Create an implementation of the {{org.apache.aurora.scheduler.mesos.Driver}} 
> interface which uses the {{V0Mesos}} shim under the hood. Provide a flag to 
> switch between the two to show there is no regression.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1897) Remove task length restrictions.

2017-03-01 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1897:


 Summary: Remove task length restrictions.
 Key: AURORA-1897
 URL: https://issues.apache.org/jira/browse/AURORA-1897
 Project: Aurora
  Issue Type: Task
Reporter: Zameer Manji
Priority: Minor


Currently we restrict the total length of a task's name because of a Mesos bug:
{noformat}
  // This number is derived from the maximum file name length limit on most
  // UNIX systems, less the number of characters we've observed being added by
  // mesos for the executor ID, prefix, and delimiters.
  @VisibleForTesting
  static final int MAX_TASK_ID_LENGTH = 255 - 90;

  // TODO(maximk): This is a short-term hack to stop the bleeding from
  //   https://issues.apache.org/jira/browse/MESOS-691
  if (taskIdGenerator.generate(task, totalInstances).length() > MAX_TASK_ID_LENGTH) {
    throw new TaskValidationException(
        "Task ID is too long, please shorten your role or job name.");
  }
{noformat} 

However [~codyg] recently 
[asked|https://lists.apache.org/thread.html/ca92420fe6394d6467f70989e1ffadac23775e84cf7356ff8c9efdd5@%3Cdev.mesos.apache.org%3E]
 on the Mesos mailing list about MESOS-691 and learned that it is no longer 
valid.

We should remove this restriction.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (AURORA-1860) Fix bug in scheduler driver disconnect stats

2017-02-27 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1860?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji resolved AURORA-1860.
--
Resolution: Fixed

{noformat}
commit 2652fe02a2255992e187fede2bae8ff6aef2862c
Author: Ilya Pronin 
Date:   Mon Feb 27 11:04:54 2017 -0800

Fix scheduler_framework_disconnects stat.

Refactoring in r/31550 disabled incrementing the scheduler_framework_disconnects
stat. This change brings it back.

Testing Done:
Added a check to `MesosSchedulerImplTest.testDisconnected()`. Manually verified
in Vagrant by starting/stopping mesos-master and querying the `/vars` endpoint.

Bugs closed: AURORA-1860

Reviewed at https://reviews.apache.org/r/57074/

 .../java/org/apache/aurora/scheduler/mesos/MesosSchedulerImpl.java | 2 +-
 .../java/org/apache/aurora/scheduler/mesos/MesosSchedulerImplTest.java | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)
{noformat}
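
For reference, the bug and the fix differ by a single call (the "after" line is 
inferred from the fix description, not copied from the diff):

{noformat}
// Before: reads the counter but never increments it.
counters.get("scheduler_framework_disconnects").get();

// After: actually increments the stat on disconnect.
counters.get("scheduler_framework_disconnects").incrementAndGet();
{noformat}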

> Fix bug in scheduler driver disconnect stats
> 
>
> Key: AURORA-1860
> URL: https://issues.apache.org/jira/browse/AURORA-1860
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Mehrdad Nurolahzade
>Assignee: Ilya Pronin
>Priority: Minor
>  Labels: newbie
>
> Correct the refactoring mistake introduced in 
> [https://reviews.apache.org/r/31550/] that has disabled 
> {{scheduler_framework_disconnects}} stats:
> {code:title=MesosSchedulerImpl.disconnected()}
> counters.get("scheduler_framework_disconnects").get();
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1890) Job Update Pulse History is initialized to no pulses on scheduler recovery

2017-02-14 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1890:
-
Description: 
I have experienced the following problem with pulse updates. To reproduce:
1. Create an update with a pulse timeout of 1h
2. Send a pulse to get the update going.
3. Failover the scheduler immediately after.
4. Observe that the update is awaiting another pulse right after the failover.

This is because the {{JobUpdateControllerImpl}} stores pulse history and state 
in memory in {{PulseHandler}}. On scheduler startup, the pulse state is reset 
to no pulse received.

We can solve this by inferring the timestamp of the last pulse by inspecting 
the job update events.

  was:
I have experienced the following problem with pulse updates. To reproduce:
1. Create an update with a pulse timeout of 1h
2. Send a pulse to get the update going.
3. Failover the scheduler immediately after.
4. Observe that the update is awaiting another pulse right after the failover.

This is because the {{JobUpdateControllerImpl}} stores pulse history and state 
in memory in {{PulseHandler}}. On scheduler startup, the pulse state is reset 
to no pulse received.

We can solve this by durably storing the timestamp of the last pulse received 
in storage.


> Job Update Pulse History is initialized to no pulses on scheduler recovery
> --
>
> Key: AURORA-1890
> URL: https://issues.apache.org/jira/browse/AURORA-1890
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>
> I have experienced the following problem with pulse updates. To reproduce:
> 1. Create an update with a pulse timeout of 1h
> 2. Send a pulse to get the update going.
> 3. Failover the scheduler immediately after.
> 4. Observe that the update is awaiting another pulse right after the failover.
> This is because the {{JobUpdateControllerImpl}} stores pulse history and 
> state in memory in {{PulseHandler}}. On scheduler startup, the pulse state is 
> reset to no pulse received.
> We can solve this by inferring the timestamp of the last pulse by inspecting 
> the job update events.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (AURORA-1890) Job Update Pulse History is initialized to no pulses on scheduler recovery

2017-02-14 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji reassigned AURORA-1890:


Assignee: Zameer Manji

> Job Update Pulse History is initialized to no pulses on scheduler recovery
> --
>
> Key: AURORA-1890
> URL: https://issues.apache.org/jira/browse/AURORA-1890
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>
> I have experienced the following problem with pulse updates. To reproduce:
> 1. Create an update with a pulse timeout of 1h
> 2. Send a pulse to get the update going.
> 3. Failover the scheduler immediately after.
> 4. Observe that the update is awaiting another pulse right after the failover.
> This is because the {{JobUpdateControllerImpl}} stores pulse history and 
> state in memory in {{PulseHandler}}. On scheduler startup, the pulse state is 
> reset to no pulse received.
> We can solve this by inferring the timestamp of the last pulse by inspecting 
> the job update events.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1890) Job Update Pulse History is initialized to no pulses on scheduler recovery

2017-02-14 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1890:
-
Summary: Job Update Pulse History is initialized to no pulses on scheduler 
recovery  (was: Job Update Pulse History is not durably stored)

> Job Update Pulse History is initialized to no pulses on scheduler recovery
> --
>
> Key: AURORA-1890
> URL: https://issues.apache.org/jira/browse/AURORA-1890
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>
> I have experienced the following problem with pulse updates. To reproduce:
> 1. Create an update with a pulse timeout of 1h
> 2. Send a pulse to get the update going.
> 3. Failover the scheduler immediately after.
> 4. Observe that the update is awaiting another pulse right after the failover.
> This is because the {{JobUpdateControllerImpl}} stores pulse history and 
> state in memory in {{PulseHandler}}. On scheduler startup, the pulse state is 
> reset to no pulse received.
> We can solve this by durably storing the timestamp of the last pulse received 
> in storage.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1891) Unable to upgrade Guava

2017-02-13 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1891:


 Summary: Unable to upgrade Guava
 Key: AURORA-1891
 URL: https://issues.apache.org/jira/browse/AURORA-1891
 Project: Aurora
  Issue Type: Bug
Reporter: Zameer Manji
Priority: Minor


Guava 21 is out, with better Java 8 integration.

I cannot upgrade us. Bumping the dependency results in:

{noformat}
/Users/zmanji/code/aurora/src/main/java/org/apache/aurora/scheduler/storage/log/WriteAheadStorage.java:82:
 error: cannot find symbol
class WriteAheadStorage extends WriteAheadStorageForwarder implements
^
  symbol: class WriteAheadStorageForwarder
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith': 
class file for com.google.errorprone.annotations.CompatibleWith not found
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multiset.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multiset.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/code/aurora/src/main/java/org/apache/aurora/scheduler/storage/log/WriteAheadStorage.java:74:
 Note: Wrote forwarder 
org.apache.aurora.scheduler.storage.log.WriteAheadStorageForwarder
@Forward({
^
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith': 
class file for com.google.errorprone.annotations.CompatibleWith not found
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'
/Users/zmanji/.gradle/caches/modules-2/files-2.1/com.google.guava/guava/21.0/3a3d111be1be1b745edfa7d91678a12d7ed38709/guava-21.0.jar(com/google/common/collect/Multimap.class):
 warning: Cannot find annotation method 'value()' in type 'CompatibleWith'

[jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored

2017-02-13 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864598#comment-15864598
 ] 

Zameer Manji commented on AURORA-1890:
--

I would be content with initializing the {{PulseState}} timestamp with the 
timestamp of the most recent event that transitioned out of 
{{BLOCKED_AWAITING_PULSE}}.

I feel this is more correct than what we do now, avoids hashing out some 
storage changes, and is suitable for my current use case.

If you confirm that you agree, I can rephrase this ticket to better capture 
what the fix would be.

> Job Update Pulse History is not durably stored
> --
>
> Key: AURORA-1890
> URL: https://issues.apache.org/jira/browse/AURORA-1890
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>
> I have experienced the following problem with pulse updates. To reproduce:
> 1. Create an update with a pulse timeout of 1h
> 2. Send a pulse to get the update going.
> 3. Failover the scheduler immediately after.
> 4. Observe that the update is awaiting another pulse right after the failover.
> This is because the {{JobUpdateControllerImpl}} stores pulse history and 
> state in memory in {{PulseHandler}}. On scheduler startup, the pulse state is 
> reset to no pulse received.
> We can solve this by durably storing the timestamp of the last pulse received 
> in storage.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1890) Job Update Pulse History is not durably stored

2017-02-13 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15864537#comment-15864537
 ] 

Zameer Manji commented on AURORA-1890:
--

The scheduler does the right thing on the first pulse. However, on failover, any 
coordinated updates are immediately sent to BLOCKED_AWAITING_PULSE. This is 
because on scheduler startup the pulse state is reset to no pulse received. The 
code sets the timestamp of the last pulse received to 0L:

{noformat}
synchronized void initializePulseState(IJobUpdate update, JobUpdateStatus 
status) {
  pulseStates.put(update.getSummary().getKey(), new PulseState(
  status,
  update.getInstructions().getSettings().getBlockIfNoPulsesAfterMs(),
  0L));
}
{noformat}

Would it be ok to set the timestamp to that of the first event after the most 
recent {{BLOCKED_AWAITING_PULSE}}? We know for sure that a pulse was received by 
that point in time, because of the state transition from 
{{BLOCKED_AWAITING_PULSE}} to some other status.
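
A minimal sketch of that inference (event accessors and ordering are assumed):

{noformat}
// Seed the pulse timestamp from the event that follows the most recent
// BLOCKED_AWAITING_PULSE: that transition proves a pulse had arrived by then.
long inferredLastPulseMs = 0L;
List<IJobUpdateEvent> events = details.getUpdateEvents();  // oldest-first
for (int i = 0; i + 1 < events.size(); i++) {
  if (isBlockedAwaitingPulse(events.get(i).getStatus())) {
    inferredLastPulseMs = events.get(i + 1).getTimestampMs();
  }
}
pulseStates.put(update.getSummary().getKey(), new PulseState(
    status,
    update.getInstructions().getSettings().getBlockIfNoPulsesAfterMs(),
    inferredLastPulseMs));
{noformat}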

Also, could you describe "significant" write volume? I can imagine that if the 
pulse interval were in the seconds and there were thousands of updates, it might 
be too much. However, we could prevent excessively small pulse intervals.

> Job Update Pulse History is not durably stored
> --
>
> Key: AURORA-1890
> URL: https://issues.apache.org/jira/browse/AURORA-1890
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>
> I have experienced the following problem with pulse updates. To reproduce:
> 1. Create an update with a pulse timeout of 1h
> 2. Send a pulse to get the update going.
> 3. Failover the scheduler immediately after.
> 4. Observe that the update is awaiting another pulse right after the failover.
> This is because the {{JobUpdateControllerImpl}} stores pulse history and 
> state in memory in {{PulseHandler}}. On scheduler startup, the pulse state is 
> reset to no pulse received.
> We can solve this by durably storing the timestamp of the last pulse received 
> in storage.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1890) Job Update Pulse History is not durably stored

2017-02-12 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1890:


 Summary: Job Update Pulse History is not durably stored
 Key: AURORA-1890
 URL: https://issues.apache.org/jira/browse/AURORA-1890
 Project: Aurora
  Issue Type: Bug
Reporter: Zameer Manji


I have experienced the following problem with pulse updates. To reproduce:
1. Create an update with a pulse timeout of 1h
2. Send a pulse to get the update going.
3. Failover the scheduler immediately after.
4. Observe that the update is awaiting another pulse right after the failover.

This is because the {{JobUpdateControllerImpl}} stores pulse history and state 
in memory in {{PulseHandler}}. On scheduler startup, the pulse state is reset 
to no pulse received.

We can solve this by durably storing the timestamp of the last pulse received 
in storage.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Resolved] (AURORA-1846) Add message parameter to killTasks RPC

2017-02-06 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji resolved AURORA-1846.
--
Resolution: Fixed

This is fixed on master:
{noformat}
commit f88b7f3bf5b7a7db6e422e38cbf22cf809f8ff87
Author: Cody Gibb 
Date:   Mon Feb 6 10:43:01 2017 -0800

Add message parameter to killTasks

RPCs such as pauseJobUpdate include a parameter for "a user-specified message
to include with the induced job update state change." This diff provides a
similar optional parameter for the killTasks RPC, which allows users to indicate
the reason why a task was killed, and later inspect that reason when consuming
task events.

Example usage from the Aurora CLI:
`$ aurora job killall devcluster/www-data/prod/hello --message "Some message"`

In the task event, the supplied message (if provided) is appended to the
existing template "Killed by <user>", separated by a newline. For the above
example, this looks like: "Killed by aurora\nSome message".

Testing Done:
Added a unit test in the scheduler, and a test in the client.

Also manually tested using the Vagrant environment.

Bugs closed: AURORA-1846

Reviewed at https://reviews.apache.org/r/54459/

 RELEASE-NOTES.md   |  7 +++
 .../main/thrift/org/apache/aurora/gen/api.thrift   |  2 +-
 .../aurora/scheduler/thrift/AuditMessages.java |  6 ++-
 .../scheduler/thrift/SchedulerThriftInterface.java |  8 +++-
 .../scheduler/thrift/aop/AnnotatedAuroraAdmin.java |  3 +-
 .../python/apache/aurora/client/api/__init__.py|  4 +-
 src/main/python/apache/aurora/client/cli/jobs.py   | 10 +++--
 .../apache/aurora/client/hooks/hooked_api.py   |  9 ++--
 .../http/api/security/HttpSecurityIT.java  | 21 -
 .../ShiroAuthorizingParamInterceptorTest.java  |  4 +-
 .../aurora/scheduler/thrift/AuditMessagesTest.java | 26 ++-
 .../thrift/SchedulerThriftInterfaceTest.java   | 27 +---
 src/test/python/apache/aurora/api_util.py  |  2 +-
 .../aurora/client/api/test_scheduler_client.py | 10 ++---
 .../python/apache/aurora/client/cli/test_kill.py   | 50 --
 .../apache/aurora/client/hooks/test_hooked_api.py  |  2 +-
 .../aurora/client/hooks/test_non_hooked_api.py |  6 +--
 .../sh/org/apache/aurora/e2e/test_end_to_end.sh| 10 -
 18 files changed, 146 insertions(+), 61 deletions(-)
{noformat}



> Add message parameter to killTasks RPC
> --
>
> Key: AURORA-1846
> URL: https://issues.apache.org/jira/browse/AURORA-1846
> Project: Aurora
>  Issue Type: Task
>  Components: Client, Scheduler
>Affects Versions: 0.16.0
>Reporter: Cody Gibb
>Assignee: Cody Gibb
>Priority: Minor
>
> RPCs such as pauseJobUpdate include a parameter for "a user-specified 
> message to include with the induced job update state change." Having a 
> similar parameter for killTasks would allow us to indicate the reason why a 
> task was killed, and later inspect that reason when querying 
> getTasksWithoutConfigs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (AURORA-1886) Migrate Aurora to use V1 protobufs

2017-02-02 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1886?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1886:
-
Issue Type: Task  (was: Story)

> Migrate Aurora to use V1 protobufs
> --
>
> Key: AURORA-1886
> URL: https://issues.apache.org/jira/browse/AURORA-1886
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>
> To migrate to the V1 API, Aurora needs to start using the V1 protobufs.
> The Driver interface and Scheduler callback from Mesos will accept 
> unversioned protobufs and convert them when required.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1886) Migrate Aurora to use V1 protobufs

2017-02-02 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1886:


 Summary: Migrate Aurora to use V1 protobufs
 Key: AURORA-1886
 URL: https://issues.apache.org/jira/browse/AURORA-1886
 Project: Aurora
  Issue Type: Story
Reporter: Zameer Manji


To migrate to the V1 API, Aurora needs to start using the V1 protobufs.

The Driver interface and Scheduler callback from mesos will accept unversioned 
protobufs and convert them when required.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (AURORA-1885) Support the Mesos V1 API

2017-02-02 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji reassigned AURORA-1885:


Assignee: Zameer Manji

> Support the Mesos V1 API
> 
>
> Key: AURORA-1885
> URL: https://issues.apache.org/jira/browse/AURORA-1885
> Project: Aurora
>  Issue Type: Epic
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>
> This ticket tracks the work outlined in the design doc: 
> https://docs.google.com/document/d/1bWK8ldaQSsRXvdKwTh8tyR_0qMxAlnMW70eOKoU3myo/edit#heading=h.itk6ht9i1yha



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (AURORA-1885) Support the Mesos V1 API

2017-02-02 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1885:


 Summary: Support the Mesos V1 API
 Key: AURORA-1885
 URL: https://issues.apache.org/jira/browse/AURORA-1885
 Project: Aurora
  Issue Type: Epic
Reporter: Zameer Manji


This ticket tracks the work outlined in the design doc: 
https://docs.google.com/document/d/1bWK8ldaQSsRXvdKwTh8tyR_0qMxAlnMW70eOKoU3myo/edit#heading=h.itk6ht9i1yha



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (AURORA-1669) Kill twitter/commons ZK libs when Curator replacements are vetted

2017-01-18 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15828697#comment-15828697
 ] 

Zameer Manji commented on AURORA-1669:
--

I'm unable to complete the diff; I'm hoping [~jsirois] can guide it to 
completion.

> Kill twitter/commons ZK libs when Curator replacements are vetted
> -
>
> Key: AURORA-1669
> URL: https://issues.apache.org/jira/browse/AURORA-1669
> Project: Aurora
>  Issue Type: Task
>Reporter: John Sirois
>Assignee: John Sirois
> Fix For: 0.17.0
>
>
> Once we have reports from production users that the Curator zk plumbing 
> introduced in AURORA-1468 is working well, the {{-zk_use_curator}} flag 
> should be deprecated and then the flag and commons code killed.  If the 
> vetting happens before the next release ({{0.14.0}}), we can dispense with a 
> deprecation cycle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1799) Thermos does not handle low memory scenarios gracefully

2017-01-17 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15827017#comment-15827017
 ] 

Zameer Manji commented on AURORA-1799:
--

Today [~benley] reported something similar in Slack:

{noformat}
ERROR] Failed to stop health checkers:
ERROR] Traceback (most recent call last):
  File "apache/aurora/executor/aurora_executor.py", line 192, in _shutdown
propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
  File "apache/aurora/executor/aurora_executor.py", line 35, in 
propagate_deadline
return deadline(*args, daemon=True, propagate=True, **kw)
  File 
"/root/.pex/install/twitter.common.concurrent-0.3.3-py2-none-any.whl.33d9c24da69d7478b4aa6d76f474f3773a61f6f9/twitter.common.concurrent-0.3.3-py2-none-any.whl/twitter/common/concurrent/dead
line.py", line 61, in deadline
AnonymousThread().start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
_start_new_thread(self.__bootstrap, ())
error: can't start new thread
ERROR] Failed to stop runner:
ERROR] Traceback (most recent call last):
  File "apache/aurora/executor/aurora_executor.py", line 200, in _shutdown
propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
  File "apache/aurora/executor/aurora_executor.py", line 35, in 
propagate_deadline
return deadline(*args, daemon=True, propagate=True, **kw)
  File 
"/root/.pex/install/twitter.common.concurrent-0.3.3-py2-none-any.whl.33d9c24da69d7478b4aa6d76f474f3773a61f6f9/twitter.common.concurrent-0.3.3-py2-none-any.whl/twitter/common/concurrent/dead
line.py", line 61, in deadline
AnonymousThread().start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
_start_new_thread(self.__bootstrap, ())
error: can't start new thread
Traceback (most recent call last):
  File 
"/root/.pex/install/twitter.common.exceptions-0.3.3-py2-none-any.whl.57572b1f0a301c36c91adf2c704d0e8dd4d48429/twitter.common.exceptions-0.3.3-py2-none-any.whl/twitter/common/exceptions/__in
it__.py", line 126, in _excepting_run
self.__real_run(*args, **kw)
  File "apache/aurora/executor/status_manager.py", line 50, in run
  File "apache/aurora/executor/aurora_executor.py", line 218, in _shutdown
  File 
"/root/.pex/install/twitter.common.concurrent-0.3.3-py2-none-any.whl.33d9c24da69d7478b4aa6d76f474f3773a61f6f9/twitter.common.concurrent-0.3.3-py2-none-any.whl/twitter/common/concurrent/defe
rred.py", line 56, in defer
deferred.start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
_start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread
Traceback (most recent call last):
  File 
"/root/.pex/install/twitter.common.exceptions-0.3.3-py2-none-any.whl.57572b1f0a301c36c91adf2c704d0e8dd4d48429/twitter.common.exceptions-0.3.3-py2-none-any.whl/twitter/common/exceptions/__in
it__.py", line 126, in _excepting_run
self.__real_run(*args, **kw)
  File "apache/thermos/monitoring/resource.py", line 239, in run
  File 
"/root/.pex/install/twitter.common.concurrent-0.3.3-py2-none-any.whl.33d9c24da69d7478b4aa6d76f474f3773a61f6f9/twitter.common.concurrent-0.3.3-py2-none-any.whl/twitter/common/concurrent/even
t_muxer.py", line 79, in wait
thread.start()
  File "/usr/lib/python2.7/threading.py", line 745, in start
_start_new_thread(self.__bootstrap, ())
thread.error: can't start new thread

E0116 20:46:46.56877534 socket.hpp:174] Shutdown failed on fd=13: Transport 
endpoint is not connected [107]
E0116 20:46:51.78901634 socket.hpp:174] Shutdown failed on fd=14: Transport 
endpoint is not connected [107]
E0116 20:50:47.90499934 socket.hpp:174] Shutdown failed on fd=13: Transport 
endpoint is not connected [107]
E0116 20:50:48.09745734 socket.hpp:174] Shutdown failed on fd=13: Transport 
endpoint is not connected [107]
E0116 20:50:50.27705334 socket.hpp:174] Shutdown failed on fd=13: Transport 
endpoint is not connected [107]
E0116 20:50:51.00681634 socket.hpp:174] Shutdown failed on fd=13: Transport 
endpoint is not connected [107]
E0116 20:50:51.02212334 socket.hpp:174] Shutdown failed on fd=13: Transport 
endpoint is not connected [107]
E0116 20:50:51.24417934 socket.hpp:174] Shutdown failed on fd=13: Transport 
endpoint is not connected [107]
E0116 20:50:55.40700634 socket.hpp:174] Shutdown failed on fd=14: Transport 
endpoint is not connected [107]
E0116 20:50:55.41075934 socket.hpp:174] Shutdown failed on fd=15: Transport 
endpoint is not connected [107]
E0116 20:50:56.70334834 socket.hpp:174] Shutdown failed on fd=14: Transport 
endpoint is not connected [107]
E0116 20:50:56.70747134 socket.hpp:174] Shutdown failed on fd=15: Transport 
endpoint is not connected [107]
E0116 20:50:56.71240634 socket.hpp:174] Shutdown failed on fd=16: Transport 
endpoint is not connected [107]
E0116 20:50:57.05304534 socket.hpp:174] Shutdown failed on fd=14: Transport 
endpoint is not 

[jira] [Commented] (AURORA-1858) Expose stats on offers known to scheduler

2016-12-14 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15749644#comment-15749644
 ] 

Zameer Manji commented on AURORA-1858:
--

Isn't this what the "outstanding_offers" metric is?

> Expose stats on offers known to scheduler
> -
>
> Key: AURORA-1858
> URL: https://issues.apache.org/jira/browse/AURORA-1858
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: newbie
>
> Expose stats on the number of offers tracked by {{OfferManager}}. This can 
> simply be defined as a collection-size gauge on the {{offers}} set.
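
A minimal sketch of such a gauge, using a plain {{Supplier}} in place of 
Aurora's real stats plumbing (the class and registry below are hypothetical, 
for illustration only):

{code}
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

final class OfferStatsSketch {
  // Hypothetical registry standing in for the scheduler's stats provider.
  private final Map<String, Supplier<Number>> gauges = new ConcurrentHashMap<>();

  void registerOfferGauge(Set<?> offers) {
    // A live view of the collection size; nothing is copied on each read.
    gauges.put("outstanding_offers", offers::size);
  }

  Number read(String name) {
    return gauges.get(name).get();
  }
}
{code}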



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1806) Enhance Aurora KILLED message for tasks killed for update.

2016-12-08 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1806:
-
Assignee: Abhishek Jain

> Enhance Aurora KILLED message for tasks killed for update.
> --
>
> Key: AURORA-1806
> URL: https://issues.apache.org/jira/browse/AURORA-1806
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Zameer Manji
>Assignee: Abhishek Jain
>Priority: Trivial
>  Labels: newbie
>
> Right now if a task is killed for an update the message in the UI and task 
> storage is "Killed for job update.".
> This should be enhanced to include the update id.
> Currently, I see the timestamp of the kill and then look at the update 
> history to see which update caused it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1806) Enhance Aurora KILLED message for tasks killed for update.

2016-12-08 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15734040#comment-15734040
 ] 

Zameer Manji commented on AURORA-1806:
--

Done.

> Enhance Aurora KILLED message for tasks killed for update.
> --
>
> Key: AURORA-1806
> URL: https://issues.apache.org/jira/browse/AURORA-1806
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Zameer Manji
>Assignee: Abhishek Jain
>Priority: Trivial
>  Labels: newbie
>
> Right now if a task is killed for an update the message in the UI and task 
> storage is "Killed for job update.".
> This should be enhanced to include the update id.
> Currently, I see the timestamp of the kill and then look at the update 
> history to see which update caused it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1847) Eliminate sequential scan in MemTaskStore.getJobKeys()

2016-12-06 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1847?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15726960#comment-15726960
 ] 

Zameer Manji commented on AURORA-1847:
--

Could this be resolved by moving to {{DBTaskStore}} or does that have too many 
drawbacks?

> Eliminate sequential scan in MemTaskStore.getJobKeys()
> --
>
> Key: AURORA-1847
> URL: https://issues.apache.org/jira/browse/AURORA-1847
> Project: Aurora
>  Issue Type: Story
>  Components: Efficiency, UI
>Reporter: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: newbie
>
> The existing {{TaskStoreBenchmarks}} shows {{DBTaskStore}} is almost two 
> orders of magnitude faster than {{MemTaskStore}} when it comes to 
> {{getJobKeys()}}:
> {code}
> Benchmark                                       (numTasks)   Mode  Cnt      Score      Error  Units
> TaskStoreBenchmarks.DBFetchTasksBenchmark.run            1  thrpt    5  78430.531 ± 3255.027  ops/s
> TaskStoreBenchmarks.DBFetchTasksBenchmark.run            5  thrpt    5  50774.988 ± 8986.951  ops/s
> TaskStoreBenchmarks.DBFetchTasksBenchmark.run           10  thrpt    5   2480.074 ± 9833.122  ops/s
> TaskStoreBenchmarks.MemFetchTasksBenchmark.run           1  thrpt    5   1189.568 ±  108.146  ops/s
> TaskStoreBenchmarks.MemFetchTasksBenchmark.run           5  thrpt    5    124.990 ±   27.605  ops/s
> TaskStoreBenchmarks.MemFetchTasksBenchmark.run          10  thrpt    5     35.724 ±   15.101  ops/s
> {code}
> If scheduler is configured to run with the {{MemTaskStore}} every hit on 
> scheduler page ({{/scheduler}}) causes a call to 
> {{MemTaskStore.getJobKeys()}}. 
> The implementation of this method is currently very inefficient: it performs 
> a sequential scan of the task store and then maps each task to its job key. 
> Both the scan and the mapping can be eliminated by simply returning the key 
> set of the existing secondary index {{job}}.
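
A minimal sketch of that fix, with hypothetical types (the real 
{{MemTaskStore}} index plumbing may differ):

{code}
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

final class MemTaskStoreSketch<J, T> {
  // Hypothetical secondary index from job key to tasks, maintained on every
  // task insert and delete.
  private final Map<J, Set<T>> byJob = new HashMap<>();

  Set<J> getJobKeys() {
    // An O(1) view of the index keys instead of an O(n) scan over all tasks
    // followed by a mapping to job keys.
    return byJob.keySet();
  }
}
{code}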



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1823) `createJob` API uses single thread to move all tasks to PENDING

2016-12-05 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15723217#comment-15723217
 ] 

Zameer Manji commented on AURORA-1823:
--

Although I think that the {{createJob}} API should use multiple threads to move 
a job's tasks into PENDING, benchmarking shows logging is still the slowest 
part.

There was a good performance improvement in 
https://github.com/apache/aurora/commit/4bc5246149f296b14dc520bedd71747fdb2578fb
 so I think I'm just going to close this for now.

> `createJob` API uses single thread to move all tasks to PENDING 
> 
>
> Key: AURORA-1823
> URL: https://issues.apache.org/jira/browse/AURORA-1823
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>Priority: Minor
>
> If you create a single job with many tasks (let's say 10k+) the `createJob` 
> API will take a long time. This is because the `createJob` API only returns 
> when all of the tasks have moved to PENDING and it uses a single thread to do 
> so. Here is a snippet of the logs:
> {noformat}
> ...
> I1116 17:11:53.964 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
>  state machine transition INIT -> PENDING
> I1116 17:11:53.965 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
> I1116 17:11:54.094 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
>  state machine transition INIT -> PENDING
> I1116 17:11:54.094 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
> I1116 17:11:54.223 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
>  state machine transition INIT -> PENDING
> I1116 17:11:54.224 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
> I1116 17:11:54.353 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
>  state machine transition INIT -> PENDING
> I1116 17:11:54.353 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
> I1116 17:11:54.482 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
>  state machine transition INIT -> PENDING
> I1116 17:11:54.482 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
> I1116 17:11:54.611 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
>  state machine transition INIT -> PENDING
> I1116 17:11:54.612 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
> I1116 17:11:54.741 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
>  state machine transition INIT -> PENDING
> I1116 17:11:54.742 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
> ...
> {noformat}
> Observe that a single jetty thread is doing this.
> We should leverage {{BatchWorker}} to have concurrent mutations here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1844) Force a snapshot at the end of Scheduler startup.

2016-12-02 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15716903#comment-15716903
 ] 

Zameer Manji commented on AURORA-1844:
--

This might be a dupe of AURORA-1812

> Force a snapshot at the end of Scheduler startup.
> -
>
> Key: AURORA-1844
> URL: https://issues.apache.org/jira/browse/AURORA-1844
> Project: Aurora
>  Issue Type: Task
>Reporter: Santhosh Kumar Shanmugham
>Priority: Minor
>
> When the scheduler starts up, it replays the logs from the replicated log to 
> catch up with the current state before announcing itself as the leader to 
> the outside world. If for any reason the scheduler dies after this replay, 
> having added more log entries, the next startup will have to redo the work. 
> This becomes a problem when the amount of additional work is not trivial, 
> and it can take the scheduler down the path of a spiraling death. One 
> example of this is when the TaskHistoryPruner cleans up the DB but adds to 
> the log entries. To avoid the repeated work, the scheduler should force a 
> snapshot after the initial replay.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1831) Tweak logging pattern to improve performance

2016-12-01 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji reassigned AURORA-1831:


Assignee: Zameer Manji

> Tweak logging pattern to improve performance
> 
>
> Key: AURORA-1831
> URL: https://issues.apache.org/jira/browse/AURORA-1831
> Project: Aurora
>  Issue Type: Task
>  Components: Efficiency
>Reporter: Mehrdad Nurolahzade
>Assignee: Zameer Manji
>Priority: Minor
>  Labels: newbie
>
> The choice of logging pattern can have an impact on system performance. 
> Using expensive patterns like class name or line number is discouraged for 
> performance-critical systems like Aurora. 
> A recent experiment with the task state machine benchmark revealed a ~2x 
> performance improvement when the class name and line number patterns were 
> removed. Tweak Aurora's default logging pattern to improve logging 
> performance.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1669) Kill twitter/commons ZK libs when Curator replacements are vetted

2016-12-01 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712330#comment-15712330
 ] 

Zameer Manji commented on AURORA-1669:
--

Here is my assessment of how to fix AURORA-1840:

* We cannot upgrade to Curator 3.x because it only works with ZK 3.5.x, which 
has not been released yet.
* We can move to the {{LeaderSelector}} recipe (per [~StephanErb]'s suggestion) 
and figure out how to make it backwards compatible for leader discovery; see 
the sketch after this list.
* We can figure out how to override the error-handling behaviour of 
{{LeaderLatch}} so that it does not lose leadership on session suspension, only 
on session loss.
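
A sketch of the {{LeaderSelector}} option, assuming we are free to override 
{{stateChanged()}} (the default adapter cancels leadership on both SUSPENDED 
and LOST, which is exactly the behaviour we want to avoid):

{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderSelector;
import org.apache.curator.framework.recipes.leader.LeaderSelectorListenerAdapter;
import org.apache.curator.framework.state.ConnectionState;

final class SchedulerLeaderSketch extends LeaderSelectorListenerAdapter {
  @Override
  public void takeLeadership(CuratorFramework client) throws Exception {
    // Hold leadership until interrupted; returning relinquishes it.
    Thread.currentThread().join();
  }

  @Override
  public void stateChanged(CuratorFramework client, ConnectionState newState) {
    // Relinquish only on session LOSS, not on SUSPENDED (e.g. a GC pause).
    if (newState == ConnectionState.LOST) {
      super.stateChanged(client, newState);  // throws CancelLeadershipException
    }
  }

  static LeaderSelector create(CuratorFramework client, String path) {
    LeaderSelector selector =
        new LeaderSelector(client, path, new SchedulerLeaderSketch());
    selector.autoRequeue();  // rejoin the election if leadership is ever lost
    return selector;
  }
}
{code}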

> Kill twitter/commons ZK libs when Curator replacements are vetted
> -
>
> Key: AURORA-1669
> URL: https://issues.apache.org/jira/browse/AURORA-1669
> Project: Aurora
>  Issue Type: Task
>Reporter: John Sirois
>Assignee: John Sirois
>
> Once we have reports from production users that the Curator zk plumbing 
> introduced in AURORA-1468 is working well, the {{-zk_use_curator}} flag 
> should be deprecated and then the flag and commons code killed.  If the 
> vetting happens before the next release ({{0.14.0}}), we can dispense with a 
> deprecation cycle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (AURORA-1669) Kill twitter/commons ZK libs when Curator replacements are vetted

2016-12-01 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji reopened AURORA-1669:
--

Re-opening this because of AURORA-1840

Per [~jsirois]'s suggestion we may need to upgrade 
[Curator|https://issues.apache.org/jira/browse/AURORA-1840?focusedCommentId=15712226=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15712226].

> Kill twitter/commons ZK libs when Curator replacements are vetted
> -
>
> Key: AURORA-1669
> URL: https://issues.apache.org/jira/browse/AURORA-1669
> Project: Aurora
>  Issue Type: Task
>Reporter: John Sirois
>Assignee: John Sirois
>
> Once we have reports from production users that the Curator zk plumbing 
> introduced in AURORA-1468 is working well, the {{-zk_use_curator}} flag 
> should be deprecated and then the flag and commons code killed.  If the 
> vetting happens before the next release ({{0.14.0}}), we can dispense with a 
> deprecation cycle.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1840) Issue with Curator-backed discovery under heavy load

2016-12-01 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712240#comment-15712240
 ] 

Zameer Manji commented on AURORA-1840:
--

+1

This seems identical to the behaviour of the previous implementation.

> Issue with Curator-backed discovery under heavy load
> 
>
> Key: AURORA-1840
> URL: https://issues.apache.org/jira/browse/AURORA-1840
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: David McLaughlin
>Assignee: David McLaughlin
>Priority: Blocker
> Fix For: 0.17.0
>
>
> We've been having some performance issues recently with our production 
> clusters at Twitter. A side effect of these is occasional stop-the-world GC 
> pauses for up to 15 seconds. This has been happening at our scale for quite 
> some time, but previous versions of the Scheduler were resilient to this and 
> no leadership change would occur. 
> Since we moved to Curator, we are no longer resilient to these GC pauses. The 
> Scheduler is now failing over any time we see a GC pause, even though these 
> pauses are within the session timeout. Here is an example pause in the 
> scheduler logs with the associated ZK session timeout that leads to a 
> failover:
> {code}
> I1118 19:40:16.871801 51800 sched.cpp:1025] Scheduler::statusUpdate took 
> 586236ns
> I1118 19:40:16.902 [TaskGroupBatchWorker, StateMachine$Builder:389] 
> redacted-9f565b4-067e-422f-b641-c6000f9ae2c8 state machine transition PENDING 
> -> ASSIGNED 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskStateMachine:474] Adding work 
> command SAVE_STATE for redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskAssigner$TaskAssignerImpl:130] 
> Offer on agent redacted (id 566ae347-c1b6-44ce-8551-b7a6cda72989-S7579) is 
> being assigned task redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8. 
> W1118 19:40:31.744 [main-SendThread(redacted:2181), 
> ClientCnxn$SendThread:1108] Client session timed out, have not heard from 
> server in 20743ms for sessionid 0x6584fd2b34ede86 
> {code}
> As you can see from the timestamps, there was a 15s GC pause (confirmed in 
> our GC logs - a CMS promotion failure caused the pause) and this triggers a 
> session timeout of 20s to fire. Note: we have seen GC pauses as little as 7s 
> cause the same behavior. Removed: my ZK was rusty. 20s is 2/3 of our 30s ZK 
> timeout, so our session timeout is being wired through fine. 
> We have confirmed that the Scheduler no longer fails over when deploying from 
> HEAD with these two commits reverted and setting zk_use_curator to false:
> https://github.com/apache/aurora/commit/b417be38fe1fcae6b85f7e91cea961ab272adf3f
> https://github.com/apache/aurora/commit/69cba786efc2628eab566201dfea46836a1d9af5
> This is a pretty big blocker for us given how expensive Scheduler failovers 
> are (currently several minutes for us). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1840) Issue with Curator-backed discovery under heavy load

2016-12-01 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15712224#comment-15712224
 ] 

Zameer Manji commented on AURORA-1840:
--

I don't object to reverting this until some analysis can be done.

> Issue with Curator-backed discovery under heavy load
> 
>
> Key: AURORA-1840
> URL: https://issues.apache.org/jira/browse/AURORA-1840
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: David McLaughlin
>Assignee: David McLaughlin
>Priority: Blocker
> Fix For: 0.17.0
>
>
> We've been having some performance issues recently with our production 
> clusters at Twitter. A side effect of these is occasional stop-the-world GC 
> pauses for up to 15 seconds. This has been happening at our scale for quite 
> some time, but previous versions of the Scheduler were resilient to this and 
> no leadership change would occur. 
> Since we moved to Curator, we are no longer resilient to these GC pauses. The 
> Scheduler is now failing over any time we see a GC pause, even though these 
> pauses are within the session timeout. Here is an example pause in the 
> scheduler logs with the associated ZK session timeout that leads to a 
> failover:
> {code}
> I1118 19:40:16.871801 51800 sched.cpp:1025] Scheduler::statusUpdate took 
> 586236ns
> I1118 19:40:16.902 [TaskGroupBatchWorker, StateMachine$Builder:389] 
> redacted-9f565b4-067e-422f-b641-c6000f9ae2c8 state machine transition PENDING 
> -> ASSIGNED 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskStateMachine:474] Adding work 
> command SAVE_STATE for redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8 
> I1118 19:40:16.903 [TaskGroupBatchWorker, TaskAssigner$TaskAssignerImpl:130] 
> Offer on agent redacted (id 566ae347-c1b6-44ce-8551-b7a6cda72989-S7579) is 
> being assigned task redacted-0-49f565b4-067e-422f-b641-c6000f9ae2c8. 
> W1118 19:40:31.744 [main-SendThread(redacted:2181), 
> ClientCnxn$SendThread:1108] Client session timed out, have not heard from 
> server in 20743ms for sessionid 0x6584fd2b34ede86 
> {code}
> As you can see from the timestamps, there was a 15s GC pause (confirmed in 
> our GC logs - a CMS promotion failure caused the pause) and this triggers a 
> session timeout of 20s to fire. Note: we have seen GC pauses as little as 7s 
> cause the same behavior. Removed: my ZK was rusty. 20s is 2/3 of our 30s ZK 
> timeout, so our session timeout is being wired through fine. 
> We have confirmed that the Scheduler no longer fails over when deploying from 
> HEAD with these two commits reverted and setting zk_use_curator to false:
> https://github.com/apache/aurora/commit/b417be38fe1fcae6b85f7e91cea961ab272adf3f
> https://github.com/apache/aurora/commit/69cba786efc2628eab566201dfea46836a1d9af5
> This is a pretty big blocker for us given how expensive Scheduler failovers 
> are (currently several minutes for us). 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1834) Expose stats on undelivered event bus events

2016-11-29 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1834?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15706300#comment-15706300
 ] 

Zameer Manji commented on AURORA-1834:
--

This is a good idea; we should count these much like we count uncaught 
exceptions in the scheduling loop. The stat would be good to alert on and 
would let us track regressions.
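
A minimal sketch, assuming a plain {{AtomicLong}} in place of the scheduler's 
real stats plumbing:

{code}
import com.google.common.eventbus.DeadEvent;
import com.google.common.eventbus.EventBus;
import com.google.common.eventbus.Subscribe;
import java.util.concurrent.atomic.AtomicLong;

final class DeadEventCounter {
  final AtomicLong deadEvents = new AtomicLong();

  @Subscribe
  public void handle(DeadEvent event) {
    // Count in addition to the existing log line; the counter would be
    // exported as a stat (e.g. "event_bus_dead_events") to alert on.
    deadEvents.incrementAndGet();
    System.err.println("No subscribers for event: " + event.getEvent());
  }

  public static void main(String[] args) {
    EventBus bus = new EventBus();
    DeadEventCounter counter = new DeadEventCounter();
    bus.register(counter);
    bus.post("nobody is listening");  // wrapped in a DeadEvent and counted
    System.out.println(counter.deadEvents.get());  // 1
  }
}
{code}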

> Expose stats on undelivered event bus events
> 
>
> Key: AURORA-1834
> URL: https://issues.apache.org/jira/browse/AURORA-1834
> Project: Aurora
>  Issue Type: Task
>  Components: Scheduler
>Reporter: Mehrdad Nurolahzade
>Priority: Minor
>  Labels: newbie
>
> {{DeadEvent}} is a wrapper for an event that was posted, but which had no 
> subscribers and thus could not be delivered. {{PubSubEventModule}} is 
> currently utilizing a {{DeadEventHandler}} for logging such events but it 
> should additionally expose stats.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1825) Enable async logging by default

2016-11-23 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15691758#comment-15691758
 ] 

Zameer Manji commented on AURORA-1825:
--

Locally I removed the expensive parts of our logback config with:
{noformat}
diff --git c/src/main/resources/logback.xml w/src/main/resources/logback.xml
index 84c175c..6206806 100644
--- c/src/main/resources/logback.xml
+++ w/src/main/resources/logback.xml
@@ -23,7 +23,7 @@ limitations under the License.
     <target>System.err</target>
 
     <encoder>
-      <pattern>%.-1level%date{MMdd HH:mm:ss.SSS} [%thread, %class{0}:%line] %message %xThrowable%n</pattern>
+      <pattern>%.-1level%date{MMdd HH:mm:ss.SSS} [%thread] %message %xThrowable%n</pattern>
     </encoder>
   </appender>
 
{noformat}

Before:
{noformat}
Benchmark                                               (numPendingTasks)  (numTasksToDelete)   Mode  Cnt  Score   Error  Units
StateManagerBenchmarks.DeleteTasksBenchmark.run                       N/A                1000  thrpt   10  2.510 ± 0.557  ops/s
StateManagerBenchmarks.DeleteTasksBenchmark.run                       N/A                   1  thrpt   10  0.272 ± 0.030  ops/s
StateManagerBenchmarks.DeleteTasksBenchmark.run                       N/A                   5  thrpt   10  0.053 ± 0.011  ops/s
StateManagerBenchmarks.InsertPendingTasksBenchmark.run               1000                 N/A  thrpt   10  2.446 ± 0.698  ops/s
StateManagerBenchmarks.InsertPendingTasksBenchmark.run                  1                 N/A  thrpt   10  0.246 ± 0.018  ops/s
StateManagerBenchmarks.InsertPendingTasksBenchmark.run                  5                 N/A  thrpt   10  0.041 ± 0.006  ops/s
{noformat}

After:

{noformat}
Benchmark                                               (numPendingTasks)  (numTasksToDelete)   Mode  Cnt  Score   Error  Units
StateManagerBenchmarks.DeleteTasksBenchmark.run                       N/A                1000  thrpt   10  8.640 ± 1.431  ops/s
StateManagerBenchmarks.DeleteTasksBenchmark.run                       N/A                   1  thrpt   10  0.892 ± 0.066  ops/s
StateManagerBenchmarks.DeleteTasksBenchmark.run                       N/A                   5  thrpt   10  0.172 ± 0.010  ops/s
StateManagerBenchmarks.InsertPendingTasksBenchmark.run               1000                 N/A  thrpt   10  4.837 ± 1.511  ops/s
StateManagerBenchmarks.InsertPendingTasksBenchmark.run                  1                 N/A  thrpt   10  0.510 ± 0.315  ops/s
StateManagerBenchmarks.InsertPendingTasksBenchmark.run                  5                 N/A  thrpt   10  0.079 ± 0.052  ops/s
{noformat}

I picked this benchmark because it logs a lot in the critical path.

We could probably fix this problem by removing the line number and replacing 
the class name with the logger name. The net result would be no line numbers 
but much faster logging.
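
A sketch of the resulting pattern, assuming logback's {{%logger{0}}} conversion 
word (which prints only the last segment of the logger name) is an acceptable 
substitute for {{%class{0}}}:

{noformat}
%.-1level%date{MMdd HH:mm:ss.SSS} [%thread, %logger{0}] %message %xThrowable%n
{noformat}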

> Enable async logging by default
> ---
>
> Key: AURORA-1825
> URL: https://issues.apache.org/jira/browse/AURORA-1825
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Jing Chen
>Priority: Minor
>
> Based on my experience while working on AURORA-1823 and [~StephanErb]'s work 
> on logging recently, I think it would be best if we enabled async logging.
> For example if one attempts to parallelize the work inside 
> {{StateManagerImpl}} there isn't much benefit because all of the state 
> transitions are logged and all of the threads would contend for the lock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1827) Fix SLA percentile calculation

2016-11-23 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15691040#comment-15691040
 ] 

Zameer Manji commented on AURORA-1827:
--

I upgraded us to Guava 20. It has a 
[Quantiles|http://google.github.io/guava/releases/20.0/api/docs/com/google/common/math/Quantiles.html]
 class and a 
[Stats|http://google.github.io/guava/releases/20.0/api/docs/com/google/common/math/Stats.html]
 class that could be very helpful here.
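
For example, {{Quantiles}} interpolates between sample values, so the median of 
{50, 150} comes out as 100 rather than 50. A sketch, not the actual {{SlaUtil}} 
change:

{code}
import com.google.common.math.Quantiles;
import com.google.common.math.Stats;

final class PercentileSketch {
  public static void main(String[] args) {
    // Linear interpolation between the two samples yields 100.0.
    double median = Quantiles.median().compute(50, 150);
    // Arbitrary percentiles work the same way.
    double p90 = Quantiles.percentiles().index(90).compute(10, 20, 30, 40, 50);
    System.out.println(median);                 // 100.0
    System.out.println(p90);                    // 46.0 (interpolated)
    System.out.println(Stats.meanOf(50, 150));  // 100.0
  }
}
{code}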

> Fix SLA percentile calculation 
> ---
>
> Key: AURORA-1827
> URL: https://issues.apache.org/jira/browse/AURORA-1827
> Project: Aurora
>  Issue Type: Story
>Reporter: Reza Motamedi
>Priority: Trivial
>  Labels: newbie, sla
>
> The calculation of mttX (median-time-to-X) depends on the computation of 
> percentile values. The current implementation does not behave nicely with 
> small sample sizes. For instance, for the sample set {50, 150}, the 50th 
> percentile is reported as 50, although 100 would be a more appropriate 
> return value.
> One solution is to modify `SlaUtil` to perform interpolation when the 
> sample size is small or when the index corresponding to a percentile value 
> is not an integer. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1823) `createJob` API uses single thread to move all tasks to PENDING

2016-11-22 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15688685#comment-15688685
 ] 

Zameer Manji commented on AURORA-1823:
--

Benchmarks for {{StateManagerImpl}} to validate any changes: 
https://reviews.apache.org/r/54011/

> `createJob` API uses single thread to move all tasks to PENDING 
> 
>
> Key: AURORA-1823
> URL: https://issues.apache.org/jira/browse/AURORA-1823
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>Priority: Minor
>
> If you create a single job with many tasks (let's say 10k+) the `createJob` 
> API will take a long time. This is because the `createJob` API only returns 
> when all of the tasks have moved to PENDING and it uses a single thread to do 
> so. Here is a snippet of the logs:
> {noformat}
> ...
> I1116 17:11:53.964 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
>  state machine transition INIT -> PENDING
> I1116 17:11:53.965 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
> I1116 17:11:54.094 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
>  state machine transition INIT -> PENDING
> I1116 17:11:54.094 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
> I1116 17:11:54.223 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
>  state machine transition INIT -> PENDING
> I1116 17:11:54.224 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
> I1116 17:11:54.353 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
>  state machine transition INIT -> PENDING
> I1116 17:11:54.353 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
> I1116 17:11:54.482 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
>  state machine transition INIT -> PENDING
> I1116 17:11:54.482 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
> I1116 17:11:54.611 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
>  state machine transition INIT -> PENDING
> I1116 17:11:54.612 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
> I1116 17:11:54.741 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
>  state machine transition INIT -> PENDING
> I1116 17:11:54.742 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
> ...
> {noformat}
> Observe that a single jetty thread is doing this.
> We should leverage {{BatchWorker}} to have concurrent mutations here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1825) Enable async logging by default

2016-11-22 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15688506#comment-15688506
 ] 

Zameer Manji commented on AURORA-1825:
--

We could achieve this by changing {{logback.xml}} to wrap the existing appender 
in logback's AsyncAppender: 
http://logback.qos.ch/manual/appenders.html#AsyncAppender
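
A minimal sketch of that change, assuming the existing console appender is 
named {{CONSOLE}}:

{noformat}
<appender name="ASYNC" class="ch.qos.logback.classic.AsyncAppender">
  <!-- Logging threads enqueue and return; a worker thread does the I/O. -->
  <appender-ref ref="CONSOLE" />
</appender>

<root level="INFO">
  <appender-ref ref="ASYNC" />
</root>
{noformat}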

> Enable async logging by default
> ---
>
> Key: AURORA-1825
> URL: https://issues.apache.org/jira/browse/AURORA-1825
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Priority: Minor
>
> Based on my experience while working on AURORA-1823 and [~StephanErb]'s work 
> on logging recently, I think it would be best if we enabled async logging.
> For example if one attempts to parallelize the work inside 
> {{StateManagerImpl}} there isn't much benefit because all of the state 
> transitions are logged and all of the threads would contend for the lock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1825) Enable async logging by default

2016-11-22 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1825:


 Summary: Enable async logging by default
 Key: AURORA-1825
 URL: https://issues.apache.org/jira/browse/AURORA-1825
 Project: Aurora
  Issue Type: Task
Reporter: Zameer Manji
Priority: Minor


Based on my experience while working on AURORA-1823 and [~StephanErb]'s work on 
logging recently, I think it would be best if we enabled async logging.

For example if one attempts to parallelize the work inside {{StateManagerImpl}} 
there isn't much benefit because all of the state transitions are logged and 
all of the threads would contend for the lock.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1823) `createJob` API uses single thread to move all tasks to PENDING

2016-11-22 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji reassigned AURORA-1823:


Assignee: Zameer Manji

> `createJob` API uses single thread to move all tasks to PENDING 
> 
>
> Key: AURORA-1823
> URL: https://issues.apache.org/jira/browse/AURORA-1823
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>Priority: Minor
>
> If you create a single job with many tasks (let's say 10k+) the `createJob` 
> API will take a long time. This is because the `createJob` API only returns 
> when all of the tasks have moved to PENDING and it uses a single thread to do 
> so. Here is a snippet of the logs:
> {noformat}
> ...
> I1116 17:11:53.964 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
>  state machine transition INIT -> PENDING
> I1116 17:11:53.965 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
> I1116 17:11:54.094 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
>  state machine transition INIT -> PENDING
> I1116 17:11:54.094 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
> I1116 17:11:54.223 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
>  state machine transition INIT -> PENDING
> I1116 17:11:54.224 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
> I1116 17:11:54.353 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
>  state machine transition INIT -> PENDING
> I1116 17:11:54.353 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
> I1116 17:11:54.482 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
>  state machine transition INIT -> PENDING
> I1116 17:11:54.482 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
> I1116 17:11:54.611 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
>  state machine transition INIT -> PENDING
> I1116 17:11:54.612 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
> I1116 17:11:54.741 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
>  state machine transition INIT -> PENDING
> I1116 17:11:54.742 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
> ...
> {noformat}
> Observe that a single jetty thread is doing this.
> We should leverage {{BatchWorker}} to have concurrent mutations here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1014) Client binding_helper to resolve docker label to a stable ID at create

2016-11-21 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15684360#comment-15684360
 ] 

Zameer Manji commented on AURORA-1014:
--

[~StephanErb] [~santhk]

Can we close this ticket and make a new one for Mesos images?

> Client binding_helper to resolve docker label to a stable ID at create
> --
>
> Key: AURORA-1014
> URL: https://issues.apache.org/jira/browse/AURORA-1014
> Project: Aurora
>  Issue Type: Story
>  Components: Client, Packaging
>Reporter: Kevin Sweeney
>Assignee: Santhosh Kumar Shanmugham
> Fix For: 0.17.0
>
>
> Follow-up from discussion on IRC:
> Some docker labels are mutable, meaning the image a task runs in could change 
> from restart to restart even if the rest of the task config doesn't change. 
> This breaks assumptions that make rolling updates the safe and preferred way 
> to deploy a new Aurora job.
> Add a binding helper that resolves a docker label to an immutable image 
> identifier at create time and make it the default for the Docker helper 
> introduced in https://reviews.apache.org/r/28920/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1823) `createJob` API uses single thread to move all tasks to PENDING

2016-11-19 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15679664#comment-15679664
 ] 

Zameer Manji commented on AURORA-1823:
--

Upon further analysis, {{BatchWorker}} might not help us here. After some JMH 
benchmarking and profiling, the biggest problem with {{insertPendingTasks}} is 
that it doesn't use the bulk storage API {{saveTasks}}; instead it calls 
{{mutateTask}} for every task that is moving to {{PENDING}}. I can get a 10x+ 
improvement in throughput by simply queueing up the mutations and side effects 
produced by the state machines and then calling {{saveTasks}} once all of the 
mutations have been computed.

I'm going to look into refactoring {{StateManagerImpl}} to support evaluating 
multiple task state machines concurrently and then merging all of the side 
effects from those state machines into a single operation.
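
A sketch of that shape, with hypothetical {{Storage}}/{{StateMachine}} types 
(the real {{saveTasks}}/{{mutateTask}} signatures differ):

{code}
import java.util.ArrayList;
import java.util.List;

final class BulkPendingSketch<Task, SideEffect> {
  interface Storage<T> {
    void saveTasks(List<T> tasks);  // one bulk write for the whole batch
  }

  interface StateMachine<T, S> {
    S toPending(T task);  // pure transition; produces a side effect, no I/O
  }

  void insertPending(
      List<Task> tasks,
      StateMachine<Task, SideEffect> machine,
      Storage<Task> storage,
      List<SideEffect> effects) {
    List<Task> pending = new ArrayList<>(tasks.size());
    for (Task task : tasks) {
      // Queue each mutation and its side effects instead of issuing one
      // mutateTask() storage write per task.
      effects.add(machine.toPending(task));
      pending.add(task);
    }
    storage.saveTasks(pending);  // persist all computed mutations at once
  }
}
{code}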


> `createJob` API uses single thread to move all tasks to PENDING 
> 
>
> Key: AURORA-1823
> URL: https://issues.apache.org/jira/browse/AURORA-1823
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Priority: Minor
>
> If you create a single job with many tasks (let's say 10k+) the `createJob` 
> API will take a long time. This is because the `createJob` API only returns 
> when all of the tasks have moved to PENDING and it uses a single thread to do 
> so. Here is a snippet of the logs:
> {noformat}
> ...
> I1116 17:11:53.964 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
>  state machine transition INIT -> PENDING
> I1116 17:11:53.965 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
> I1116 17:11:54.094 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
>  state machine transition INIT -> PENDING
> I1116 17:11:54.094 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
> I1116 17:11:54.223 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
>  state machine transition INIT -> PENDING
> I1116 17:11:54.224 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
> I1116 17:11:54.353 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
>  state machine transition INIT -> PENDING
> I1116 17:11:54.353 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
> I1116 17:11:54.482 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
>  state machine transition INIT -> PENDING
> I1116 17:11:54.482 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
> I1116 17:11:54.611 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
>  state machine transition INIT -> PENDING
> I1116 17:11:54.612 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
> I1116 17:11:54.741 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
>  state machine transition INIT -> PENDING
> I1116 17:11:54.742 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
> ...
> {noformat}
> Observe that a single jetty thread is doing this.
> We should leverage {{BatchWorker}} to have concurrent mutations here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1823) `createJob` API uses single thread to move all tasks to PENDING

2016-11-18 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15677409#comment-15677409
 ] 

Zameer Manji commented on AURORA-1823:
--

Agreed that our API should do this.

Simple profiling indicates this is slow because a single thread iterates over 
every task and does a single write for each one. If we did batching, a single 
thread could move many tasks to PENDING at a time, and if we used 
{{BatchWorker}} we could have a pool of threads doing this.

I'm not going to change the semantics of the API with {{BatchWorker}}. 
{{BatchWorker}} provides a future, and its caller can block until the future 
resolves. Instead I think it would be best to move multiple tasks from INIT to 
PENDING at a time and have multiple threads doing that concurrently, since 
there is no data dependency between the tasks.

> `createJob` API uses single thread to move all tasks to PENDING 
> 
>
> Key: AURORA-1823
> URL: https://issues.apache.org/jira/browse/AURORA-1823
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Priority: Minor
>
> If you create a single job with many tasks (let's say 10k+) the `createJob` 
> API will take a long time. This is because the `createJob` API only returns 
> when all of the tasks have moved to PENDING and it uses a single thread to do 
> so. Here is a snippet of the logs:
> {noformat}
> ...
> I1116 17:11:53.964 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
>  state machine transition INIT -> PENDING
> I1116 17:11:53.965 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
> I1116 17:11:54.094 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
>  state machine transition INIT -> PENDING
> I1116 17:11:54.094 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
> I1116 17:11:54.223 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
>  state machine transition INIT -> PENDING
> I1116 17:11:54.224 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
> I1116 17:11:54.353 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
>  state machine transition INIT -> PENDING
> I1116 17:11:54.353 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
> I1116 17:11:54.482 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
>  state machine transition INIT -> PENDING
> I1116 17:11:54.482 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
> I1116 17:11:54.611 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
>  state machine transition INIT -> PENDING
> I1116 17:11:54.612 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
> I1116 17:11:54.741 [qtp1219612889-50, StateMachine$Builder:389] 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
>  state machine transition INIT -> PENDING
> I1116 17:11:54.742 [qtp1219612889-50, TaskStateMachine:474] Adding work 
> command SAVE_STATE for 
> sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
> ...
> {noformat}
> Observe that a single jetty thread is doing this.
> We should leverage {{BatchWorker}} to have concurrent mutations here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1823) `createJob` API uses single thread to move all tasks to PENDING

2016-11-17 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1823:


 Summary: `createJob` API uses single thread to move all tasks to 
PENDING 
 Key: AURORA-1823
 URL: https://issues.apache.org/jira/browse/AURORA-1823
 Project: Aurora
  Issue Type: Bug
Reporter: Zameer Manji
Priority: Minor


If you create a single job with many tasks (let's say 10k+) the `createJob` API 
will take a long time. This is because the `createJob` API only returns when 
all of the tasks have moved to PENDING and it uses a single thread to do so. 
Here is a snippet of the logs:

{noformat}
...
I1116 17:11:53.964 [qtp1219612889-50, StateMachine$Builder:389] 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
 state machine transition INIT -> PENDING
I1116 17:11:53.965 [qtp1219612889-50, TaskStateMachine:474] Adding work command 
SAVE_STATE for 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57114-8aff8e77-3bde-4a83-99eb-8c6e52f14a7a
I1116 17:11:54.094 [qtp1219612889-50, StateMachine$Builder:389] 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
 state machine transition INIT -> PENDING
I1116 17:11:54.094 [qtp1219612889-50, TaskStateMachine:474] Adding work command 
SAVE_STATE for 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57115-f5baa93f-78af-470d-bcdf-1d86c0b98c80
I1116 17:11:54.223 [qtp1219612889-50, StateMachine$Builder:389] 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
 state machine transition INIT -> PENDING
I1116 17:11:54.224 [qtp1219612889-50, TaskStateMachine:474] Adding work command 
SAVE_STATE for 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57116-0553d98c-f5de-4857-9a70-c5c748ddee03
I1116 17:11:54.353 [qtp1219612889-50, StateMachine$Builder:389] 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
 state machine transition INIT -> PENDING
I1116 17:11:54.353 [qtp1219612889-50, TaskStateMachine:474] Adding work command 
SAVE_STATE for 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57117-46e168f6-8753-4be0-873d-f18d1f562570
I1116 17:11:54.482 [qtp1219612889-50, StateMachine$Builder:389] 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
 state machine transition INIT -> PENDING
I1116 17:11:54.482 [qtp1219612889-50, TaskStateMachine:474] Adding work command 
SAVE_STATE for 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57118-ac94b4fb-f319-4ca2-b788-2ee093ef1c67
I1116 17:11:54.611 [qtp1219612889-50, StateMachine$Builder:389] 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
 state machine transition INIT -> PENDING
I1116 17:11:54.612 [qtp1219612889-50, TaskStateMachine:474] Adding work command 
SAVE_STATE for 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57119-060ef7fc-7e17-4f8c-83dc-216550332153
I1116 17:11:54.741 [qtp1219612889-50, StateMachine$Builder:389] 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
 state machine transition INIT -> PENDING
I1116 17:11:54.742 [qtp1219612889-50, TaskStateMachine:474] Adding work command 
SAVE_STATE for 
sparker1-devel-echo-8017fae7-f592-49c7-bfef-fac912abecaa-57120-c163c750-3658-44b7-b1ea-43f5d503f7c9
...
{noformat}

Observe that a single jetty thread is doing this.

We should leverage {{BatchWorker}} to have concurrent mutations here.
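
As a rough sketch of the idea (Python with hypothetical names; the scheduler's 
real {{BatchWorker}} is Java and its API differs), the transitions could be 
chunked into batches and applied from a small worker pool instead of the single 
request thread:

{code}
from concurrent.futures import ThreadPoolExecutor

BATCH_SIZE = 100

def batches(items, size):
  # Yield fixed-size chunks of the task id list.
  for i in range(0, len(items), size):
    yield items[i:i + size]

def move_to_pending(task_ids, transition):
  # Apply the INIT -> PENDING transition from a worker pool rather than
  # the single jetty thread that is servicing the createJob request.
  def run_batch(batch):
    for task_id in batch:
      transition(task_id)

  with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(run_batch, b)
               for b in batches(list(task_ids), BATCH_SIZE)]
    for f in futures:
      f.result()  # surface any transition failure to the caller
{code}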



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1821) Bump Guava to 20

2016-11-15 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji reassigned AURORA-1821:


Assignee: Zameer Manji

> Bump Guava to 20
> 
>
> Key: AURORA-1821
> URL: https://issues.apache.org/jira/browse/AURORA-1821
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>
> Guava 20 is now out with a bunch of improvements. We should take in the 
> upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1821) Bump Guava to 20

2016-11-15 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1821:


 Summary: Bump Guava to 20
 Key: AURORA-1821
 URL: https://issues.apache.org/jira/browse/AURORA-1821
 Project: Aurora
  Issue Type: Task
Reporter: Zameer Manji


Guava 20 is now out with a bunch of improvements. We should take in the upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1820) Reduce storage write lock contention by adopting Double-Checked Locking pattern in TimedOutTaskHandler

2016-11-15 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1820?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15667969#comment-15667969
 ] 

Zameer Manji commented on AURORA-1820:
--

Good find [~mnurolahzade]!

Do we measure throughput of {{TimedOutTaskHandler}} in benchmarks already?

> Reduce storage write lock contention by adopting Double-Checked Locking 
> pattern in TimedOutTaskHandler
> --
>
> Key: AURORA-1820
> URL: https://issues.apache.org/jira/browse/AURORA-1820
> Project: Aurora
>  Issue Type: Task
>  Components: Efficiency, Scheduler
>Reporter: Mehrdad Nurolahzade
>Assignee: Mehrdad Nurolahzade
>Priority: Critical
>
> {{TimedOutTaskHandler}} acquires the storage write lock for every task every 
> time it transitions to a transient state. It then verifies, after a default 
> time-out period of 5 minutes, whether the task has transitioned out of the 
> transient state.
> The verification step takes place while holding the storage write lock. In 
> over 99% of cases the logic short-circuits and returns from 
> {{StateManagerImpl.updateTaskAndExternalState()}} once it learns the task has 
> transitioned out of the transient state.
> Reduce storage write lock contention by adopting the [Double-Checked 
> Locking|https://en.wikipedia.org/wiki/Double-checked_locking] pattern in 
> {{TimedOutTaskHandler.run()}}.
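
For reference, the shape of the fix in a minimal Python sketch ({{is_transient}} 
and the state names are illustrative; the real handler is Java): read the state 
without the lock first, and take the write lock only when the task still looks 
transient:

{code}
import threading

write_lock = threading.Lock()  # stand-in for the storage write lock

def is_transient(state):
  return state in ('ASSIGNED', 'STARTING', 'KILLING')  # illustrative set

def handle_timeout(task_id, get_state, transition_to_lost):
  # First check: a cheap read without the write lock. In over 99% of
  # cases the task has already left the transient state and we return.
  if not is_transient(get_state(task_id)):
    return
  with write_lock:
    # Second check under the lock: the task may have transitioned
    # between the first check and lock acquisition.
    if is_transient(get_state(task_id)):
      transition_to_lost(task_id)
{code}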



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1815) Fix checksums for packages on bintray

2016-11-10 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1815:
-
Fix Version/s: 0.17.0

> Fix checksums for packages on bintray
> -
>
> Key: AURORA-1815
> URL: https://issues.apache.org/jira/browse/AURORA-1815
> Project: Aurora
>  Issue Type: Story
>  Components: Packaging
>Affects Versions: 0.16.0
>Reporter: Thomas Bach
>Priority: Minor
> Fix For: 0.17.0
>
>
> The checksum files on bintray are wrong. Take for example the content of 
> {{aurora-scheduler_0.16.0_amd64.deb.sha}}:
> {quote}
> b6203f169df44d9a91df3dfe4670950c3ab49eb4  
> /Users/jcohen/workspace/external/aurora-packaging/artifacts/aurora-ubuntu-trusty/dist/aurora-scheduler_0.16.0_amd64.deb
> {quote}
> This should actually be:
> {quote}
> b6203f169df44d9a91df3dfe4670950c3ab49eb4  aurora-scheduler_0.16.0_amd64.deb
> {quote}
> NOTE: The checksums themselves seem to be correct. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1815) Fix checksums for packages on bintray

2016-11-10 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15655788#comment-15655788
 ] 

Zameer Manji commented on AURORA-1815:
--

Seems like a problem with the script/tooling.

We should fix this before 0.17 and figure out how to fix the old shas.
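
The tooling change should amount to hashing and recording the bare file name 
instead of the absolute build path; a rough Python sketch (function name 
hypothetical):

{code}
import hashlib
import os

def write_sha_file(artifact_path):
  sha = hashlib.sha1()
  with open(artifact_path, 'rb') as fp:
    sha.update(fp.read())
  # Record only the basename so `shasum -c` works from the user's
  # download directory, not just the build machine's absolute path.
  with open(artifact_path + '.sha', 'w') as out:
    out.write('%s  %s\n' % (sha.hexdigest(),
                            os.path.basename(artifact_path)))
{code}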

> Fix checksums for packages on bintray
> -
>
> Key: AURORA-1815
> URL: https://issues.apache.org/jira/browse/AURORA-1815
> Project: Aurora
>  Issue Type: Story
>  Components: Packaging
>Affects Versions: 0.16.0
>Reporter: Thomas Bach
>Priority: Minor
> Fix For: 0.17.0
>
>
> The checksum files on bintray are wrong. Take for example the content of 
> {{aurora-scheduler_0.16.0_amd64.deb.sha}}:
> {quote}
> b6203f169df44d9a91df3dfe4670950c3ab49eb4  
> /Users/jcohen/workspace/external/aurora-packaging/artifacts/aurora-ubuntu-trusty/dist/aurora-scheduler_0.16.0_amd64.deb
> {quote}
> This should actually be:
> {quote}
> b6203f169df44d9a91df3dfe4670950c3ab49eb4  aurora-scheduler_0.16.0_amd64.deb
> {quote}
> NOTE: The checksums themselves seem to be correct. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1812) Upgrading scheduler multiple times in succession can lead to incompatible snapshot restore

2016-11-09 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1812:
-
Fix Version/s: 0.17.0

> Upgrading scheduler multiple times in succession can lead to incompatible 
> snapshot restore 
> ---
>
> Key: AURORA-1812
> URL: https://issues.apache.org/jira/browse/AURORA-1812
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 0.14.0
> Environment: Mesos-0.27.2 aurora-scheduler-0.14.0
>Reporter: Patrick Veasey
>Priority: Minor
> Fix For: 0.17.0
>
>
> When upgrading the scheduler multiple times in a row there can be a situation 
> where the snapshot that is restored is from an incompatible version, which 
> will cause the scheduler to fail to start with SQL exceptions. The workaround 
> is to ensure that the most recent snapshot was taken by the current version 
> of aurora, either by manually triggering a snapshot or by setting 
> dlog_snapshot_interval to a low value. 
> Log of failure can be found here:
> https://gist.github.com/Pveasey/4ca1ad4d3ded21cd6e1674f20a8a4af3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1812) Upgrading scheduler multiple times in succession can lead to incompatible snapshot restore

2016-11-09 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652305#comment-15652305
 ] 

Zameer Manji commented on AURORA-1812:
--

I've put it on the list for 0.17.

The fix could be changing our docs to say that upgrading from old versions 
requires the operator to trigger a snapshot manually via `aurora_admin`, and 
that from 0.17+ this is no longer necessary.

> Upgrading scheduler multiple times in succession can lead to incompatible 
> snapshot restore 
> ---
>
> Key: AURORA-1812
> URL: https://issues.apache.org/jira/browse/AURORA-1812
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 0.14.0
> Environment: Mesos-0.27.2 aurora-scheduler-0.14.0
>Reporter: Patrick Veasey
>Priority: Minor
> Fix For: 0.17.0
>
>
> When upgrading the scheduler multiple times in a row there can be a situation 
> where the snapshot that is restored is from an incompatible version, which 
> will cause the scheduler to fail to start with SQL exceptions. The workaround 
> is to ensure that the most recent snapshot was taken by the current version 
> of aurora, either by manually triggering a snapshot or by setting 
> dlog_snapshot_interval to a low value. 
> Log of failure can be found here:
> https://gist.github.com/Pveasey/4ca1ad4d3ded21cd6e1674f20a8a4af3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1814) Consider supporting PARTITION_AWARE capability

2016-11-09 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1814:


 Summary: Consider supporting PARTITION_AWARE capability 
 Key: AURORA-1814
 URL: https://issues.apache.org/jira/browse/AURORA-1814
 Project: Aurora
  Issue Type: Task
Reporter: Zameer Manji


Mesos 1.1.0 comes with a new capability called {{PARTITION_AWARE}}. If we opt 
in, the following states would replace {{TASK_LOST}}:

{noformat}
TASK_DROPPED
TASK_UNREACHABLE
TASK_GONE
TASK_GONE_BY_OPERATOR
TASK_UNKNOWN
{noformat}

We should consider adopting this, even if the initial cut is just mapping all 
of those new states to {{TASK_LOST}} internally.

These new states might simplify our reconciliation code.
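
A first cut could be little more than a lookup table; a Python sketch of the 
idea (the scheduler is Java, and the names here are illustrative):

{code}
# Map the new PARTITION_AWARE states onto the single state the
# scheduler already understands.
PARTITION_AWARE_TO_LEGACY = {
    'TASK_DROPPED': 'TASK_LOST',
    'TASK_UNREACHABLE': 'TASK_LOST',
    'TASK_GONE': 'TASK_LOST',
    'TASK_GONE_BY_OPERATOR': 'TASK_LOST',
    'TASK_UNKNOWN': 'TASK_LOST',
}

def normalize_state(mesos_state):
  # States we already handle pass through unchanged.
  return PARTITION_AWARE_TO_LEGACY.get(mesos_state, mesos_state)
{code}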



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1800) Support Mesos Maintenance primitives

2016-11-09 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652287#comment-15652287
 ] 

Zameer Manji commented on AURORA-1800:
--

Mesos 1.1.0 comes with a new HTTP-based driver. I think this is blocked on 
upgrading to that first.

> Support Mesos Maintenance primitives
> 
>
> Key: AURORA-1800
> URL: https://issues.apache.org/jira/browse/AURORA-1800
> Project: Aurora
>  Issue Type: Story
>  Components: Maintenance
>Reporter: Ankit Khera
>
> Support Mesos Maintenance primitives
> Mesos 0.25.0 introduced the notion of maintenance primitives, with which 
> operators can post a maintenance schedule for machines.  
> More details here: http://mesos.apache.org/documentation/latest/maintenance/
> This is a request to have aurora start using these primitives and drain 
> machines in an SLA-aware manner. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1813) Bump Mesos support to 1.1.0

2016-11-09 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1813:


 Summary: Bump Mesos support to 1.1.0
 Key: AURORA-1813
 URL: https://issues.apache.org/jira/browse/AURORA-1813
 Project: Aurora
  Issue Type: Task
Reporter: Zameer Manji


RC3 is out for Mesos 1.1.0 and it looks like it is going to pass; we should 
bump our supported Mesos version to 1.1.0 in 0.17.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1813) Bump Mesos support to 1.1.0

2016-11-09 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1813:
-
Fix Version/s: 0.17.0

> Bump Mesos support to 1.1.0
> ---
>
> Key: AURORA-1813
> URL: https://issues.apache.org/jira/browse/AURORA-1813
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
> Fix For: 0.17.0
>
>
> RC3 is out for Mesos 1.1.0 and it looks like it is going to pass; we should 
> bump our supported Mesos version to 1.1.0 in 0.17.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1812) Upgrading scheduler multiple times in succession can lead to incompatible snapshot restore

2016-11-09 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15652255#comment-15652255
 ] 

Zameer Manji commented on AURORA-1812:
--

[~joshua.cohen] [~StephanErb]

Maybe we can fix this by having the scheduler take a (new) snapshot right after 
recovery if there were schema migrations?
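
Roughly (Python pseudocode with hypothetical names; the real recovery path is 
Java):

{code}
def recover(storage, snapshot_service):
  storage.restore_latest_snapshot()
  migrated = storage.apply_schema_migrations()  # True if any migration ran
  if migrated:
    # Persist a snapshot in the new schema right away, so the next
    # upgrade never restores one written by an older, incompatible
    # version.
    snapshot_service.take_snapshot()
{code}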

> Upgrading scheduler multiple times in succession can lead to incompatible 
> snapshot restore 
> ---
>
> Key: AURORA-1812
> URL: https://issues.apache.org/jira/browse/AURORA-1812
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 0.14.0
> Environment: Mesos-0.27.2 aurora-scheduler-0.14.0
>Reporter: Patrick Veasey
>Priority: Minor
>
> When upgrading the scheduler multiple times in a row there can be a situation 
> where the snapshot that is restored is from an incompatible version, which 
> will cause the scheduler to fail to start with SQL exceptions. The workaround 
> is to ensure that the most recent snapshot was taken by the current version 
> of aurora, either by manually triggering a snapshot or by setting 
> dlog_snapshot_interval to a low value. 
> Log of failure can be found here:
> https://gist.github.com/Pveasey/4ca1ad4d3ded21cd6e1674f20a8a4af3



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1809) Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed

2016-11-04 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1809:
-
Fix Version/s: 0.17.0

> Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed
> ---
>
> Key: AURORA-1809
> URL: https://issues.apache.org/jira/browse/AURORA-1809
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
> Fix For: 0.17.0
>
>
> If you run it as part of the full test suite it fails like this:
> {noformat}
>   FAILURES 
>  __ TestRunnerKillProcessGroup.test_pg_is_killed __
>  
>  self = <TestRunnerKillProcessGroup object at 0x7f0c79893e10>
>  
>  def test_pg_is_killed(self):
>    runner = self.start_runner()
>    tm = TaskMonitor(runner.tempdir, 
> runner.task_id)
>    self.wait_until_running(tm)
>    process_state, run_number = 
> tm.get_active_processes()[0]
>    assert process_state.process == 'process'
>    assert run_number == 0
>  
>    child_pidfile = os.path.join(runner.sandbox, 
> runner.task_id, 'child.txt')
>    while not os.path.exists(child_pidfile):
>  time.sleep(0.1)
>    parent_pidfile = os.path.join(runner.sandbox, 
> runner.task_id, 'parent.txt')
>    while not os.path.exists(parent_pidfile):
>  time.sleep(0.1)
>    with open(child_pidfile) as fp:
>  child_pid = int(fp.read().rstrip())
>    with open(parent_pidfile) as fp:
>  parent_pid = int(fp.read().rstrip())
>  
>    ps = ProcessProviderFactory.get()
>    ps.collect_all()
>    assert parent_pid in ps.pids()
>    assert child_pid in ps.pids()
>    assert child_pid in 
> ps.children_of(parent_pid)
>  
>    with open(os.path.join(runner.sandbox, 
> runner.task_id, 'exit.txt'), 'w') as fp:
>  fp.write('go away!')
>  
>    while tm.task_state() is not 
> TaskState.SUCCESS:
>  time.sleep(0.1)
>  
>    state = tm.get_state()
>    assert state.processes['process'][0].state == 
> ProcessState.SUCCESS
>  
>    ps.collect_all()
>    assert parent_pid not in ps.pids()
>  > assert child_pid not in ps.pids()
>  E assert 30475 not in set([1, 2, 3, 5, 7, 8, ...])
>  E  +  where set([1, 2, 3, 5, 7, 8, ...]) = <bound method 
> ProcessProvider_Procfs.pids of <ProcessProvider_Procfs object at 
> 0x7f0c798b1990>>()
>  E  +where <bound method ProcessProvider_Procfs.pids of 
> <ProcessProvider_Procfs object at 0x7f0c798b1990>> = <ProcessProvider_Procfs 
> object at 0x7f0c798b1990>.pids
>  
>  
> src/test/python/apache/thermos/core/test_staged_kill.py:287: AssertionError
>  -- Captured stderr call --
>  WARNING:root:Could not read from checkpoint 
> /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>  WARNING:root:Could not read from checkpoint 
> /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>  WARNING:root:Could not read from checkpoint 
> /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>  WARNING:root:Could not read from checkpoint 
> /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>  WARNING:root:Could not read from checkpoint 
> /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>  WARNING:root:Could not read from checkpoint 
> /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>  WARNING:root:Could not read from checkpoint 
> /tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
>   generated xml file: 
> 

[jira] [Created] (AURORA-1809) Investigate flaky test TestRunnerKillProcessGroup.test_pg_is_killed

2016-11-04 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1809:


 Summary: Investigate flaky test 
TestRunnerKillProcessGroup.test_pg_is_killed
 Key: AURORA-1809
 URL: https://issues.apache.org/jira/browse/AURORA-1809
 Project: Aurora
  Issue Type: Bug
Reporter: Zameer Manji


If you run it as part of the full test suite it fails like this:
{noformat}
  FAILURES 
 __ TestRunnerKillProcessGroup.test_pg_is_killed __
 
 self = <TestRunnerKillProcessGroup object at 0x7f0c79893e10>
 
 def test_pg_is_killed(self):
   runner = self.start_runner()
   tm = TaskMonitor(runner.tempdir, 
runner.task_id)
   self.wait_until_running(tm)
   process_state, run_number = 
tm.get_active_processes()[0]
   assert process_state.process == 'process'
   assert run_number == 0
 
   child_pidfile = os.path.join(runner.sandbox, 
runner.task_id, 'child.txt')
   while not os.path.exists(child_pidfile):
 time.sleep(0.1)
   parent_pidfile = os.path.join(runner.sandbox, 
runner.task_id, 'parent.txt')
   while not os.path.exists(parent_pidfile):
 time.sleep(0.1)
   with open(child_pidfile) as fp:
 child_pid = int(fp.read().rstrip())
   with open(parent_pidfile) as fp:
 parent_pid = int(fp.read().rstrip())
 
   ps = ProcessProviderFactory.get()
   ps.collect_all()
   assert parent_pid in ps.pids()
   assert child_pid in ps.pids()
   assert child_pid in 
ps.children_of(parent_pid)
 
   with open(os.path.join(runner.sandbox, 
runner.task_id, 'exit.txt'), 'w') as fp:
 fp.write('go away!')
 
   while tm.task_state() is not 
TaskState.SUCCESS:
 time.sleep(0.1)
 
   state = tm.get_state()
   assert state.processes['process'][0].state == 
ProcessState.SUCCESS
 
   ps.collect_all()
   assert parent_pid not in ps.pids()
 > assert child_pid not in ps.pids()
 E assert 30475 not in set([1, 2, 3, 5, 7, 8, ...])
 E  +  where set([1, 2, 3, 5, 7, 8, ...]) = <bound method 
ProcessProvider_Procfs.pids of <ProcessProvider_Procfs object at 
0x7f0c798b1990>>()
 E  +where <bound method ProcessProvider_Procfs.pids of 
<ProcessProvider_Procfs object at 0x7f0c798b1990>> = <ProcessProvider_Procfs 
object at 0x7f0c798b1990>.pids
 
 
src/test/python/apache/thermos/core/test_staged_kill.py:287: AssertionError
 -- Captured stderr call --
 WARNING:root:Could not read from checkpoint 
/tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
 WARNING:root:Could not read from checkpoint 
/tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
 WARNING:root:Could not read from checkpoint 
/tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
 WARNING:root:Could not read from checkpoint 
/tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
 WARNING:root:Could not read from checkpoint 
/tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
 WARNING:root:Could not read from checkpoint 
/tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
 WARNING:root:Could not read from checkpoint 
/tmp/tmp9WSRnw/checkpoints/1478305991773556-runner-base/runner
  generated xml file: 
/home/jenkins/jenkins-slave/workspace/AuroraBot/dist/test-results/415337499eb72578eab327a6487c1f5c9452b3d6.xml
 
  1 failed, 719 passed, 6 skipped, 1 warnings in 
206.00 seconds 
 
FAILURE
{noformat}


If you run the test as a one-off you see this:
{noformat}
00:45:32 00:00 [main]
   (To run a reporting server: ./pants server)
00:45:32 00:00   [setup]
00:45:32 00:00 [parse]fatal: Not a git repository (or any of the parent 
directories): .git

   

[jira] [Commented] (AURORA-1808) Thermos executor should send SIGTERM to daemonized processes

2016-11-04 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15637620#comment-15637620
 ] 

Zameer Manji commented on AURORA-1808:
--

https://github.com/apache/aurora/commit/5410c229f30d6d8e331cdddc5c84b9b2b5313c01

> Thermos executor should send SIGTERM to daemonized processes 
> -
>
> Key: AURORA-1808
> URL: https://issues.apache.org/jira/browse/AURORA-1808
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>
> Thermos loses track of double-forking processes, meaning on task teardown 
> the daemonized process will not receive a signal to shut down cleanly.
> This can be a serious issue if one is running two processes: 
> 1. nginx, which daemonizes and accepts HTTP requests.
> 2. A backend process that receives traffic from nginx over a local socket. 
> On task shutdown thermos will send SIGTERM to 2 and not 1, causing nginx to 
> still accept traffic even though the backend is dead. If thermos could also 
> send SIGTERM to 1, the task would tear down cleanly.
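
One way to reach double-forked children, sketched in Python (illustrative only, 
not necessarily what the commit above does): run each process as a session 
leader and signal its whole process group on teardown. Note that a daemon which 
calls setsid() itself would still escape the group:

{code}
import os
import signal
import subprocess

def launch(cmdline):
  # Make the child a session leader so its forked descendants stay in
  # the same process group (unless they start their own session).
  return subprocess.Popen(cmdline, shell=True, preexec_fn=os.setsid)

def terminate(proc):
  # Signal the whole group rather than just the direct child, reaching
  # daemonized processes that double-forked away from the runner.
  try:
    os.killpg(os.getpgid(proc.pid), signal.SIGTERM)
  except OSError:
    pass  # the group is already gone
{code}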



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1792) Executor does not log full task information.

2016-11-03 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634819#comment-15634819
 ] 

Zameer Manji commented on AURORA-1792:
--

https://reviews.apache.org/r/53452/

> Executor does not log full task information.
> 
>
> Key: AURORA-1792
> URL: https://issues.apache.org/jira/browse/AURORA-1792
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>
> I launched a task that has an {{initial_interval_secs}} in the health check 
> config. However, the log contains no information about this field:
> {noformat}
> $ grep "initial_interval_secs" __main__.log
> {noformat}
> We should log the entire ExecutorInfo blob.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1792) Executor does not log full task information.

2016-11-03 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji reassigned AURORA-1792:


Assignee: Zameer Manji

> Executor does not log full task information.
> 
>
> Key: AURORA-1792
> URL: https://issues.apache.org/jira/browse/AURORA-1792
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>
> I launched a task that has an {{initial_interval_secs}} in the health check 
> config. However, the log contains no information about this field:
> {noformat}
> $ grep "initial_interval_secs" __main__.log
> {noformat}
> We should log the entire ExecutorInfo blob.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1780) Offers with unknown resources types to Aurora crash the scheduler

2016-11-03 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15634508#comment-15634508
 ] 

Zameer Manji commented on AURORA-1780:
--

Yes, that is the most desirable course of action for now.

> Offers with unknown resources types to Aurora crash the scheduler
> -
>
> Key: AURORA-1780
> URL: https://issues.apache.org/jira/browse/AURORA-1780
> Project: Aurora
>  Issue Type: Bug
> Environment: vagrant
>Reporter: Renan DelValle
>
> Taking offers from Agents which have resources that are not known to Aurora 
> causes the Scheduler to crash.
> Steps to reproduce:
> {code}
> vagrant up
> sudo service mesos-slave stop
> echo 
> "cpus(aurora-role):0.5;cpus(*):3.5;mem(aurora-role):1024;disk:2;gpus(*):4;test:200"
>  | sudo tee /etc/mesos-slave/resources
> sudo rm -f /var/lib/mesos/meta/slaves/latest
> sudo service mesos-slave start
> {code}
> Wait a few moments for the offer to be made to Aurora.
> {code}
> I0922 02:41:57.839 [Thread-19, MesosSchedulerImpl:142] Received notification 
> of lost agent: value: "cadaf569-171d-42fc-a417-fbd608ea5bab-S0"
> I0922 02:42:30.585597  2999 log.cpp:577] Attempting to append 109 bytes to 
> the log
> I0922 02:42:30.585654  2999 coordinator.cpp:348] Coordinator attempting to 
> write APPEND action at position 4
> I0922 02:42:30.585747  2999 replica.cpp:537] Replica received write request 
> for position 4 from (10)@192.168.33.7:8083
> I0922 02:42:30.586858  2999 leveldb.cpp:341] Persisting action (125 bytes) to 
> leveldb took 1.086601ms
> I0922 02:42:30.586897  2999 replica.cpp:712] Persisted action at 4
> I0922 02:42:30.587020  2999 replica.cpp:691] Replica received learned notice 
> for position 4 from @0.0.0.0:0
> I0922 02:42:30.587785  2999 leveldb.cpp:341] Persisting action (127 bytes) to 
> leveldb took 746999ns
> I0922 02:42:30.587805  2999 replica.cpp:712] Persisted action at 4
> I0922 02:42:30.587811  2999 replica.cpp:697] Replica learned APPEND action at 
> position 4
> I0922 02:42:30.601 [SchedulerImpl-0, OfferManager$OfferManagerImpl:185] 
> Returning offers for cadaf569-171d-42fc-a417-fbd608ea5bab-S1 for compaction.
> Sep 22, 2016 2:42:38 AM 
> com.google.common.util.concurrent.ServiceManager$ServiceListener failed
> SEVERE: Service SlotSizeCounterService [FAILED] has failed in the RUNNING 
> state.
> java.lang.NullPointerException: Unknown Mesos resource: name: "test"
> type: SCALAR
> scalar {
>   value: 200.0
> }
> role: "*"
>   at java.util.Objects.requireNonNull(Objects.java:228)
>   at 
> org.apache.aurora.scheduler.resources.ResourceType.fromResource(ResourceType.java:355)
>   at 
> org.apache.aurora.scheduler.resources.ResourceManager.lambda$static$0(ResourceManager.java:52)
>   at com.google.common.collect.Iterators$7.computeNext(Iterators.java:675)
>   at 
> com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:143)
>   at 
> com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:138)
>   at java.util.Iterator.forEachRemaining(Iterator.java:115)
>   at 
> java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801)
>   at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
>   at 
> java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
>   at 
> java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
>   at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
>   at 
> java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
>   at 
> org.apache.aurora.scheduler.resources.ResourceManager.bagFromResources(ResourceManager.java:274)
>   at 
> org.apache.aurora.scheduler.resources.ResourceManager.bagFromMesosResources(ResourceManager.java:239)
>   at 
> org.apache.aurora.scheduler.stats.AsyncStatsModule$OfferAdapter.get(AsyncStatsModule.java:153)
>   at 
> org.apache.aurora.scheduler.stats.SlotSizeCounter.run(SlotSizeCounter.java:168)
>   at 
> org.apache.aurora.scheduler.stats.AsyncStatsModule$SlotSizeCounterService.runOneIteration(AsyncStatsModule.java:130)
>   at 
> com.google.common.util.concurrent.AbstractScheduledService$ServiceDelegate$Task.run(AbstractScheduledService.java:189)
>   at com.google.common.util.concurrent.Callables$3.run(Callables.java:100)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>   at 
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>   

[jira] [Updated] (AURORA-1808) Thermos executor should send SIGTERM to daemonized processes

2016-11-02 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1808:
-
Description: 
Thermos loses track of double-forking processes, meaning on task teardown the 
daemonized process will not receive a signal to shut down cleanly.

This can be a serious issue if one is running two processes: 
1. nginx, which daemonizes and accepts HTTP requests.
2. A backend process that receives traffic from nginx over a local socket. 

On task shutdown thermos will send SIGTERM to 2 and not 1, causing nginx to 
still accept traffic even though the backend is dead. If thermos could also 
send SIGTERM to 1, the task would tear down cleanly.

  was:
Thermos loses track of double forking processes, meaning on task teardown  the 
daemonized process will not receive a signal to shut down cleanly.

This can be a serious issue if one is running two processes: 
1. nginx which demonizes and accepts HTTP requests.
2. A back and processes that receives traffic from nginx over a local socket. 

On task shutdown thermos will send SIGTERM to 2 and not 1, causing nginx to 
still accept traffic even though the backend is dead. If thermos could also 
send SIGTERM to 1, the task would tear down cleanly.


> Thermos executor should send SIGTERM to daemonized processes 
> -
>
> Key: AURORA-1808
> URL: https://issues.apache.org/jira/browse/AURORA-1808
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>
> Thermos loses track of double-forking processes, meaning on task teardown 
> the daemonized process will not receive a signal to shut down cleanly.
> This can be a serious issue if one is running two processes: 
> 1. nginx, which daemonizes and accepts HTTP requests.
> 2. A backend process that receives traffic from nginx over a local socket. 
> On task shutdown thermos will send SIGTERM to 2 and not 1, causing nginx to 
> still accept traffic even though the backend is dead. If thermos could also 
> send SIGTERM to 1, the task would tear down cleanly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1808) Thermos executor should send SIGTERM to daemonized processes

2016-11-02 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15630367#comment-15630367
 ] 

Zameer Manji commented on AURORA-1808:
--

WIP Solution here: https://reviews.apache.org/r/53403/

> Thermos executor should send SIGTERM to daemonized processes 
> -
>
> Key: AURORA-1808
> URL: https://issues.apache.org/jira/browse/AURORA-1808
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Zameer Manji
>
> Thermos loses track of double-forking processes, meaning on task teardown 
> the daemonized process will not receive a signal to shut down cleanly.
> This can be a serious issue if one is running two processes: 
> 1. nginx, which daemonizes and accepts HTTP requests.
> 2. A backend process that receives traffic from nginx over a local socket. 
> On task shutdown thermos will send SIGTERM to 2 and not 1, causing nginx to 
> still accept traffic even though the backend is dead. If thermos could also 
> send SIGTERM to 1, the task would tear down cleanly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1808) Thermos executor should send SIGTERM to daemonized processes

2016-11-02 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1808:


 Summary: Thermos executor should send SIGTERM to daemonized 
processes 
 Key: AURORA-1808
 URL: https://issues.apache.org/jira/browse/AURORA-1808
 Project: Aurora
  Issue Type: Bug
Reporter: Zameer Manji
Assignee: Zameer Manji


Thermos loses track of double-forking processes, meaning on task teardown the 
daemonized process will not receive a signal to shut down cleanly.

This can be a serious issue if one is running two processes: 
1. nginx, which daemonizes and accepts HTTP requests.
2. A backend process that receives traffic from nginx over a local socket. 

On task shutdown thermos will send SIGTERM to 2 and not 1, causing nginx to 
still accept traffic even though the backend is dead. If thermos could also 
send SIGTERM to 1, the task would tear down cleanly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1107) Add support for mounting task specified external volumes into containers

2016-10-31 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15623909#comment-15623909
 ] 

Zameer Manji commented on AURORA-1107:
--

DSL + e2e tests https://reviews.apache.org/r/5/

> Add support for mounting task specified external volumes into containers
> 
>
> Key: AURORA-1107
> URL: https://issues.apache.org/jira/browse/AURORA-1107
> Project: Aurora
>  Issue Type: Task
>  Components: Docker
>Reporter: Steve Niemitz
>Assignee: Zameer Manji
>Priority: Minor
>
> The Mesos docker API allows specifying volumes on the host to mount into the 
> container when it runs.  We should expose this.  I propose:
>  - Add a volumes() set to the Docker object in base.py
>  - Add a similar set to the DockerContainer struct in api.thrift 
>  - Create a way for administrators to restrict the ability to use this.  
> Because mounts are set up by the docker daemon, they effectively allow 
> someone who can configure mounts to access anything on the machine.
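
Hypothetically, the DSL side of the proposal could mirror the existing 
pystachio structs in base.py (field names here are illustrative, not final):

{code}
# Hypothetical pystachio structs sketching the proposal above.
class Volume(Struct):
  host_path      = Required(String)
  container_path = Required(String)
  mode           = Default(String, 'RO')  # 'RO' or 'RW'

class Docker(Struct):
  image   = Required(String)
  volumes = Default(List(Volume), [])    # the proposed volumes() set
{code}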



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1805) Enhance `Process` object to allow easier access to environment variables

2016-10-27 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15613357#comment-15613357
 ] 

Zameer Manji commented on AURORA-1805:
--

It still suffers from the same string interpolation issues as constructing the 
command line.

> Enhance `Process` object to allow easier access to environment variables
> 
>
> Key: AURORA-1805
> URL: https://issues.apache.org/jira/browse/AURORA-1805
> Project: Aurora
>  Issue Type: Task
>  Components: Thermos
>Reporter: Zameer Manji
>
> The thermos DSL:
> {noformat}
> class Process(Struct):
>   cmdline = Required(String)
>   name= Required(String)
>   # This is currently unused but reserved for future use by Thermos.
>   resources = Resources
>   # optionals
>   max_failures  = Default(Integer, 1)  # maximum number of failed process 
> runs
># before process is failed.
>   daemon= Default(Boolean, False)
>   ephemeral = Default(Boolean, False)
>   min_duration  = Default(Integer, 5)  # integer seconds
>   final = Default(Boolean, False)  # if this process should be a 
> finalizing process
># that should always be run after 
> regular processes
>   logger= Default(Logger, Empty)
> {noformat}
> If we can add a new field:
> {noformat}
> environment = Default(Map(String, String), {})
> {noformat}
> It will make it much easier to add environment variables.
> Right now the solution is to prefix environment variables to the cmdline, 
> which can get janky and frustrating with the string interpolation.
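
For example, assuming the proposed field, a process definition would go from 
interpolated prefixes to a plain map (hypothetical usage):

{code}
# Today: environment variables spliced into the command line string.
server = Process(
  name = 'server',
  cmdline = 'LOG_LEVEL=INFO JAVA_HOME=/usr/lib/jvm/java-8 ./run_server.sh')

# With the proposed field: no string interpolation required.
server = Process(
  name = 'server',
  cmdline = './run_server.sh',
  environment = {'LOG_LEVEL': 'INFO', 'JAVA_HOME': '/usr/lib/jvm/java-8'})
{code}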



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1762) /pendingtasks endpoint should show reason tasks are pending

2016-10-26 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1762?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1762:
-
Assignee: Pradyumna Kaushik

> /pendingtasks endpoint should show reason tasks are pending
> ---
>
> Key: AURORA-1762
> URL: https://issues.apache.org/jira/browse/AURORA-1762
> Project: Aurora
>  Issue Type: Task
>Reporter: David Robinson
>Assignee: Pradyumna Kaushik
>Priority: Minor
>  Labels: newbie
>
> the /pendingtasks endpoint is essentially useless as is, it shows that tasks 
> are pending but doesn't show why. The information is also not easily 
> discovered via the /scheduler UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1805) Enhance `Process` object to allow easier access to environment variables

2016-10-26 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1805:
-
Summary: Enhance `Process` object to allow easier access to environment 
variables  (was: Enhance `Process` object to allow easier access)

> Enhance `Process` object to allow easier access to environment variables
> 
>
> Key: AURORA-1805
> URL: https://issues.apache.org/jira/browse/AURORA-1805
> Project: Aurora
>  Issue Type: Task
>  Components: Thermos
>Reporter: Zameer Manji
>
> The thermos DSL:
> {noformat}
> class Process(Struct):
>   cmdline = Required(String)
>   name= Required(String)
>   # This is currently unused but reserved for future use by Thermos.
>   resources = Resources
>   # optionals
>   max_failures  = Default(Integer, 1)  # maximum number of failed process 
> runs
># before process is failed.
>   daemon= Default(Boolean, False)
>   ephemeral = Default(Boolean, False)
>   min_duration  = Default(Integer, 5)  # integer seconds
>   final = Default(Boolean, False)  # if this process should be a 
> finalizing process
># that should always be run after 
> regular processes
>   logger= Default(Logger, Empty)
> {noformat}
> If we can add a new field:
> {noformat}
> environment = Default(Map(String, String), {})
> {noformat}
> It will make it much easier to add environment variables.
> Right now the solution is to prefix environment variables to the cmdline, 
> which can get janky and frustrating with the string interpolation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1805) Enhance `Process` object to allow easier access

2016-10-26 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1805:
-
Description: 
The thermos DSL:
{noformat}
class Process(Struct):
  cmdline = Required(String)
  name= Required(String)

  # This is currently unused but reserved for future use by Thermos.
  resources = Resources

  # optionals
  max_failures  = Default(Integer, 1)  # maximum number of failed process 
runs
   # before process is failed.
  daemon= Default(Boolean, False)
  ephemeral = Default(Boolean, False)
  min_duration  = Default(Integer, 5)  # integer seconds
  final = Default(Boolean, False)  # if this process should be a 
finalizing process
   # that should always be run after 
regular processes
  logger= Default(Logger, Empty)
{noformat}

If we can add a new field:
{noformat}
environment = Default(Map(String, String), {})
{noformat}

It will make it much easier to add environment variables.

Right now the solution is to prefix environment variables to the cmdline, 
which can get janky and frustrating with the string interpolation.

  was:
The thermos DSL:
{noformat}
class Process(Struct):
  cmdline = Required(String)
  name= Required(String)

  # This is currently unused but reserved for future use by Thermos.
  resources = Resources

  # optionals
  max_failures  = Default(Integer, 1)  # maximum number of failed process 
runs
   # before process is failed.
  daemon= Default(Boolean, False)
  ephemeral = Default(Boolean, False)
  min_duration  = Default(Integer, 5)  # integer seconds
  final = Default(Boolean, False)  # if this process should be a 
finalizing process
   # that should always be run after 
regular processes
  logger= Default(Logger, Empty)
{noformat}

If we can add a new field:
{noformat}
process = Default(Map(String, String), {})
{noformat}

It will make it much easier to add environment variables.

Right now the solution is to prefix environment variables to the cmdline, 
which can get janky and frustrating with the string interpolation.


> Enhance `Process` object to allow easier access
> ---
>
> Key: AURORA-1805
> URL: https://issues.apache.org/jira/browse/AURORA-1805
> Project: Aurora
>  Issue Type: Task
>  Components: Thermos
>Reporter: Zameer Manji
>
> The thermos DSL:
> {noformat}
> class Process(Struct):
>   cmdline = Required(String)
>   name= Required(String)
>   # This is currently unused but reserved for future use by Thermos.
>   resources = Resources
>   # optionals
>   max_failures  = Default(Integer, 1)  # maximum number of failed process 
> runs
># before process is failed.
>   daemon= Default(Boolean, False)
>   ephemeral = Default(Boolean, False)
>   min_duration  = Default(Integer, 5)  # integer seconds
>   final = Default(Boolean, False)  # if this process should be a 
> finalizing process
># that should always be run after 
> regular processes
>   logger= Default(Logger, Empty)
> {noformat}
> If we can add a new field:
> {noformat}
> environment = Default(Map(String, String), {})
> {noformat}
> It will make it much easier to add environment variables.
> Right now the solution is to prefix environment variables to the cmdline, 
> which can get janky and frustrating with the string interpolation.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1802) AttributeAggregate slows down scheduling of jobs with many instances

2016-10-26 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609746#comment-15609746
 ] 

Zameer Manji commented on AURORA-1802:
--

Thanks for the analysis [~StephanErb]!

I think reducing the number of SQL queries would yield the most benefit, but 
we should implement all three of the improvements.

> AttributeAggregate slows down scheduling of jobs with many instances
> 
>
> Key: AURORA-1802
> URL: https://issues.apache.org/jira/browse/AURORA-1802
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Reporter: Stephan Erb
>
> The current implementation of 
> [{{AttributeAggregate}}|https://github.com/apache/aurora/blob/f559e930659e25b3d7cacb7b845ebda50d18d66a/src/main/java/org/apache/aurora/scheduler/filter/AttributeAggregate.java]
>  slows down scheduling of jobs with many instances. Interestingly, this is 
> currently not visible in our job scheduling benchmark results as it only 
> affects the benchmark setup time but not the measured part.
> {{AttributeAggregate}} relies on {{Suppliers.memoize}} to ensure that it is 
> only computed once and only when necessary. This has probably been done 
> because the factory 
> [{{AttributeAggregate.getJobActiveState}}|https://github.com/apache/aurora/blob/f559e930659e25b3d7cacb7b845ebda50d18d66a/src/main/java/org/apache/aurora/scheduler/filter/AttributeAggregate.java#L56-L91]
>  is slow. 
> After some recent changes to schedule multiple task instances per scheduling 
> round the aggregate is computed in each scheduling round via the call 
> [{{resourceRequest.getJobState().updateAttributeAggregate(...)}} 
> |https://github.com/apache/aurora/blob/f559e930659e25b3d7cacb7b845ebda50d18d66a/src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java#L173]
>  in {{TaskAssigner}}. This means the expensive factory is called once per 
> scheduling round.
> h3. Potential improvements
> * the current factory implementation performs one {{fetchTasks}} query 
> followed by {{n}} distinct {{getHostAttributes}} queries. This could be 
> reduced to a single SQL query.
> * the aggregate makes heavy use of {{ImmutableMultiset}} even though it is 
> not immutable any more. There is potential room for improvement here.
> * The aggregate uses suppliers to perform a lazy instantiation even though 
> its current usage is not lazy any more. We can either make the implementation 
> eager, or ensure that the expensive part is only run when absolutely 
> necessary.
> h3. Proof of concept
> * 4 mins 23.407 secs -- total runtime of {{./gradlew jmh 
> -Pbenchmarks='SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark'}}
> * 2 mins 40.308 secs -- total runtime of {{./gradlew jmh 
> -Pbenchmarks='SchedulingBenchmarks.InsufficientResourcesSchedulingBenchmark'}}
>  with [{{resourceRequest.getJobState().updateAttributeAggregate(...)}} 
> |https://github.com/apache/aurora/blob/f559e930659e25b3d7cacb7b845ebda50d18d66a/src/main/java/org/apache/aurora/scheduler/state/TaskAssigner.java#L173]
>  commented out. This works as the call is not necessary when only a single 
> instance is scheduled per scheduling round, as done in the benchmarks.
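
To illustrate the lazy/eager mismatch in a few lines of Python (the real class 
is Java and uses {{Suppliers.memoize}}; names here are illustrative):

{code}
class AttributeAggregateSketch(object):
  """Memoizes an expensive aggregate so it is computed at most once."""

  def __init__(self, compute):
    self._compute = compute  # expensive: task fetch plus n attribute queries
    self._cached = None

  def value(self):
    # Lazy on paper: the expensive computation runs on first access only.
    if self._cached is None:
      self._cached = self._compute()
    return self._cached

  def update(self, host, attributes):
    # But calling update() in every scheduling round forces that first
    # access immediately, so the laziness buys nothing in practice.
    self.value()[host] = attributes
{code}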



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1801) TaskObserver thread stops refreshing after filesystem race condition

2016-10-26 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15609736#comment-15609736
 ] 

Zameer Manji commented on AURORA-1801:
--

I am a big fan of making the process fail if the `TaskObserver` thread fails.

That matches up with patterns elsewhere in the code.

We can also prevent the race condition itself.

> TaskObserver thread stops refreshing after filesystem race condition
> 
>
> Key: AURORA-1801
> URL: https://issues.apache.org/jira/browse/AURORA-1801
> Project: Aurora
>  Issue Type: Bug
>  Components: Observer
>Reporter: Stephan Erb
>
> It seems that a race condition accessing the Mesos filesystem layout can 
> bubble up and terminate the {{TaskObserver}} thread responsible for 
> refreshing the internal data structure of available tasks. Restarting the 
> observer fixes the problem.
> Exception triggering the issue:
> {code}
> Traceback (most recent call last):
>   File 
> "/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.bce9e54ac7cded79a75603fb4e6bcef2c7d1e6bc/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py",
>  line 126, in _excepting_run
> self.__real_run(*args, **kw)
>   File "apache/thermos/observer/task_observer.py", line 135, in run
>   File "apache/thermos/observer/detector.py", line 74, in refresh
>   File "apache/thermos/observer/detector.py", line 58, in _refresh_detectors
>   File "apache/aurora/executor/common/path_detector.py", line 34, in get_paths
>   File "apache/aurora/executor/common/path_detector.py", line 34, in 
>   File "apache/aurora/executor/common/path_detector.py", line 33, in iterate
>   File "/usr/lib/python2.7/posixpath.py", line 376, in realpath
> resolved = _resolve_link(component)
>   File "/usr/lib/python2.7/posixpath.py", line 399, in _resolve_link
> resolved = os.readlink(path)
> OSError: [Errno 2] No such file or directory: 
> '/var/lib/mesos/slaves/0768bcb3-205d-4409-a726-3001ad3ef902-S10/frameworks/20151001-085346-58917130-5050-37976-/executors/thermos-role-env-myname-0-f9fe0318-d39f-49d3-bdf8-e954d5879b33/runs/latest'
> {code}
> Solution space:
> * terminate the observer process if the {{TaskObserver}} thread fails
> * prevent unknown exceptions from aborting the {{TaskObserver}} run loop
> * prevent the observed race condition in {{detector.py}} or 
> {{path_detector.py}}
>   
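
For the third option, the narrow fix is to tolerate paths that vanish between 
listing and resolution; a Python sketch of the idea (the real code lives in 
{{path_detector.py}}):

{code}
import errno
import os

def resolve_checkpoint_roots(candidate_paths):
  # Runs can be garbage-collected between the directory listing and the
  # symlink resolution; treat a vanished 'runs/latest' link as absent
  # rather than fatal.
  for path in candidate_paths:
    try:
      yield os.path.realpath(path)
    except OSError as e:
      if e.errno != errno.ENOENT:
        raise
{code}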



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1380) Upgrade to guice 4.0

2016-10-21 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15595775#comment-15595775
 ] 

Zameer Manji commented on AURORA-1380:
--

The upstream ticket SHIRO-493 has been resolved and an RC/release of shiro 1.4 
is coming soon.

We will be able to close this ticket then.

> Upgrade to guice 4.0
> 
>
> Key: AURORA-1380
> URL: https://issues.apache.org/jira/browse/AURORA-1380
> Project: Aurora
>  Issue Type: Story
>  Components: Scheduler
>Reporter: Kevin Sweeney
>Priority: Critical
>
> Guice 4.0 has been released. Among the new features, probably the most 
> significant is Java 8 support - in Guice 3.0 stack traces are obfuscated by 
> https://github.com/google/guice/issues/757. As our code expands use of 
> lambdas and method references this will become even more critical.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1799) Thermos does not handle low memory scenarios gracefully

2016-10-18 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1799:


 Summary: Thermos does not handle low memory scenarios gracefully
 Key: AURORA-1799
 URL: https://issues.apache.org/jira/browse/AURORA-1799
 Project: Aurora
  Issue Type: Bug
Reporter: Zameer Manji


Background:
In an environment where Aurora is used to launch Docker containers via the 
DockerContainerizer, it was observed that some tasks would not be killed.

What happened is that a task was allocated a low amount of memory but demanded 
a lot. This caused the Linux OOM killer to be invoked. Unlike with the 
MesosContainerizer, the agent doesn't tear down the container when the OOM 
killer is invoked. Instead, the OOM killer just kills a process in the 
container and thermos and mesos are unaware (unless a process directly 
launched by thermos is killed).

I observed in the scheduler logs that the scheduler was trying to kill a 
container every reconciliation period but it never died. The slave had logs 
indicating it received the killTask RPC and forwarded it to Thermos.

The thermos logs had several entries like this roughly every hour:
{noformat}
I1018 20:39:18.102894 6 executor_base.py:45] Executor 
[aaeac4c8-2b2f-4351-874b-a16bea1b36b0-S147]: Activating kill manager.
I1018 20:39:18.103034 6 executor_base.py:45] Executor 
[aaeac4c8-2b2f-4351-874b-a16bea1b36b0-S147]: killTask returned.
I1018 21:39:17.859935 6 executor_base.py:45] Executor 
[aaeac4c8-2b2f-4351-874b-a16bea1b36b0-S147]: killTask got task_id: value: 
""
{noformat}

However, the task was never killed. Looking at the stderr of thermos I saw the 
following entries:
{noformat}
Logged from file resource.py, line 155
Traceback (most recent call last):
  File "/usr/lib/python2.7/logging/__init__.py", line 883, in emit
self.flush()
  File "/usr/lib/python2.7/logging/__init__.py", line 843, in flush
self.stream.flush()
IOError: [Errno 12] Cannot allocate memory
{noformat}

and 
{noformat}
Logged from file thermos_task_runner.py, line 171
Traceback (most recent call last):
  File 
"/root/.pex/install/twitter.common.exceptions-0.3.3-py2-none-any.whl.2a67b833b1517d179ef1c8dc6f2dac1023d51e3c/twitter.common.exceptions-0.3.3-py2-none-any.whl/twitter/common/exceptions/__init__.py",
 line 126, in _excepting_run

  File "apache/aurora/executor/status_manager.py", line 47, in run
  File "apache/aurora/executor/common/status_checker.py", line 97, in status
  File "apache/aurora/executor/thermos_task_runner.py", line 358, in status
  File "apache/aurora/executor/thermos_task_runner.py", line 186, in 
compute_status
  File "apache/aurora/executor/thermos_task_runner.py", line 136, in task_state
  File "apache/thermos/monitoring/monitor.py", line 118, in task_state
  File "apache/thermos/monitoring/monitor.py", line 114, in get_state
  File "apache/thermos/monitoring/monitor.py", line 77, in _apply_states
  File 
"/root/.pex/install/twitter.common.recordio-0.3.3-py2-none-any.whl.9f1e9394eca1bc33ad7d10ae3025301866824139/twitter.common.recordio-0.3.3-py2-none-any.whl/twitter/common/recordio/recordio.py",
 line 182, in try_read
class InvalidTypeException(Error): pass
  File 
"/root/.pex/install/twitter.common.recordio-0.3.3-py2-none-any.whl.9f1e9394eca1bc33ad7d10ae3025301866824139/twitter.common.recordio-0.3.3-py2-none-any.whl/twitter/common/recordio/recordio.py",
 line 168, in read
return RecordIO.Reader.do_read(self._fp, self._codec)
  File 
"/root/.pex/install/twitter.common.recordio-0.3.3-py2-none-any.whl.9f1e9394eca1bc33ad7d10ae3025301866824139/twitter.common.recordio-0.3.3-py2-none-any.whl/twitter/common/recordio/recordio.py",
 line 135, in do_read
header = fp.read(RecordIO.RECORD_HEADER_SIZE)
  File 
"/root/.pex/install/twitter.common.recordio-0.3.3-py2-none-any.whl.9f1e9394eca1bc33ad7d10ae3025301866824139/twitter.common.recordio-0.3.3-py2-none-any.whl/twitter/common/recordio/filelike.py",
 line 81, in read
return self._fp.read(length)
IOError: [Errno 12] Cannot allocate memory
{noformat}

It seems that through the regular avenues of reading checkpoints or logging 
data, thermos would get an IOError. Some part of twitter common installs an 
excepthook to log the exception, but we don't seem to do anything else.

I think we should probably install our own exception hook to send a 
{{TASK_LOST}} status with the exception information instead of failing to kill 
the task.
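
A rough sketch of such a hook in Python ({{send_status_update}} and its wiring 
are hypothetical; the point is to surface the failure as a terminal status 
instead of only logging it):

{code}
import sys
import traceback

def install_task_lost_hook(send_status_update, task_id):
  def hook(exc_type, exc_value, exc_tb):
    # Report the uncaught failure to Mesos as TASK_LOST, including the
    # traceback, so the scheduler can reschedule instead of endlessly
    # retrying killTask.
    message = ''.join(traceback.format_exception(exc_type, exc_value, exc_tb))
    send_status_update(task_id, 'TASK_LOST', message)
  sys.excepthook = hook
{code}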



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (AURORA-1795) Internal server error in scheduler Thrift API on missing Content-Type

2016-10-17 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1795?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji reassigned AURORA-1795:


Assignee: Zameer Manji

> Internal server error in scheduler Thrift API on missing Content-Type
> -
>
> Key: AURORA-1795
> URL: https://issues.apache.org/jira/browse/AURORA-1795
> Project: Aurora
>  Issue Type: Bug
>  Components: Scheduler
>Affects Versions: 0.16.0
>Reporter: Stephan Erb
>Assignee: Zameer Manji
>
> This happens if a user has a very old browser, e.g. Firefox 41.
> {code}
> I1017 09:38:15.618 [qtp1426166274-44336, Slf4jRequestLog:60] 10.x.x.x - - 
> [17/Oct/2016:09:38:15 +] "POST //foobar.example.org/api HTTP/1.1" 200 794
> W1017 09:38:15.627 [qtp1426166274-44066, ServletHandler:631] /api 
> java.lang.NullPointerException: null
> at java.util.Objects.requireNonNull(Objects.java:203) 
> ~[na:1.8.0-internal]
> at java.util.Optional.(Optional.java:96) ~[na:1.8.0-internal]
> at java.util.Optional.of(Optional.java:108) ~[na:1.8.0-internal]
> at 
> org.apache.aurora.scheduler.http.api.TContentAwareServlet.doPost(TContentAwareServlet.java:123)
>  ~[aurora-0.16.0.jar:na]
> at 
> org.apache.aurora.scheduler.http.api.TContentAwareServlet.doGet(TContentAwareServlet.java:164)
>  ~[aurora-0.16.0.jar:na]
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:687) 
> ~[javax.servlet-api-3.1.0.jar:3.1.0]
> at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) 
> ~[javax.servlet-api-3.1.0.jar:3.1.0]
> at 
> com.google.inject.servlet.ServletDefinition.doService(ServletDefinition.java:263)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.ServletDefinition.service(ServletDefinition.java:178)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.ManagedServletPipeline.service(ManagedServletPipeline.java:91)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:62)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> org.apache.aurora.scheduler.http.LeaderRedirectFilter.doFilter(LeaderRedirectFilter.java:72)
>  ~[aurora-0.16.0.jar:na]
> at 
> org.apache.aurora.scheduler.http.AbstractFilter.doFilter(AbstractFilter.java:44)
>  ~[aurora-0.16.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> org.apache.aurora.scheduler.http.HttpStatsFilter.doFilter(HttpStatsFilter.java:71)
>  ~[aurora-0.16.0.jar:na]
> at 
> org.apache.aurora.scheduler.http.AbstractFilter.doFilter(AbstractFilter.java:44)
>  ~[aurora-0.16.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:163)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterDefinition.doFilter(FilterDefinition.java:168)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.FilterChainInvocation.doFilter(FilterChainInvocation.java:58)
>  ~[guice-servlet-3.0.jar:na]
> at 
> com.google.inject.servlet.ManagedFilterPipeline.dispatch(ManagedFilterPipeline.java:118)
> {code}
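
A hedged sketch of a defensive fix for the NPE above (assumptions: the null reaches {{Optional.of}} through a missing {{Content-Type}} header, which very old browsers may omit on POST; the class and handling below are illustrative and are not Aurora's actual patch):

{code:java}
import java.io.IOException;
import java.util.Optional;

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class ContentTypeGuardSketch extends HttpServlet {
  @Override
  protected void doPost(HttpServletRequest req, HttpServletResponse resp)
      throws IOException {
    // getContentType() returns null when the client omits the header.
    // Optional.of(null) throws the NullPointerException seen in the trace;
    // Optional.ofNullable lets us reject the request with a 400 instead of
    // failing the whole request pipeline with a 500.
    Optional<String> contentType = Optional.ofNullable(req.getContentType());
    if (!contentType.isPresent()) {
      resp.sendError(HttpServletResponse.SC_BAD_REQUEST,
          "Missing Content-Type header");
      return;
    }
    // ... dispatch to the protocol handler matching contentType.get() ...
  }
}
{code}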

[jira] [Commented] (AURORA-1796) Several JMH microbenchmarks are failing

2016-10-17 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15583606#comment-15583606
 ] 

Zameer Manji commented on AURORA-1796:
--

This is a Guice binding error that is obscured by the fact that we are on JDK 8 
but not yet on Guice 4.0.
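
A minimal sketch of the failure mode at play (an assumed repro, not taken from the Aurora codebase): Guice 3.0 bundles an ASM that cannot parse Java 8 class files, so the moment Guice tries to format a binding-error message it fails inside {{$ClassReader}} and the real {{CreationException}} is masked. Guice 4.0 ships a newer ASM and reports the underlying binding error cleanly.

{code:java}
import com.google.inject.AbstractModule;
import com.google.inject.Guice;

// Hypothetical repro: Foo depends on Bar, but Bar is never bound.
interface Bar {}

class Foo {
  @javax.inject.Inject
  Foo(Bar bar) {}
}

public class MaskedBindingError {
  public static void main(String[] args) {
    // With Guice 4.x on JDK 8 this fails with a clear CreationException
    // ("No implementation for Bar was bound"). With Guice 3.0 on JDK 8,
    // rendering that message makes the embedded ASM read Java 8 bytecode
    // it does not understand, so the useful error is buried under an
    // ArrayIndexOutOfBoundsException like the one in the trace below.
    Guice.createInjector(new AbstractModule() {
      @Override
      protected void configure() {
        bind(Foo.class);
      }
    });
  }
}
{code}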

> Several JMH microbenchmarks are failing
> ---
>
> Key: AURORA-1796
> URL: https://issues.apache.org/jira/browse/AURORA-1796
> Project: Aurora
>  Issue Type: Bug
>Reporter: Stephan Erb
>
> In the context of https://reviews.apache.org/r/52921/ I tried to run our 
> micro benchmarks:
> * {{UpdateStoreBenchmarks}} seems to work as expected
> * {{StatusUpdateBenchmark}} seems to work as expected
> * {{TaskStoreBenchmarks}} seems to work as expected. However, the 
> ops/sec for the h2 based tests seem to be off by a great margin. 
> * {{SchedulingBenchmarks}} seems to take forever. I aborted after 4 hours. 
> * {{SnapshotBenchmarks}} fails with the exception below 
> * {{ThriftApiBenchmarks}} fails with the exception below 
> This ticket is about the last two failing benchmarks.  The following 
> exception is written for each benchmark, indicating a problem in guice:
> {code}
> com.google.inject.internal.util.$ComputationException: 
> java.lang.ArrayIndexOutOfBoundsException: 44204
>   at 
> com.google.inject.internal.util.$MapMaker$StrategyImpl.compute(MapMaker.java:553)
>   at 
> com.google.inject.internal.util.$MapMaker$StrategyImpl.compute(MapMaker.java:419)
>   at 
> com.google.inject.internal.util.$CustomConcurrentHashMap$ComputingImpl.get(CustomConcurrentHashMap.java:2041)
>   at 
> com.google.inject.internal.util.$StackTraceElements.forMember(StackTraceElements.java:53)
>   at 
> com.google.inject.internal.Errors.formatInjectionPoint(Errors.java:716)
>   at 
> com.google.inject.internal.Errors.formatSource(Errors.java:678)
>   at com.google.inject.internal.Errors.format(Errors.java:555)
>   at 
> com.google.inject.CreationException.getMessage(CreationException.java:48)
>   at java.lang.Throwable.getLocalizedMessage(Throwable.java:391)
>   at java.lang.Throwable.toString(Throwable.java:480)
>   at java.lang.Throwable.<init>(Throwable.java:311)
>   at java.lang.Exception.<init>(Exception.java:102)
>   at java.lang.RuntimeException.<init>(RuntimeException.java:96)
>   at 
> org.openjdk.jmh.runner.BenchmarkException.<init>(BenchmarkException.java:34)
>   at 
> org.openjdk.jmh.runner.BenchmarkHandler.runIteration(BenchmarkHandler.java:438)
>   at 
> org.openjdk.jmh.runner.BaseRunner.runBenchmark(BaseRunner.java:263)
>   at 
> org.openjdk.jmh.runner.BaseRunner.runBenchmark(BaseRunner.java:235)
>   at 
> org.openjdk.jmh.runner.BaseRunner.doSingle(BaseRunner.java:142)
>   at 
> org.openjdk.jmh.runner.BaseRunner.runBenchmarksForked(BaseRunner.java:76)
>   at org.openjdk.jmh.runner.ForkedRunner.run(ForkedRunner.java:72)
>   at org.openjdk.jmh.runner.ForkedMain.main(ForkedMain.java:84)
> Caused by: java.lang.ArrayIndexOutOfBoundsException: 44204
>   at com.google.inject.internal.asm.$ClassReader.<init>(Unknown 
> Source)
>   at com.google.inject.internal.asm.$ClassReader.<init>(Unknown 
> Source)
>   at com.google.inject.internal.asm.$ClassReader.<init>(Unknown 
> Source)
>   at 
> com.google.inject.internal.util.$LineNumbers.<init>(LineNumbers.java:62)
>   at 
> com.google.inject.internal.util.$StackTraceElements$1.apply(StackTraceElements.java:36)
>   at 
> com.google.inject.internal.util.$StackTraceElements$1.apply(StackTraceElements.java:33)
>   at 
> com.google.inject.internal.util.$MapMaker$StrategyImpl.compute(MapMaker.java:549)
>   ... 20 more
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1789) Incorrect --mesos_containerizer_path value results in thermos failure loop

2016-10-12 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570354#comment-15570354
 ] 

Zameer Manji commented on AURORA-1789:
--

I have updated the title and assignee to reflect reality. Thanks for 
investigating and self-serving, [~jpinkul]!

> Incorrect --mesos_containerizer_path value results in thermos failure loop
> --
>
> Key: AURORA-1789
> URL: https://issues.apache.org/jira/browse/AURORA-1789
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Justin Pinkul
>
> When using the Mesos containerizer with namespaces/pid isolator and a Docker 
> image the Thermos executor is unable to launch processes. The executor tries 
> to fork the process then is unable to locate the process after the fork.
> {code:title=thermos_runner.INFO}
> I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789782.842882)
> I1006 21:37:22.931456 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1144] completed.
> I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789842.935872)
> I1006 21:38:23.025332 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1157] completed.
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789903.029694)
> I1006 21:39:23.118841 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1170] completed.
> I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789963.123206)
> I1006 21:40:23.212711 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1183] completed.
> I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790023.21709)
> I1006 21:41:23.307157 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1196] completed.
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790083.311512)
> I1006 21:42:23.399893 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1209] completed.
> I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1789) Incorrect --mesos_containerizer_path value results in thermos failure loop

2016-10-12 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1789:
-
Summary: Incorrect --mesos_containerizer_path value results in thermos 
failure loop  (was: namespaces/pid isolator causes lost process)

> Incorrect --mesos_containerizer_path value results in thermos failure loop
> --
>
> Key: AURORA-1789
> URL: https://issues.apache.org/jira/browse/AURORA-1789
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Zameer Manji
>
> When using the Mesos containerizer with namespaces/pid isolator and a Docker 
> image the Thermos executor is unable to launch processes. The executor tries 
> to fork the process then is unable to locate the process after the fork.
> {code:title=thermos_runner.INFO}
> I1006 21:36:22.842595 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:37:22.929864 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=205, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1144, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789782.842882)
> I1006 21:37:22.931456 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1144] completed.
> I1006 21:37:22.931732 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:37:22.935580 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:38:23.023725 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=208, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1157, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789842.935872)
> I1006 21:38:23.025332 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1157] completed.
> I1006 21:38:23.025629 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:38:23.029414 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:39:23.117208 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=211, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1170, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789903.029694)
> I1006 21:39:23.118841 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1170] completed.
> I1006 21:39:23.119134 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:39:23.122920 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:40:23.211095 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=214, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1183, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475789963.123206)
> I1006 21:40:23.212711 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1183] completed.
> I1006 21:40:23.213006 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:40:23.216810 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:41:23.305505 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=217, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1196, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790023.21709)
> I1006 21:41:23.307157 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1196] completed.
> I1006 21:41:23.307450 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> I1006 21:41:23.311230 75 runner.py:865] Forking Process(BigBrother start)
> I1006 21:42:23.398277 75 runner.py:825] Detected a LOST task: 
> ProcessStatus(seq=220, process=u'BigBrother start', start_time=None, 
> coordinator_pid=1209, pid=None, return_code=None, state=1, stop_time=None, 
> fork_time=1475790083.311512)
> I1006 21:42:23.399893 75 helper.py:153]   Coordinator BigBrother start [pid: 
> 1209] completed.
> I1006 21:42:23.400185 75 runner.py:133] Process BigBrother start had an 
> abnormal termination
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1785) Populate curator latches with scheduler information

2016-10-12 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15570275#comment-15570275
 ] 

Zameer Manji commented on AURORA-1785:
--

I don't think it's "too much"; it is exactly what the leader would advertise.

> Populate curator latches with scheduler information
> ---
>
> Key: AURORA-1785
> URL: https://issues.apache.org/jira/browse/AURORA-1785
> Project: Aurora
>  Issue Type: Task
>Reporter: Zameer Manji
>Assignee: Jing Chen
>Priority: Minor
>  Labels: newbie
>
> If you look at the mesos ZK node for leader election you see something like 
> this:
> {noformat}
>  u'json.info_000104',
>  u'json.info_000102',
>  u'json.info_000101',
>  u'json.info_98',
>  u'json.info_97'
> {noformat}
> Each of these nodes contains data about the machine contending for 
> leadership. It is a JSON serialized {{MasterInfo}} protobuf. This means an 
> operator can inspect who is contending for leadership by checking the content 
> of the nodes.
> When you check the aurora ZK node you see something like this:
> {noformat}
>  u'_c_2884a0d3-b5b0-4445-b8d6-b271a6df6220-latch-000774',
>  u'_c_86a21335-c5a2-4bcb-b471-4ce128b67616-latch-000776',
>  u'_c_a4f8b0f7-d063-4df2-958b-7b3e6f666a95-latch-000775',
>  u'_c_120cd9da-3bc1-495b-b02f-2142fb22c0a0-latch-000784',
>  u'_c_46547c31-c5c2-4fb1-8a53-237e3cb0292f-latch-000780',
>  u'member_000781'
> {noformat}
> Only the leader node contains information. The curator latches contain no 
> information. It is not possible to figure out which machines are contending 
> for leadership purely from ZK.
> I think we should attach data to the latches like mesos.
> Being able to do this is invaluable to debug issues if an extra master is 
> added to the cluster.
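
A sketch of what attaching data could look like (assumed usage of Curator's public API; the payload below is hypothetical): {{LeaderLatch}} accepts a participant id that is stored in the latch znode and surfaced through {{getParticipants()}}, so each contender could publish the same JSON it would advertise as leader.

{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.Participant;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LatchInfoSketch {
  public static void main(String[] args) throws Exception {
    CuratorFramework client = CuratorFrameworkFactory.newClient(
        "zk1:2181", new ExponentialBackoffRetry(1000, 3));
    client.start();

    // Hypothetical payload: the scheduler's advertised endpoint, playing
    // the role that the JSON-serialized MasterInfo plays for Mesos.
    String schedulerInfo =
        "{\"serviceEndpoint\":{\"host\":\"scheduler1.example.com\",\"port\":8081}}";

    // The third constructor argument becomes the participant id stored in
    // the latch node, so every contender is identifiable from ZK alone.
    LeaderLatch latch = new LeaderLatch(client, "/aurora/scheduler", schedulerInfo);
    latch.start();

    for (Participant p : latch.getParticipants()) {
      System.out.println(p.getId() + (p.isLeader() ? " (leader)" : ""));
    }
  }
}
{code}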



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1792) Executor does not log full task information.

2016-10-11 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1792:


 Summary: Executor does not log full task information.
 Key: AURORA-1792
 URL: https://issues.apache.org/jira/browse/AURORA-1792
 Project: Aurora
  Issue Type: Bug
Reporter: Zameer Manji


I launched a task that has an {{initial_interval_secs}} in the health check 
config. However the log contains no information about this field:

{noformat}
$ grep "initial_interval_secs" __main__.log
{noformat}

We should log the entire ExecutorInfo blob.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-11 Thread Zameer Manji (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zameer Manji updated AURORA-1791:
-
Description: 
The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
 is not backwards compatible. The last section of the commit 

{quote}
4. Modified the Health Checker and redefined the meaning initial_interval_secs.
{quote}

has serious, unintended consequences.

Consider the following health check config:
{noformat}
  initial_interval_secs: 10
  interval_secs: 5
  max_consecutive_failures: 1
{noformat}

On the 0.16.0 executor, no health checking will occur for the first 10 seconds. 
Here the earliest a task can cause failure is at the 10th second.

On master, health checking starts right away which means the task can fail at 
the first second since {{max_consecutive_failures}} is set to 1.

This is not backwards compatible and needs to be fixed.

I think a good solution would be to revert the meaning change to 
initial_interval_secs and have the task transition into RUNNING when 
{{max_consecutive_successes}} is met.

An investigation shows {{initial_interval_secs}} was set to 5 but the task 
failed health checks right away:

{noformat}
D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
Performing health check.
D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
counter.
D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
consecutive successes.
{noformat}


  was:
The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
 is not backwards compatible. The last section of the commit 

{quote}
4. Modified the Health Checker and redefined the meaning initial_interval_secs.
{quote}

has serious, unintended consequences.

Consider the following health check config:
{noformat}
  initial_interval_secs: 10
  interval_secs: 5
  max_consecutive_failures: 1
{noformat}

On the 0.16.0 executor, no health checking will occur for the first 10 seconds. 
Here the earliest a task can cause failure is at the 10th second.

On master, health checking starts right away which means the task can fail at 
the first second since {{max_consecutive_failures}} is set to 1.

This is not backwards compatible and needs to be fixed.

I think a good solution would be to revert the meaning change to 
initial_interval_secs and have the task transition into RUNNING when 
{{max_consecutive_successes}} is met.



> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.
> An investigation shows {{initial_interval_secs}} was set to 5 but the task 
> failed health checks right away:
> {noformat}
> D1011 19:52:13.295877 6 health_checker.py:107] Health checks enabled. 
> Performing health check.
> D1011 19:52:13.306816 6 health_checker.py:126] Reset consecutive failures 
> counter.
> D1011 19:52:13.307032 6 health_checker.py:132] Initial interval expired.
> W1011 19:52:13.307130 6 health_checker.py:135] Failed to reach minimum 
> consecutive successes.
> {noformat}
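
To make the incompatibility concrete, here is a small timing sketch (illustrative Java; the real health checker is Python, and the arithmetic is an assumption-laden simplification of the two semantics):

{code:java}
public class HealthCheckTiming {
  /**
   * Earliest second at which a task can be killed for failing health checks.
   * Under 0.16.0 semantics no checks run during the initial interval; under
   * the post-ca683 semantics checking starts at t=0.
   */
  static int earliestFailureSecs(boolean checksDeferred,
                                 int initialIntervalSecs,
                                 int intervalSecs,
                                 int maxConsecutiveFailures) {
    int firstCheck = checksDeferred ? initialIntervalSecs : 0;
    // Each additional consecutive failure costs one more check interval.
    return firstCheck + (maxConsecutiveFailures - 1) * intervalSecs;
  }

  public static void main(String[] args) {
    // initial_interval_secs: 10, interval_secs: 5, max_consecutive_failures: 1
    System.out.println("0.16.0 executor: t=" + earliestFailureSecs(true, 10, 5, 1));  // t=10
    System.out.println("master (ca683):  t=" + earliestFailureSecs(false, 10, 5, 1)); // t=0
  }
}
{code}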



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-11 Thread Zameer Manji (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1557#comment-1557
 ] 

Zameer Manji commented on AURORA-1791:
--

Note: I could be wrong here, but this was deployed to a cluster and tasks that 
were healthy before started to fail.

> Commit ca683 is not backwards compatible.
> -
>
> Key: AURORA-1791
> URL: https://issues.apache.org/jira/browse/AURORA-1791
> Project: Aurora
>  Issue Type: Bug
>Reporter: Zameer Manji
>Assignee: Kai Huang
>Priority: Blocker
>
> The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
> https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
>  is not backwards compatible. The last section of the commit 
> {quote}
> 4. Modified the Health Checker and redefined the meaning 
> initial_interval_secs.
> {quote}
> has serious, unintended consequences.
> Consider the following health check config:
> {noformat}
>   initial_interval_secs: 10
>   interval_secs: 5
>   max_consecutive_failures: 1
> {noformat}
> On the 0.16.0 executor, no health checking will occur for the first 10 
> seconds. Here the earliest a task can cause failure is at the 10th second.
> On master, health checking starts right away which means the task can fail at 
> the first second since {{max_consecutive_failures}} is set to 1.
> This is not backwards compatible and needs to be fixed.
> I think a good solution would be to revert the meaning change to 
> initial_interval_secs and have the task transition into RUNNING when 
> {{max_consecutive_successes}} is met.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (AURORA-1791) Commit ca683 is not backwards compatible.

2016-10-11 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1791:


 Summary: Commit ca683 is not backwards compatible.
 Key: AURORA-1791
 URL: https://issues.apache.org/jira/browse/AURORA-1791
 Project: Aurora
  Issue Type: Bug
Reporter: Zameer Manji
Assignee: Kai Huang
Priority: Blocker


The commit [ca683cb9e27bae76424a687bc6c3af5a73c501b9 | 
https://github.com/apache/aurora/commit/ca683cb9e27bae76424a687bc6c3af5a73c501b9]
 is not backwards compatible. The last section of the commit 

{quote}
4. Modified the Health Checker and redefined the meaning initial_interval_secs.
{quote}

has serious, unintended consequences.

Consider the following health check config:
{noformat}
  initial_interval_secs: 10
  interval_secs: 5
  max_consecutive_failures: 1
{noformat}

On the 0.16.0 executor, no health checking will occur for the first 10 seconds. 
Here the earliest a task can cause failure is at the 10th second.

On master, health checking starts right away which means the task can fail at 
the first second since {{max_consecutive_failures}} is set to 1.

This is not backwards compatible and needs to be fixed.

I think a good solution would be to revert the meaning change to 
initial_interval_secs and have the task transition into RUNNING when 
{{max_consecutive_successes}} is met.




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

