[jira] [Commented] (MESOS-2246) Improve slave health-checking

2015-06-16 Thread Joe Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14588361#comment-14588361
 ] 

Joe Smith commented on MESOS-2246:
--

[~vinodkone] [~jieyu] given the tickets in this epic are completed, can this be 
resolved?

 Improve slave health-checking
 -

 Key: MESOS-2246
 URL: https://issues.apache.org/jira/browse/MESOS-2246
 Project: Mesos
  Issue Type: Epic
  Components: master, slave
Reporter: Dominic Hamon

 In the event of a network partition, or other systemic issues, we may see  
 widespread slave removal. There are several approaches we can take to 
 mitigate this issue including, but not limited to:
 * rate-limit the slave removal
 * change how we do health checking to not rely on a single point of view
 * work with frameworks to determine the SLA of running services before 
 removing the slave
 * manual control to allow operator intervention 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2755) Slave should perform a diff in logs when incompatible SlaveInfo is detected

2015-05-20 Thread Joe Smith (JIRA)
Joe Smith created MESOS-2755:


 Summary: Slave should perform a diff in logs when incompatible 
SlaveInfo is detected
 Key: MESOS-2755
 URL: https://issues.apache.org/jira/browse/MESOS-2755
 Project: Mesos
  Issue Type: Story
  Components: slave
Reporter: Joe Smith


When diagnosing slaves with attribute changes, it'd be super helpful for the 
logs to contain a diff which displayed the changes in the {{SlaveInfo}} instead 
of only printing out the old and new contents of {{SlaveInfo}}.
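For illustration, a field-by-field diff could look like the sketch below. Plain Python dicts stand in for the {{SlaveInfo}} protobuf, and every name here is hypothetical rather than the actual Mesos API:

```python
def slave_info_diff(old, new):
    """Return human-readable change lines between two SlaveInfo-like
    dicts (stand-ins for the protobuf message fields)."""
    lines = []
    for field in sorted(set(old) | set(new)):
        before = old.get(field, "<unset>")
        after = new.get(field, "<unset>")
        if before != after:
            lines.append("%s: %s -> %s" % (field, before, after))
    return lines

# An attribute change that would otherwise require eyeballing two
# full SlaveInfo dumps side by side.
old = {"hostname": "host1", "attributes": "rack:a", "cpus": 16}
new = {"hostname": "host1", "attributes": "rack:b", "cpus": 16}
print("\n".join(slave_info_diff(old, new)))  # attributes: rack:a -> rack:b
```

A real implementation could walk the protobuf reflection API instead of dict keys, but the log output shape would be the same.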





[jira] [Commented] (MESOS-1739) Allow slave reconfiguration on restart

2015-05-09 Thread Joe Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14536242#comment-14536242
 ] 

Joe Smith commented on MESOS-1739:
--

Howdy all,

What's the status of this? This change would greatly increase flexibility for 
us operators!

Thanks,
Joe

 Allow slave reconfiguration on restart
 --

 Key: MESOS-1739
 URL: https://issues.apache.org/jira/browse/MESOS-1739
 Project: Mesos
  Issue Type: Epic
Reporter: Patrick Reilly
Assignee: Cody Maloney

 Make it so that, either via a slave restart or an out-of-process reconfigure 
 ping, the attributes and resources of a slave can be updated to a superset 
 of what they used to be.





[jira] [Commented] (MESOS-2676) slave recovery always fails when resources change

2015-04-30 Thread Joe Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14522498#comment-14522498
 ] 

Joe Smith commented on MESOS-2676:
--

[~vinodkone] made [a compelling explanation 
previously|http://mail-archives.apache.org/mod_mbox/mesos-user/201406.mbox/%3ccaakwvazphphzfhnbr46amfxwg3bynppbn+jl4wc8qpor9cs...@mail.gmail.com%3E], 
but I don't think we should let this go.

What if the resources increase by (for example) 1.0 CPU? Surely that's a safe 
change, and allowing changes of this nature would greatly increase operator 
flexibility.

 slave recovery always fails when resources change
 -

 Key: MESOS-2676
 URL: https://issues.apache.org/jira/browse/MESOS-2676
 Project: Mesos
  Issue Type: Improvement
Reporter: David Robinson

 Slave recovery fails whenever --resources is changed. Ideally recovery would 
 only fail if --resources has changed _and_ the still-executing tasks no 
 longer fit within the new --resources range. Increasing resources should 
 always be allowed. For example, if a slave was started with 
 --resources=cpus:15, then the slave was restarted w/ --resources=cpus:16, the 
 slave should start successfully. Same for mem, ports, disk and 
 ephemeral_ports.
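The rule described above, that recovery should succeed whenever the new resources contain the old ones, can be sketched as a simple containment check. The flat-dict resource representation below is an assumption for illustration, not the actual Mesos {{Resources}} API:

```python
def is_compatible(old, new):
    """True if every resource in `old` is present in `new` with at
    least the same amount, i.e. `new` is a superset of `old`."""
    return all(new.get(name, 0) >= amount for name, amount in old.items())

# --resources=cpus:15 restarted with --resources=cpus:16 -> recovery ok.
assert is_compatible({"cpus": 15, "mem": 65536}, {"cpus": 16, "mem": 65536})
# Shrinking cpus below what running tasks may hold -> recovery fails.
assert not is_compatible({"cpus": 16}, {"cpus": 15})
```

Range resources like ports would need a proper subset check per range rather than a scalar comparison, but the principle is the same.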





[jira] [Created] (MESOS-2367) Improve slave resiliency in the face of orphan containers

2015-02-17 Thread Joe Smith (JIRA)
Joe Smith created MESOS-2367:


 Summary: Improve slave resiliency in the face of orphan containers 
 Key: MESOS-2367
 URL: https://issues.apache.org/jira/browse/MESOS-2367
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Joe Smith


Right now there's a case where a misbehaving executor can cause a slave process 
to flap:

{panel:title=Quote From [~jieyu]}
{quote}
1) User tries to kill an instance
2) Slave sends {{KillTaskMessage}} to executor
3) Executor sends kill signals to task processes
4) Executor sends {{TASK_KILLED}} to slave
5) Slave updates container cpu limit to be 0.01 cpus
6) A user-process is still processing the kill signal
7) the task process cannot exit since it has too little cpu share and is 
throttled
8) Executor itself terminates
9) Slave tries to destroy the container, but cannot because the user-process is 
stuck in the exit path.
10) Slave restarts, and is constantly flapping because it cannot kill orphan 
containers
{quote}
{panel}

The slave's orphan container handling should be improved to deal with this case 
despite ill-behaved users (framework writers).
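One possible mitigation (an assumption on my part, not a committed design) is for the destroy path to restore a workable CPU limit before waiting on throttled processes, so a task stuck at 0.01 cpus (step 5 above) can actually run its signal handlers and exit. A toy sketch with hypothetical names:

```python
def destroy_container(container, min_destroy_cpus=1.0):
    """Sketch: before killing remaining processes, bump the container's
    cpu limit back up so a throttled task can finish exiting instead of
    wedging the destroy path. `container` is a toy dict model."""
    if container["cpu_limit"] < min_destroy_cpus:
        container["cpu_limit"] = min_destroy_cpus
    container["state"] = "destroying"
    return container

c = destroy_container({"cpu_limit": 0.01, "state": "running"})
assert c["cpu_limit"] == 1.0 and c["state"] == "destroying"
```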





[jira] [Commented] (MESOS-1474) Provide cluster maintenance primitives for operators.

2014-09-09 Thread Joe Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14127701#comment-14127701
 ] 

Joe Smith commented on MESOS-1474:
--

The doc looks great! Thanks for all the hard work and rigor!

 Provide cluster maintenance primitives for operators.
 -

 Key: MESOS-1474
 URL: https://issues.apache.org/jira/browse/MESOS-1474
 Project: Mesos
  Issue Type: Epic
  Components: framework, master, slave
Reporter: Benjamin Mahler

 Normally cluster upgrades can be done seamlessly using the built-in slave 
 recovery feature. However, there are situations where operators want to be 
 able to perform destructive maintenance operations on machines:
 * Non-recoverable slave upgrades.
 * Machine reboots.
 * Kernel upgrades.
 * Machine decommissioning.
 * etc.
 In these situations, best practice is to perform rolling maintenance in large 
 batches of machines. This can be problematic for frameworks when many related 
 tasks are located within a batch of machines going for maintenance.
 There are a few primitives of interest here:
 * Provide a way for operators to fully shutdown a slave (killing all tasks 
 underneath it).
 * Provide a way for operators to mark specific slaves as undergoing 
 maintenance. This means that no more offers are being sent for these slaves, 
 and no new tasks will launch on them.
 * Provide a way for frameworks to be notified when resources are requested to 
 be relinquished. This gives the framework a chance to proactively move a task 
 before it is forcibly killed. It also allows the automation of operations 
 like: please drain these slaves within 1 hour.
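The second primitive above (suppress offers for slaves under maintenance, with a drain deadline) can be modeled in a few lines. All names are illustrative, not the eventual Mesos API:

```python
import time

class MaintenanceTracker:
    """Toy model: marking a slave as draining suppresses offers for it
    and records a relinquish deadline that frameworks could be told
    about ("drain within 1 hour")."""

    def __init__(self):
        self.draining = {}  # slave_id -> deadline (epoch seconds)

    def start_drain(self, slave_id, grace_secs):
        self.draining[slave_id] = time.time() + grace_secs

    def should_offer(self, slave_id):
        # No new offers for slaves undergoing maintenance.
        return slave_id not in self.draining

    def overdue(self, now=None):
        # Slaves whose grace period has elapsed and may be force-drained.
        now = time.time() if now is None else now
        return [s for s, d in self.draining.items() if d <= now]

tracker = MaintenanceTracker()
tracker.start_drain("slave-42", grace_secs=3600)
assert not tracker.should_offer("slave-42")
assert tracker.should_offer("slave-7")
```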





[jira] [Commented] (MESOS-1758) Freezer failure leads to lost task during container destruction.

2014-09-07 Thread Joe Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124977#comment-14124977
 ] 

Joe Smith commented on MESOS-1758:
--

Can we make sure this gets into 0.21.0? This is continuing to hit us with LOST 
tasks, so I just want to make sure it gets included.

Thanks!

 Freezer failure leads to lost task during container destruction.
 

 Key: MESOS-1758
 URL: https://issues.apache.org/jira/browse/MESOS-1758
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Benjamin Mahler

 In the past we've seen numerous issues around the freezer. Lately, on the 
 2.6.44 kernel, we've seen issues where we're unable to freeze the cgroup:
 (1) An oom occurs.
 (2) No indication of oom in the kernel logs.
 (3) The slave is unable to freeze the cgroup.
 (4) The task is marked as lost.
 {noformat}
 I0903 16:46:24.956040 25469 mem.cpp:575] Memory limit exceeded: Requested: 
 15488MB Maximum Used: 15488MB
 MEMORY STATISTICS:
 cache 7958691840
 rss 8281653248
 mapped_file 9474048
 pgpgin 4487861
 pgpgout 522933
 pgfault 2533780
 pgmajfault 11
 inactive_anon 0
 active_anon 8281653248
 inactive_file 7631708160
 active_file 326852608
 unevictable 0
 hierarchical_memory_limit 16240345088
 total_cache 7958691840
 total_rss 8281653248
 total_mapped_file 9474048
 total_pgpgin 4487861
 total_pgpgout 522933
 total_pgfault 2533780
 total_pgmajfault 11
 total_inactive_anon 0
 total_active_anon 8281653248
 total_inactive_file 7631728640
 total_active_file 326852608
 total_unevictable 0
 I0903 16:46:24.956848 25469 containerizer.cpp:1041] Container 
 bbb9732a-d600-4c1b-b326-846338c608c3 has reached its limit for resource 
 mem(*):1.62403e+10 and will be terminated
 I0903 16:46:24.957427 25469 containerizer.cpp:909] Destroying container 
 'bbb9732a-d600-4c1b-b326-846338c608c3'
 I0903 16:46:24.958664 25481 cgroups.cpp:2192] Freezing cgroup 
 /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:46:34.959529 25488 cgroups.cpp:2209] Thawing cgroup 
 /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:46:34.962070 25482 cgroups.cpp:1404] Successfullly thawed cgroup 
 /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
 1.710848ms
 I0903 16:46:34.962658 25479 cgroups.cpp:2192] Freezing cgroup 
 /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:46:44.963349 25488 cgroups.cpp:2209] Thawing cgroup 
 /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:46:44.965631 25472 cgroups.cpp:1404] Successfullly thawed cgroup 
 /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
 1.588224ms
 I0903 16:46:44.966356 25472 cgroups.cpp:2192] Freezing cgroup 
 /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:46:54.967254 25488 cgroups.cpp:2209] Thawing cgroup 
 /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:46:56.008447 25475 cgroups.cpp:1404] Successfullly thawed cgroup 
 /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
 2.15296ms
 I0903 16:46:56.009071 25466 cgroups.cpp:2192] Freezing cgroup 
 /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:47:06.010329 25488 cgroups.cpp:2209] Thawing cgroup 
 /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:47:06.012538 25467 cgroups.cpp:1404] Successfullly thawed cgroup 
 /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
 1.643008ms
 I0903 16:47:06.013216 25467 cgroups.cpp:2192] Freezing cgroup 
 /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:47:12.516348 25480 slave.cpp:3030] Current usage 9.57%. Max allowed 
 age: 5.630238827780799days
 I0903 16:47:16.015192 25488 cgroups.cpp:2209] Thawing cgroup 
 /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:47:16.017043 25486 cgroups.cpp:1404] Successfullly thawed cgroup 
 /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3 after 
 1.511168ms
 I0903 16:47:16.017555 25480 cgroups.cpp:2192] Freezing cgroup 
 /sys/fs/cgroup/freezer/mesos/bbb9732a-d600-4c1b-b326-846338c608c3
 I0903 16:47:19.862746 25483 http.cpp:245] HTTP request for 
 '/slave(1)/stats.json'
 E0903 16:47:24.960055 25472 slave.cpp:2557] Termination of executor 'E' of 
 framework '201104070004-002563-' failed: Failed to destroy container: 
 discarded future
 I0903 16:47:24.962054 25472 slave.cpp:2087] Handling status update TASK_LOST 
 (UUID: c0c1633b-7221-40dc-90a2-660ef639f747) for task T of framework 
 201104070004-002563- from @0.0.0.0:0
 I0903 16:47:24.963470 25469 mem.cpp:293] Updated 'memory.soft_limit_in_bytes' 
 to 128MB for container bbb9732a-d600-4c1b-b326-846338c608c3
 {noformat}
 

[jira] [Commented] (MESOS-1765) Use PID namespace to avoid freezing cgroup

2014-09-07 Thread Joe Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1765?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14124979#comment-14124979
 ] 

Joe Smith commented on MESOS-1765:
--

[~wangcong] can you share a link to the kernel bug? (Or a pointer to more 
discussion?) Sounds like we should also keep tabs on fixing that as well.

 Use PID namespace to avoid freezing cgroup
 --

 Key: MESOS-1765
 URL: https://issues.apache.org/jira/browse/MESOS-1765
 Project: Mesos
  Issue Type: Story
  Components: containerization
Reporter: Cong Wang

 There is a known kernel issue when we freeze the whole cgroup upon OOM. 
 Mesos can probably just use a PID namespace, so that we only need to kill 
 the init of the PID namespace instead of freezing all the processes and 
 killing them one by one. But I am not quite sure whether this would break 
 the existing code.





[jira] [Created] (MESOS-1678) Any framework with credentials can kill any other framework via http

2014-08-06 Thread Joe Smith (JIRA)
Joe Smith created MESOS-1678:


 Summary: Any framework with credentials can kill any other 
framework via http
 Key: MESOS-1678
 URL: https://issues.apache.org/jira/browse/MESOS-1678
 Project: Mesos
  Issue Type: Story
Reporter: Joe Smith


Looking through [the review introducing the /shutdown http 
endpoint|https://reviews.apache.org/r/22832], it appears that any framework's 
credentials can be used to shut down any other framework:

Around line 650:
{code}
  foreach (const Credential& credential, master->credentials.get().http()) {
    if (credential.principal() == username &&
        (!credential.has_secret() || credential.secret() == password)) {
      // TODO(ijimenez): Make removeFramework asynchronous.
      master->removeFramework(framework);
      return OK();
    }
  }
{code}

Thanks to [~adam-mesos], I looked into the [authorization 
doc|https://github.com/apache/mesos/blob/master/docs/authorization.md]; however, 
I don't see where the ACL checking happens within that code.
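For illustration, the missing step is an authorization check layered on top of the credential match above: authentication proves who the caller is, but something still has to decide which frameworks that caller may shut down. A hedged sketch with a hypothetical ACL shape (not the actual Mesos ACL format):

```python
def may_shutdown(principal, framework_principal, acl):
    """Return True if `principal` is authorized to shut down the
    framework registered under `framework_principal`. The ACL shape
    here is invented for illustration."""
    allowed = acl.get("shutdown_frameworks", {}).get(principal, set())
    return framework_principal in allowed or "ANY" in allowed

acl = {"shutdown_frameworks": {"ops": {"ANY"}, "svc-a": {"svc-a"}}}
assert may_shutdown("ops", "svc-b", acl)        # operator may kill anything
assert may_shutdown("svc-a", "svc-a", acl)      # a framework may kill itself
assert not may_shutdown("svc-a", "svc-b", acl)  # but not its neighbors
```

With a check like this in the /shutdown handler, a valid credential for one framework would no longer be enough to remove another.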



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Closed] (MESOS-1678) Any framework with credentials can kill any other framework via http

2014-08-06 Thread Joe Smith (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joe Smith closed MESOS-1678.


Resolution: Duplicate

 Any framework with credentials can kill any other framework via http
 

 Key: MESOS-1678
 URL: https://issues.apache.org/jira/browse/MESOS-1678
 Project: Mesos
  Issue Type: Story
Reporter: Joe Smith

 Looking through [the review introducing the /shutdown http 
 endpoint|https://reviews.apache.org/r/22832], it appears that any framework's 
 credentials can be used to shut down any other framework:
 Around line 650:
 {code}
   foreach (const Credential& credential, master->credentials.get().http()) {
     if (credential.principal() == username &&
         (!credential.has_secret() || credential.secret() == password)) {
       // TODO(ijimenez): Make removeFramework asynchronous.
       master->removeFramework(framework);
       return OK();
     }
   }
 {code}
 Thanks to [~adam-mesos], I looked into the [authorization 
 doc|https://github.com/apache/mesos/blob/master/docs/authorization.md]; 
 however, I don't see where the ACL checking happens within that code.





[jira] [Commented] (MESOS-1343) Authorize HTTP endpoints through ACLs

2014-08-06 Thread Joe Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088411#comment-14088411
 ] 

Joe Smith commented on MESOS-1343:
--

For 0.20.0, we either need to enable the ACLs or remove the /shutdown HTTP 
endpoint. As is, any framework's credentials can be used to shut down other 
frameworks.

 Authorize HTTP endpoints through ACLs
 -

 Key: MESOS-1343
 URL: https://issues.apache.org/jira/browse/MESOS-1343
 Project: Mesos
  Issue Type: Story
Reporter: Vinod Kone







[jira] [Commented] (MESOS-1343) Authorize HTTP endpoints through ACLs

2014-08-06 Thread Joe Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14088432#comment-14088432
 ] 

Joe Smith commented on MESOS-1343:
--

Clarified with [~benjaminhindman]: these are gated via specific HTTP principals 
only, so a framework will not be able to use its principal to send requests via 
HTTP. However, these HTTP endpoints still need to have an ACL applied to them.

 Authorize HTTP endpoints through ACLs
 -

 Key: MESOS-1343
 URL: https://issues.apache.org/jira/browse/MESOS-1343
 Project: Mesos
  Issue Type: Story
Reporter: Vinod Kone




