[jira] [Created] (AURORA-1800) Support Mesos Maintenance primitives

2016-10-18 Thread Ankit Khera (JIRA)
Ankit Khera created AURORA-1800:
---

 Summary: Support Mesos Maintenance primitives
 Key: AURORA-1800
 URL: https://issues.apache.org/jira/browse/AURORA-1800
 Project: Aurora
  Issue Type: Story
  Components: Maintenance
Reporter: Ankit Khera


Support Mesos Maintenance primitives

Mesos 0.25.0 introduced maintenance primitives, which let operators post a 
maintenance schedule for machines.

More details here: http://mesos.apache.org/documentation/latest/maintenance/

This is a request to have Aurora start using these primitives and drain 
machines in an SLA-aware manner.
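
For context, a minimal sketch of what a maintenance schedule payload might look like. The field names ({{windows}}, {{machine_ids}}, {{unavailability}}) follow my reading of the maintenance documentation linked above and should be checked against the Mesos version in use. The helper name {{maintenance_schedule}}, the hostname, and the IP are hypothetical. The sketch only builds the JSON; it does not post it to the master.

```python
import json

NANOS_PER_SECOND = 10 ** 9


def maintenance_schedule(hostnames_to_ips, start_epoch_secs, duration_secs):
    """Build a one-window maintenance schedule as a JSON-serializable dict.

    Assumption: times are expressed in nanoseconds, as in the Mesos
    maintenance documentation.
    """
    machine_ids = [
        {"hostname": hostname, "ip": ip}
        for hostname, ip in sorted(hostnames_to_ips.items())
    ]
    window = {
        "machine_ids": machine_ids,
        "unavailability": {
            "start": {"nanoseconds": start_epoch_secs * NANOS_PER_SECOND},
            "duration": {"nanoseconds": duration_secs * NANOS_PER_SECOND},
        },
    }
    return {"windows": [window]}


# Hypothetical machine and a one-hour window, for illustration only.
schedule = maintenance_schedule({"agent1.example.com": "10.0.0.1"}, 1476835200, 3600)
print(json.dumps(schedule, indent=2, sort_keys=True))
```

An SLA-aware drain in Aurora would presumably read such windows and drain the listed machines before the window starts, but how that maps onto Aurora's existing maintenance commands is left open by this ticket.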





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (AURORA-1798) resolv.conf is not copied when using the Mesos containerizer with a Docker image

2016-10-18 Thread Justin Pinkul (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15587070#comment-15587070
 ] 

Justin Pinkul commented on AURORA-1798:
---

Review: https://reviews.apache.org/r/53003/

> resolv.conf is not copied when using the Mesos containerizer with a Docker 
> image
> 
>
> Key: AURORA-1798
> URL: https://issues.apache.org/jira/browse/AURORA-1798
> Project: Aurora
>  Issue Type: Bug
>  Components: Executor
>Affects Versions: 0.16.0
>Reporter: Justin Pinkul
>Assignee: Justin Pinkul
>
> When Thermos launches a task using a Docker image it mounts the image as a 
> volume and manually chroots into it. One consequence of this is the logic 
> inside of the {{network/cni}} isolator that copies {{resolv.conf}} from the 
> host into the new rootfs is bypassed. The Thermos executor should manually 
> copy this file into the rootfs until Mesos pod support is implemented.
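
A minimal sketch of the manual copy described above, assuming the executor knows the path of the mounted image. The function name {{copy_resolv_conf}} and its parameters are hypothetical and are not taken from the Thermos code base or from the linked review.

```python
import os
import shutil
import tempfile


def copy_resolv_conf(rootfs, host_path="/etc/resolv.conf"):
    """Copy the host's resolv.conf into <rootfs>/etc/resolv.conf.

    Returns the destination path, or None if the host file does not exist.
    """
    if not os.path.isfile(host_path):
        return None
    etc_dir = os.path.join(rootfs, "etc")
    if not os.path.isdir(etc_dir):
        os.makedirs(etc_dir)
    destination = os.path.join(etc_dir, "resolv.conf")
    shutil.copyfile(host_path, destination)
    return destination


# Demonstration against temporary directories rather than a real image.
scratch = tempfile.mkdtemp()
fake_host_file = os.path.join(scratch, "resolv.conf")
with open(fake_host_file, "w") as f:
    f.write("nameserver 10.0.0.2\n")
fake_rootfs = os.path.join(scratch, "rootfs")
os.makedirs(fake_rootfs)
copied = copy_resolv_conf(fake_rootfs, host_path=fake_host_file)
```

The copy would have to happen before the chroot, since the host file is not reachable afterwards.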





[jira] [Created] (AURORA-1799) Thermos does not handle low memory scenarios gracefully

2016-10-18 Thread Zameer Manji (JIRA)
Zameer Manji created AURORA-1799:


 Summary: Thermos does not handle low memory scenarios gracefully
 Key: AURORA-1799
 URL: https://issues.apache.org/jira/browse/AURORA-1799
 Project: Aurora
  Issue Type: Bug
Reporter: Zameer Manji


Background:
In an environment where Aurora is used to launch Docker containers via the 
DockerContainerizer, it was observed that some tasks would not be killed.

A task was allocated a small amount of memory but demanded much more, which 
caused the Linux OOM killer to be invoked. Unlike with the MesosContainerizer, 
the agent doesn't tear down the container when the OOM killer is invoked. 
Instead, the OOM killer just kills a process in the container, and Thermos and 
Mesos are unaware (unless a process directly launched by Thermos is killed).

I observed in the scheduler logs that the scheduler was trying to kill a 
container every reconciliation period, but it never died. The slave's logs 
indicated that it received the killTask RPC and forwarded it to Thermos.

The Thermos logs had several entries like these every hour:
{noformat}
I1018 20:39:18.102894 6 executor_base.py:45] Executor [aaeac4c8-2b2f-4351-874b-a16bea1b36b0-S147]: Activating kill manager.
I1018 20:39:18.103034 6 executor_base.py:45] Executor [aaeac4c8-2b2f-4351-874b-a16bea1b36b0-S147]: killTask returned.
I1018 21:39:17.859935 6 executor_base.py:45] Executor [aaeac4c8-2b2f-4351-874b-a16bea1b36b0-S147]: killTask got task_id: value: ""
{noformat}

However, the task was never killed. Looking at the stderr of Thermos, I saw 
the following entries:
{noformat}
Logged from file resource.py, line 155
Traceback (most recent call last):
  File "/usr/lib/python2.7/logging/__init__.py", line 883, in emit
    self.flush()
  File "/usr/lib/python2.7/logging/__init__.py", line 843, in flush
    self.stream.flush()
IOError: [Errno 12] Cannot allocate memory
{noformat}

and 
{noformat}
Logged from file thermos_task_runner.py, line 171
Traceback (most recent call last):
  File "/root/.pex/install/twitter.common.exceptions-0.3.3-py2-none-any.whl.2a67b833b1517d179ef1c8dc6f2dac1023d51e3c/twitter.common.exceptions-0.3.3-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
  File "apache/aurora/executor/status_manager.py", line 47, in run
  File "apache/aurora/executor/common/status_checker.py", line 97, in status
  File "apache/aurora/executor/thermos_task_runner.py", line 358, in status
  File "apache/aurora/executor/thermos_task_runner.py", line 186, in compute_status
  File "apache/aurora/executor/thermos_task_runner.py", line 136, in task_state
  File "apache/thermos/monitoring/monitor.py", line 118, in task_state
  File "apache/thermos/monitoring/monitor.py", line 114, in get_state
  File "apache/thermos/monitoring/monitor.py", line 77, in _apply_states
  File "/root/.pex/install/twitter.common.recordio-0.3.3-py2-none-any.whl.9f1e9394eca1bc33ad7d10ae3025301866824139/twitter.common.recordio-0.3.3-py2-none-any.whl/twitter/common/recordio/recordio.py", line 182, in try_read
    class InvalidTypeException(Error): pass
  File "/root/.pex/install/twitter.common.recordio-0.3.3-py2-none-any.whl.9f1e9394eca1bc33ad7d10ae3025301866824139/twitter.common.recordio-0.3.3-py2-none-any.whl/twitter/common/recordio/recordio.py", line 168, in read
    return RecordIO.Reader.do_read(self._fp, self._codec)
  File "/root/.pex/install/twitter.common.recordio-0.3.3-py2-none-any.whl.9f1e9394eca1bc33ad7d10ae3025301866824139/twitter.common.recordio-0.3.3-py2-none-any.whl/twitter/common/recordio/recordio.py", line 135, in do_read
    header = fp.read(RecordIO.RECORD_HEADER_SIZE)
  File "/root/.pex/install/twitter.common.recordio-0.3.3-py2-none-any.whl.9f1e9394eca1bc33ad7d10ae3025301866824139/twitter.common.recordio-0.3.3-py2-none-any.whl/twitter/common/recordio/filelike.py", line 81, in read
    return self._fp.read(length)
IOError: [Errno 12] Cannot allocate memory
{noformat}

It seems that in the regular paths of reading checkpoints or logging data, 
Thermos would get an IOError. Some part of twitter.common installs an 
excepthook to log the exception, but we don't seem to do anything else.

I think we should probably install our own exception hook to send a 
{{LOST_TASK}} with the exception information instead of failing to kill the 
task.
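
A rough sketch of what such a hook could look like, assuming the executor has some callable for sending status updates. The names {{install_lost_task_hook}} and {{send_status_update}} are hypothetical, and the state string is taken verbatim from this ticket. Note that {{sys.excepthook}} only sees exceptions that are otherwise uncaught, so a real change would also have to cover the code paths that currently catch and log the error.

```python
import sys
import traceback


def install_lost_task_hook(send_status_update):
    """Install a sys.excepthook that reports uncaught exceptions as LOST_TASK.

    send_status_update is a hypothetical callable taking (state, message).
    """
    previous_hook = sys.excepthook

    def hook(exc_type, exc_value, exc_tb):
        message = "".join(traceback.format_exception_only(exc_type, exc_value)).strip()
        try:
            send_status_update("LOST_TASK", message)
        finally:
            # Always fall through to the previously installed hook so the
            # exception is still logged as before.
            previous_hook(exc_type, exc_value, exc_tb)

    sys.excepthook = hook
    return hook


# Demonstration: record updates in a list instead of talking to Mesos.
updates = []
hook = install_lost_task_hook(lambda state, message: updates.append((state, message)))
try:
    raise IOError(12, "Cannot allocate memory")
except IOError:
    hook(*sys.exc_info())
```

Under real memory exhaustion the hook itself may fail to allocate, so it should do as little work as possible before sending the update.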





[jira] [Commented] (AURORA-1797) Add full support for ACI containers

2016-10-18 Thread Joshua Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/AURORA-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15585602#comment-15585602
 ] 

Joshua Cohen commented on AURORA-1797:
--

What happened here is that Mesos did not support fetching AppC images from a 
registry when AppC support was first added to Aurora. Now that Mesos supports 
AppC simple discovery, we should update Aurora as well.

> Add full support for ACI containers
> ---
>
> Key: AURORA-1797
> URL: https://issues.apache.org/jira/browse/AURORA-1797
> Project: Aurora
>  Issue Type: Story
>Reporter: Thomas Bach
>
> For {{AppcImage}} to work properly the Mesos fetcher needs the {{os}} and 
> {{arch}} labels in the image description. The relevant code for this can be 
> found here: 
> https://github.com/apache/mesos/blob/171d214afa92cce56d8e1c350d6b2968887e6f15/src/slave/containerizer/mesos/provisioner/appc/fetcher.cpp#L61
> At the moment {{AppcImage}} only supports {{name}} and {{image_id}} as 
> attributes. These are sufficient for the tests in 
> https://github.com/apache/aurora/blob/master/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh#L427
>  to pass properly. Here the fetcher is never invoked because the directory 
> structure is laid out in such a way that Mesos finds the image in its cache.
> At the moment it is possible to work around this issue by giving Mesos the 
> additional information via the {{default_container_info}} argument.





[jira] [Updated] (AURORA-1797) Add full support for ACI containers

2016-10-18 Thread Joshua Cohen (JIRA)

 [ 
https://issues.apache.org/jira/browse/AURORA-1797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joshua Cohen updated AURORA-1797:
-
Description: 
For {{AppcImage}} to work properly the Mesos fetcher needs the {{os}} and 
{{arch}} labels in the image description. The relevant code for this can be 
found here: 
https://github.com/apache/mesos/blob/171d214afa92cce56d8e1c350d6b2968887e6f15/src/slave/containerizer/mesos/provisioner/appc/fetcher.cpp#L61

At the moment {{AppcImage}} only supports {{name}} and {{image_id}} as 
attributes. These are sufficient for the tests in 
https://github.com/apache/aurora/blob/master/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh#L427
 to pass properly. Here the fetcher is never invoked because the directory 
structure is laid out in such a way that Mesos finds the image in its cache.

At the moment it is possible to work around this issue by giving Mesos the 
additional information via the {{default_container_info}} argument.

  was:
For {{AppcImage}} to work properly the Mesos fetcher needs the {{os}} and 
{{arch}} labels in the image description. The relevant code for this can be 
found here: 
https://github.com/apache/mesos/blob/171d214afa92cce56d8e1c350d6b2968887e6f15/src/slave/containerizer/mesos/provisioner/appc/fetcher.cpp#L61

At the moment {{AppcImage}} only supports {{name}} and {{image_id}} as 
attributes. These are sufficient for the tests in 
https://github.com/apache/aurora/blob/master/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh#L427
 to pass properly. Here the fetcher is never invoked because the directory 
structure is laid out in such a way that Mesos finds the image in its cache.

At the moment it is possible to work around this issue by giving Mesos the 
additional information via the {{default_container_info argument}}.


> Add full support for ACI containers
> ---
>
> Key: AURORA-1797
> URL: https://issues.apache.org/jira/browse/AURORA-1797
> Project: Aurora
>  Issue Type: Story
>Reporter: Thomas Bach
>
> For {{AppcImage}} to work properly the Mesos fetcher needs the {{os}} and 
> {{arch}} labels in the image description. The relevant code for this can be 
> found here: 
> https://github.com/apache/mesos/blob/171d214afa92cce56d8e1c350d6b2968887e6f15/src/slave/containerizer/mesos/provisioner/appc/fetcher.cpp#L61
> At the moment {{AppcImage}} only supports {{name}} and {{image_id}} as 
> attributes. These are sufficient for the tests in 
> https://github.com/apache/aurora/blob/master/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh#L427
>  to pass properly. Here the fetcher is never invoked because the directory 
> structure is laid out in such a way that Mesos finds the image in its cache.
> At the moment it is possible to work around this issue by giving Mesos the 
> additional information via the {{default_container_info}} argument.





[jira] [Created] (AURORA-1797) Add full support for ACI containers

2016-10-18 Thread Thomas Bach (JIRA)
Thomas Bach created AURORA-1797:
---

 Summary: Add full support for ACI containers
 Key: AURORA-1797
 URL: https://issues.apache.org/jira/browse/AURORA-1797
 Project: Aurora
  Issue Type: Story
Reporter: Thomas Bach


For {{AppcImage}} to work properly the Mesos fetcher needs the {{os}} and 
{{arch}} labels in the image description. The relevant code for this can be 
found here: 
https://github.com/apache/mesos/blob/171d214afa92cce56d8e1c350d6b2968887e6f15/src/slave/containerizer/mesos/provisioner/appc/fetcher.cpp#L61

At the moment {{AppcImage}} only supports {{name}} and {{image_id}} as 
attributes. These are sufficient for the tests in 
https://github.com/apache/aurora/blob/master/src/test/sh/org/apache/aurora/e2e/test_end_to_end.sh#L427
 to pass properly. Here the fetcher is never invoked because the directory 
structure is laid out in such a way that Mesos finds the image in its cache.

At the moment it is possible to work around this issue by giving Mesos the 
additional information via the {{default_container_info argument}}.


