[jira] [Commented] (MESOS-9162) Unkillable pod container stuck in ISOLATING

2018-08-29 - A. Dukhovniy (JIRA)


[ https://issues.apache.org/jira/browse/MESOS-9162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16596432#comment-16596432 ]

A. Dukhovniy commented on MESOS-9162:
-

We had another manifestation of the unkillable pod task, this time in a 
different test. There is one thing that all those tests do:
1. start a pod
2. kill it (either on the agent directly or through Marathon)
3. restart the pod

The tests *never fail in step 1* but *always fail in step 3*; see the repro 
sketch below. Also, all tests use the same pod definition, *which uses a 
volume*.
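
A minimal sketch of that loop against Marathon's pods API, assuming a 
Marathon endpoint at a hypothetical {{MARATHON_URL}}; the polling helper, 
timeouts, and status values are illustration-only assumptions, not our actual 
test harness:
{code}
#!/usr/bin/env python3
# Hypothetical repro loop for the start -> kill -> restart sequence.
# MARATHON_URL, timeouts, and status values are assumptions, not from this ticket.
import json
import time

import requests

MARATHON_URL = "http://localhost:8080"  # hypothetical Marathon endpoint
POD_ID = "/simple-pod"


def wait_for(expected, timeout=120):
    """Poll the pod status endpoint until it reports `expected` or we time out."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        resp = requests.get(f"{MARATHON_URL}/v2/pods{POD_ID}::status")
        if resp.ok and resp.json().get("status") == expected:
            return True
        time.sleep(2)
    return False


with open("simple-pod.json") as f:  # the pod definition quoted below
    pod = json.load(f)

# Step 1: start the pod -- this step never fails in our tests.
requests.post(f"{MARATHON_URL}/v2/pods", json=pod)
assert wait_for("STABLE")

# Step 2: kill the pod.
requests.delete(f"{MARATHON_URL}/v2/pods{POD_ID}")
time.sleep(5)

# Step 3: restart the pod -- this is where the container gets stuck in ISOLATING.
requests.post(f"{MARATHON_URL}/v2/pods", json=pod)
assert wait_for("STABLE")
{code}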

I won't attach new logs since the symptoms (container stuck in ISOLATING) 
look the same.
/cc [~abudnik] [~gilbert] [~jieyu]

> Unkillable pod container stuck in ISOLATING
> ---
>
>                  Key: MESOS-9162
>                  URL: https://issues.apache.org/jira/browse/MESOS-9162
>              Project: Mesos
>           Issue Type: Bug
>           Components: containerization
>     Affects Versions: 1.6.0, 1.7.0
>             Reporter: A. Dukhovniy
>             Assignee: Gilbert Song
>             Priority: Major
>               Labels: container-stuck
>          Attachments: dcos-marathon.service.log, dcos-mesos-master.service.gz, 
>                       dcos-mesos-slave.service.gz, diagnostics.zip, 
>                       sandbox_10_10_0_222_var_lib.tar.gz
>
>
> We have a simple test that launches a pod with two containers (one writes to 
> a file and the other reads from it). This test is flaky because one of the 
> containers sometimes fails to start.
> Marathon app definition:
> {code:java}
> {
>   "id": "/simple-pod",
>   "scaling": {
>     "kind": "fixed",
>     "instances": 1
>   },
>   "environment": {
>     "PING": "PONG"
>   },
>   "containers": [
>     {
>       "name": "ct1",
>       "resources": {
>         "cpus": 0.1,
>         "mem": 32
>       },
>       "image": {
>         "kind": "DOCKER",
>         "id": "busybox"
>       },
>       "exec": {
>         "command": {
>           "shell": "while true; do echo the current time is $(date) > ./test-v1/clock; sleep 1; done"
>         }
>       },
>       "volumeMounts": [
>         {
>           "name": "v1",
>           "mountPath": "test-v1"
>         }
>       ]
>     },
>     {
>       "name": "ct2",
>       "resources": {
>         "cpus": 0.1,
>         "mem": 32
>       },
>       "exec": {
>         "command": {
>           "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep 1; done"
>         }
>       },
>       "volumeMounts": [
>         {
>           "name": "v1",
>           "mountPath": "etc"
>         },
>         {
>           "name": "v2",
>           "mountPath": "docker"
>         }
>       ]
>     }
>   ],
>   "networks": [
>     {
>       "mode": "host"
>     }
>   ],
>   "volumes": [
>     {
>       "name": "v1"
>     },
>     {
>       "name": "v2",
>       "host": "/var/lib/docker"
>     }
>   ]
> }
> {code}
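>
> To make the intended interaction explicit: both containers mount the shared 
> ephemeral volume {{v1}} (ct1 under {{test-v1}}, ct2 under {{etc}}), so ct1's 
> writer loop and ct2's reader loop operate on the same {{clock}} file. Below 
> is a rough local stand-in for that interaction; plain Python with a made-up 
> {{shared/clock}} path, nothing Mesos-specific:
> {code}
> # Local stand-in for the two pod containers sharing volume v1.
> # "shared/clock" plays the role of the pod volume; the path is made up.
> import datetime
> import multiprocessing
> import os
> import time
>
>
> def writer(path):
>     # ct1: write the current time to the shared clock file once a second.
>     for _ in range(5):
>         with open(path, "w") as f:
>             f.write(f"the current time is {datetime.datetime.now()}\n")
>         time.sleep(1)
>
>
> def reader(path):
>     # ct2: print PONG followed by the clock file, like the second shell loop.
>     for _ in range(5):
>         if os.path.exists(path):
>             with open(path) as f:
>                 print("PONG", f.read().strip())
>         time.sleep(1)
>
>
> if __name__ == "__main__":
>     os.makedirs("shared", exist_ok=True)
>     clock = os.path.join("shared", "clock")
>     procs = [multiprocessing.Process(target=writer, args=(clock,)),
>              multiprocessing.Process(target=reader, args=(clock,))]
>     for p in procs:
>         p.start()
>     for p in procs:
>         p.join()
> {code}
>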
> During the test, Marathon tries to launch the pod but doesn't receive a 
> {{TASK_RUNNING}} for the first container, so after 2 minutes it decides to 
> kill the pod, which also fails.
> The agent sandbox (attached to this ticket minus the Docker layers, since 
> they are too big to attach) shows that one of the containers was never 
> started properly; the last line in the agent log says:
> {code}
> Transitioning the state of container ff4f4fdc-9327-42fb-be40-29e919e15aee.e9b05652-e779-46f8-9b76-b2e1ce7e5940 from PREPARING to ISOLATING
> {code}
> Until then, the log looks pretty unspectacular.
> Afterwards, Marathon repeatedly tries to kill the container, without 
> success; the executor receives the kill requests but never sends anything 
> back:
> {code}
> I0816 22:52:53.111995 4 default_executor.cpp:204] Received SUBSCRIBED 
> event
> I0816 22:52:53.112520 4 default_executor.cpp:208] Subscribed executor on 
> 10.10.0.222
> I0816 22:52:53.112783 4 default_executor.cpp:204] Received LAUNCH_GROUP 
> event
> I0816 22:52:53.11651611 default_executor.cpp:428] Setting 
> 'MESOS_CONTAINER_IP' to: 10.10.0.222
> I0816 22:52:53.169596 4 default_executor.cpp:204] Received ACKNOWLEDGED 
> event
> I0816 22:52:53.19441610 default_executor.cpp:204] Received ACKNOWLEDGED 
> event
> I0816 22:54:50.559470 8 default_executor.cpp:204] Received KILL event
> I0816 22:54:50.559496 8 default_executor.cpp:1251] Received kill for task 
> 'simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct1'
> I0816 22:54:50.559737 4 default_executor.cpp:204] Received KILL event
> I0816 22:54:50.559751 4 default_executor.cpp:1251] Received kill for task 
> 'simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct2'
> ...
> {code}
> Relevant IDs for grepping the logs:
> {code}
> Marathon app id: /simple-pod-bcc8f180b611494aa972520b8b650ca9
> Mesos task id: simple-pod-bcc8f180b611494aa972520b8b650ca9.instance-1ad9ecbb-a1a7-11e8-b35a-6e17842c13e2.ct1
> Mesos container id: e9b05652-e779-46f8-9b76-b2e1ce7e5940
> {code}
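>
> For anyone triaging similar logs, a small helper sketch; it is not part of 
> the ticket, the log-line patterns are taken from the excerpts above, and the 
> log file paths are placeholders. It flags containers whose last recorded 
> state transition was to ISOLATING and counts how many KILLs each task 
> received:
> {code}
> # Scan an agent log for containers stuck in ISOLATING and an executor log
> # for tasks that received repeated KILLs. Patterns match the excerpts above.
> import collections
> import re
> import sys
>
> TRANSITION = re.compile(
>     r"Transitioning the state of container (\S+) from \w+ to (\w+)")
> KILL = re.compile(r"Received kill for task '([^']+)'")
>
>
> def stuck_containers(agent_log):
>     """Containers whose last recorded state transition was to ISOLATING."""
>     last = {}
>     with open(agent_log) as log:
>         for line in log:
>             m = TRANSITION.search(line)
>             if m:
>                 last[m.group(1)] = m.group(2)
>     return [c for c, state in last.items() if state == "ISOLATING"]
>
>
> def kill_counts(executor_log):
>     """How many KILLs each task received; more than one means retries."""
>     counts = collections.Counter()
>     with open(executor_log) as log:
>         for line in log:
>             m = KILL.search(line)
>             if m:
>                 counts[m.group(1)] += 1
>     return counts
>
>
> if __name__ == "__main__":
>     for container in stuck_containers(sys.argv[1]):
>         print(f"stuck in ISOLATING: {container}")
>     for task, n in kill_counts(sys.argv[2]).items():
>         print(f"{task}: {n} kill(s) received")
> {code}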

[jira] [Commented] (MESOS-9162) Unkillable pod container stuck in ISOLATING

2018-08-17 - A. Dukhovniy (JIRA)


[ https://issues.apache.org/jira/browse/MESOS-9162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16583940#comment-16583940 ]

A. Dukhovniy commented on MESOS-9162:
-

And the complete diagnostics bundle, just in case. [^diagnostics.zip] 




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)