[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-23 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591106#comment-16591106
 ] 

Qian Zhang commented on MESOS-8568:
---

RR: https://reviews.apache.org/r/68495/

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Andrei Budnik
>Assignee: Qian Zhang
>Priority: Blocker
>  Labels: default-executor, health-check, mesosphere
>
> After a successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION`, the checker library calls 
> [waitNestedContainer|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. The checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove the previous nested container before 
> launching a nested container for a subsequent check. Hence, the 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has terminated and can be removed/cleaned up.
> In case of a failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might still have 
> been launched, and a subsequent attempt to remove the container without 
> calling `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.
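
For illustration only, here is a minimal sketch of that ordering expressed
against the agent's v1 operator API; the endpoint, the use of the Python
`requests` library, and the exact JSON field names are assumptions for this
sketch, not the checker library's actual (C++) implementation:

{code}
import requests

AGENT_API = "http://127.0.0.1:5051/api/v1"  # assumed agent endpoint


def wait_then_remove(parent_id, check_id):
    """Block until the nested check container terminates, then remove it."""
    container_id = {"parent": {"value": parent_id}, "value": check_id}

    # WAIT_NESTED_CONTAINER only returns once the container has terminated...
    requests.post(AGENT_API, json={
        "type": "WAIT_NESTED_CONTAINER",
        "wait_nested_container": {"container_id": container_id},
    }).raise_for_status()

    # ...so the subsequent REMOVE_NESTED_CONTAINER should not be rejected with
    # '500 Internal Server Error (Nested container has not terminated yet)'.
    requests.post(AGENT_API, json={
        "type": "REMOVE_NESTED_CONTAINER",
        "remove_nested_container": {"container_id": container_id},
    }).raise_for_status()
{code}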



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-23 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591103#comment-16591103
 ] 

Qian Zhang edited comment on MESOS-8568 at 8/24/18 3:20 AM:


[~vinodkone] Yeah, I noticed that as well. When the I/O switchboard server 
process is launched, it just [waits on a 
promise|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L1181],
 and that promise will only be set when an `ATTACH_CONTAINER_OUTPUT` call is 
made or a `SIGTERM` is sent. In this case, `ATTACH_CONTAINER_OUTPUT` will never 
be made since the check container failed to launch, so we have to wait 5s for 
the `SIGTERM`.

I am not quite sure which cases the [SIGTERM & 5s 
timeout|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L810:L818]
 is for; maybe [~jieyu] and [~klueska] have more info?
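
To make the timing concrete, here is a toy model (plain Python, not the actual
switchboard code) of the behaviour described above: the server blocks on
something that only an attach request or a `SIGTERM` will satisfy, and the
cleanup path falls back to `SIGTERM` only after an assumed 5-second grace
period, so a check container that never attaches always pays the full 5s:

{code}
import os
import signal
import threading
import time

# Stands in for the promise the switchboard server waits on.
attached_or_terminated = threading.Event()
signal.signal(signal.SIGTERM, lambda *_: attached_or_terminated.set())

start = time.time()


def switchboard_server():
    # Blocks until an "attach" request or a SIGTERM sets the event.
    attached_or_terminated.wait()
    print("server exited after %.1fs" % (time.time() - start))


threading.Thread(target=switchboard_server).start()

# The check container failed to launch, so no ATTACH_CONTAINER_OUTPUT ever
# arrives; the only way out is the SIGTERM sent after the grace period.
time.sleep(5)
os.kill(os.getpid(), signal.SIGTERM)
{code}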


was (Author: qianzhang):
[~vinodkone] Yeah, I noticed that as well. When the I/O switchboard server 
process is launched, it just [waits on a 
promise|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L1181],
 and that promise will only be set when an `ATTACH_CONTAINER_OUTPUT` call is 
made or a `SIGTERM` is sent. In this case, `ATTACH_CONTAINER_OUTPUT` will never 
be made since the check container failed to launch, so we have to wait 5s for 
the `SIGTERM`.

I am not quite sure which cases the [5s 
timeout|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L810:L818]
 is for; maybe [~jieyu] and [~klueska] have more info?

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Andrei Budnik
>Assignee: Qian Zhang
>Priority: Blocker
>  Labels: default-executor, health-check, mesosphere
>
> After a successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION`, the checker library calls 
> [waitNestedContainer|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. The checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove the previous nested container before 
> launching a nested container for a subsequent check. Hence, the 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has terminated and can be removed/cleaned up.
> In case of a failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might still have 
> been launched, and a subsequent attempt to remove the container without 
> calling `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-23 Thread Qian Zhang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16591103#comment-16591103
 ] 

Qian Zhang commented on MESOS-8568:
---

[~vinodkone] Yeah, I noticed that as well. When the I/O switchboard server 
process is launched, it just [waits on a 
promise|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L1181],
 and that promise will only be set when an `ATTACH_CONTAINER_OUTPUT` call is 
made or a `SIGTERM` is sent. In this case, `ATTACH_CONTAINER_OUTPUT` will never 
be made since the check container failed to launch, so we have to wait 5s for 
the `SIGTERM`.

I am not quite sure which cases the [5s 
timeout|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L810:L818]
 is for; maybe [~jieyu] and [~klueska] have more info?

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Andrei Budnik
>Assignee: Qian Zhang
>Priority: Blocker
>  Labels: default-executor, health-check, mesosphere
>
> After a successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION`, the checker library calls 
> [waitNestedContainer|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. The checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove the previous nested container before 
> launching a nested container for a subsequent check. Hence, the 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has terminated and can be removed/cleaned up.
> In case of a failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might still have 
> been launched, and a subsequent attempt to remove the container without 
> calling `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9181) Fix the comment in JNI libraries regarding weak reference and GC

2018-08-23 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-9181:
-

 Summary: Fix the comment in JNI libraries regarding weak reference 
and GC
 Key: MESOS-9181
 URL: https://issues.apache.org/jira/browse/MESOS-9181
 Project: Mesos
  Issue Type: Documentation
Reporter: Vinod Kone


Our JNI libraries for MesosSchedulerDriver, v0Mesos and v1Mesos all use weak 
global references to the underlying Java objects, but the comments incorrectly 
state that this will prevent the JVM from GC'ing them. We need to fix these 
comments.

e.g., 
[https://github.com/apache/mesos/blob/master/src/java/jni/org_apache_mesos_v1_scheduler_V1Mesos.cpp#L213]

 

See the JNI spec for details: 
[https://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/functions.html#weak]
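
As a language-agnostic illustration of the point being corrected (using
Python's {{weakref}} here purely as an analogy to the weak global references
described in the linked spec, not JNI code): a weak reference does not keep
its referent alive.

{code}
import gc
import weakref


class SchedulerStandIn:
    """Hypothetical stand-in for the wrapped Java scheduler object."""


obj = SchedulerStandIn()
ref = weakref.ref(obj)
print(ref() is obj)   # True: the object is still strongly reachable.

del obj               # Drop the last strong reference.
gc.collect()
print(ref())          # None: the weak reference did not prevent collection.
{code}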



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks

2018-08-23 Thread Andrei Budnik (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589312#comment-16589312
 ] 

Andrei Budnik edited comment on MESOS-9131 at 8/23/18 6:17 PM:
---

[Test draft 
implementation|https://github.com/abudnik/mesos/commit/cf6e8cbc9aff4cdd350c1f13a2a37a3b5bce656e]

[Fix draft 
implementation|https://github.com/abudnik/mesos/commit/65690c8674902cb3ca55a8dddb4e370447856b0f]


was (Author: abudnik):
[Test draft 
implementation|https://github.com/abudnik/mesos/commit/cf6e8cbc9aff4cdd350c1f13a2a37a3b5bce656e]

[Fix draft 
implementation|https://github.com/abudnik/mesos/commit/a7b6a7d23e4a190e2d3215c02094c03a7cf72d3a]

> Health checks launching nested containers while a container is being 
> destroyed lead to unkillable tasks
> ---
>
> Key: MESOS-9131
> URL: https://issues.apache.org/jira/browse/MESOS-9131
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization
>Affects Versions: 1.5.1
>Reporter: Jan Schlicht
>Assignee: Qian Zhang
>Priority: Blocker
>  Labels: container-stuck
>
> A container might get stuck in {{DESTROYING}} state if there's a command 
> health check that starts new nested containers while its parent container is 
> getting destroyed.
> Here are some logs with unrelated lines removed. The 
> `REMOVE_NESTED_CONTAINER`/`LAUNCH_NESTED_CONTAINER_SESSION` cycle keeps 
> looping afterwards.
> {noformat}
> 2018-04-16 12:37:54: I0416 12:37:54.235877  3863 containerizer.cpp:2807] 
> Container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 has 
> exited
> 2018-04-16 12:37:54: I0416 12:37:54.235914  3863 containerizer.cpp:2354] 
> Destroying container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 in 
> RUNNING state
> 2018-04-16 12:37:54: I0416 12:37:54.235932  3863 containerizer.cpp:2968] 
> Transitioning the state of container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 
> from RUNNING to DESTROYING
> 2018-04-16 12:37:54: I0416 12:37:54.236100  3852 linux_launcher.cpp:514] 
> Asked to destroy container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.237671  3852 linux_launcher.cpp:560] 
> Using freezer to destroy cgroup 
> mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.240327  3852 cgroups.cpp:3060] Freezing 
> cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.244179  3852 cgroups.cpp:1415] 
> Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
>  after 3.814144ms
> 2018-04-16 12:37:54: I0416 12:37:54.250550  3853 cgroups.cpp:3078] Thawing 
> cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.256599  3853 cgroups.cpp:1444] 
> Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
>  after 5.977856ms
> ...
> 2018-04-16 12:37:54: I0416 12:37:54.371117  3837 http.cpp:3502] Processing 
> LAUNCH_NESTED_CONTAINER_SESSION call for container 
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd'
> 2018-04-16 12:37:54: W0416 12:37:54.371692  3842 http.cpp:2758] Failed to 
> launch container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd:
>  Parent container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is 
> in 'DESTROYING' state
> 2018-04-16 12:37:54: W0416 12:37:54.371826  3840 containerizer.cpp:2337] 
> Attempted to destroy unknown container 
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd
> ...
> 2018-04-16 12:37:55: I0416 12:37:55.504456  3856 http.cpp:3078] Processing 
> REMOVE_NESTED_CONTAINER call for container 
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-f3a1238c-7f0f-4db3-bda4-c0ea951d46b6'
> ...
> 2018-04-16 12:37:55: I0416 12:37:55.556367  3857 http.cpp:3502] Processing 
> LAUNCH_NESTED_CONTAINER_SESSION call for container 
> 

[jira] [Commented] (MESOS-9177) Mesos master segfaults when responding to /state requests.

2018-08-23 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590554#comment-16590554
 ] 

Benno Evers commented on MESOS-9177:


https://reviews.apache.org/r/68484/

> Mesos master segfaults when responding to /state requests.
> --
>
> Key: MESOS-9177
> URL: https://issues.apache.org/jira/browse/MESOS-9177
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: mesosphere
>
> {noformat}
>  *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; 
> stack trace: ***
>  @ 0x7f367e7226d0 (unknown)
>  @ 0x7f3681266913 
> _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_
>  @ 0x7f3681266af0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f36812882d0 
> mesos::internal::master::FullFrameworkWriter::operator()()
>  @ 0x7f36812889d0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f368121aef0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f3681241be3 
> _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApprovers_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_
>  @ 0x7f3681242760 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv
>  @ 0x7f368215f60e process::http::OK::OK()
>  @ 0x7f3681219061 
> _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApprovers_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_
>  @ 0x7f36812212c0 
> _ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApprovers_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_
>  @ 0x7f36812215ac 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchINS1_4http8ResponseENS1_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNSA_7RequestERKNS1_5OwnedINSD_15ObjectApprovers_SI_SN_SS_RSI_RSN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSW_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteIS1G_EEOSQ_OSI_OSN_S3_E_IS1J_SQ_SI_SN_St12_PlaceholderILi1EEclEOS3_
>  @ 0x7f36821f3541 process::ProcessBase::consume()
>  @ 0x7f3682209fbc process::ProcessManager::resume()
>  @ 0x7f368220fa76 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
>  @ 0x7f367eefc2b0 (unknown)
>  @ 0x7f367e71ae25 start_thread
>  @ 0x7f367e444bad __clone
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-08-23 Thread Vinod Kone (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590513#comment-16590513
 ] 

Vinod Kone commented on MESOS-8568:
---

Great repro!

One orthogonal question, though: it seems unfortunate that the IOSwitchboard 
takes 5s to complete its cleanup for a container that has failed to launch. 
IIRC there was a 5s timeout in the IOSwitchboard for some unexpected corner 
cases, which is what we seem to be hitting here, but this is an *expected* 
case in some sense. Is there any way we can speed that up?

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Andrei Budnik
>Assignee: Qian Zhang
>Priority: Blocker
>  Labels: default-executor, health-check, mesosphere
>
> After a successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION`, the checker library calls 
> [waitNestedContainer|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. The checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove the previous nested container before 
> launching a nested container for a subsequent check. Hence, the 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has terminated and can be removed/cleaned up.
> In case of a failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might still have 
> been launched, and a subsequent attempt to remove the container without 
> calling `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-9177) Mesos master segfaults when responding to /state requests.

2018-08-23 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590457#comment-16590457
 ] 

Benno Evers edited comment on MESOS-9177 at 8/23/18 4:24 PM:
-

I'm now able to reliably reproduce the segfault on Mesos binaries built 
against boost 1.53 using the following script:

{code}
#!/bin/bash

WORKDIR1=`mktemp`
WORKDIR2=`mktemp`
MESOS_BINDIR=/home/bevers/src/mesos/worktrees/master/build/src/

trap "exit" INT TERM
trap 'kill $(jobs -p)' EXIT

rm -rf $WORKDIR1 $WORKDIR2
$MESOS_BINDIR/mesos-master --work_dir=$WORKDIR1 --ip=127.0.0.1 --port=2323 &
sleep 1

$MESOS_BINDIR/mesos-agent --work_dir=$WORKDIR2 --master=127.0.0.1:2323 
--no-systemd_enable_support --launcher_dir=$MESOS_BINDIR &
sleep 1

$MESOS_BINDIR/mesos-execute --name="silly_task" --master=127.0.0.1:2323 
--command="echo hello"
sleep 1

for i in `seq 1 80`;
do
  wget 127.0.0.1:2323/state  &
done

sleep 1
{code}


was (Author: bennoe):
I'm now able to reliably reproduce the segfault on Mesos binaries built 
against boost 1.53 using the following script:

{code}
#!/bin/bash

WORKDIR1=`mktemp`
WORKDIR2=`mktemp`
MESOS_BINDIR=/home/bevers/src/mesos/worktrees/master/build/src/

trap "exit" INT TERM
trap 'kill $(jobs -p)' EXIT

rm -rf $WORKDIR1 $WORKDIR2
$MESOS_BINDIR/mesos-master --work_dir=$WORKDIR1 --ip=127.0.0.1 --port=2323 
--no-authenticate_frameworks --no-authenticate_http_frameworks 
--no-authenticate_agents --authorizers="local" &
sleep 1

$MESOS_BINDIR/mesos-agent --work_dir=$WORKDIR2 --master=127.0.0.1:2323 
--no-systemd_enable_support --launcher_dir=$MESOS_BINDIR &
sleep 1

$MESOS_BINDIR/mesos-execute --name="silly_task" --master=127.0.0.1:2323 
--command="echo hello"
sleep 1

for i in `seq 1 80`;
do
  wget 127.0.0.1:2323/state  &
done

sleep 1
{code}

> Mesos master segfaults when responding to /state requests.
> --
>
> Key: MESOS-9177
> URL: https://issues.apache.org/jira/browse/MESOS-9177
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: mesosphere
>
> {noformat}
>  *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; 
> stack trace: ***
>  @ 0x7f367e7226d0 (unknown)
>  @ 0x7f3681266913 
> _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_
>  @ 0x7f3681266af0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f36812882d0 
> mesos::internal::master::FullFrameworkWriter::operator()()
>  @ 0x7f36812889d0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f368121aef0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f3681241be3 
> _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApprovers_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_
>  @ 0x7f3681242760 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv
>  @ 0x7f368215f60e process::http::OK::OK()
>  @ 0x7f3681219061 
> _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApprovers_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_
>  @ 0x7f36812212c0 
> 

[jira] [Commented] (MESOS-9177) Mesos master segfaults when responding to /state requests.

2018-08-23 Thread Benno Evers (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590457#comment-16590457
 ] 

Benno Evers commented on MESOS-9177:


I'm now able to reliably reproduce the segfault on Mesos binaries built 
against boost 1.53 using the following script:

{code}
#!/bin/bash

WORKDIR1=`mktemp`
WORKDIR2=`mktemp`
MESOS_BINDIR=/home/bevers/src/mesos/worktrees/master/build/src/

trap "exit" INT TERM
trap 'kill $(jobs -p)' EXIT

rm -rf $WORKDIR1 $WORKDIR2
$MESOS_BINDIR/mesos-master --work_dir=$WORKDIR1 --ip=127.0.0.1 --port=2323 
--no-authenticate_frameworks --no-authenticate_http_frameworks 
--no-authenticate_agents --authorizers="local" &
sleep 1

$MESOS_BINDIR/mesos-agent --work_dir=$WORKDIR2 --master=127.0.0.1:2323 
--no-systemd_enable_support --launcher_dir=$MESOS_BINDIR &
sleep 1

$MESOS_BINDIR/mesos-execute --name="silly_task" --master=127.0.0.1:2323 
--command="echo hello"
sleep 1

for i in `seq 1 80`;
do
  wget 127.0.0.1:2323/state  &
done

sleep 1
{code}

> Mesos master segfaults when responding to /state requests.
> --
>
> Key: MESOS-9177
> URL: https://issues.apache.org/jira/browse/MESOS-9177
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.7.0
>Reporter: Alexander Rukletsov
>Assignee: Benno Evers
>Priority: Blocker
>  Labels: mesosphere
>
> {noformat}
>  *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8; 
> stack trace: ***
>  @ 0x7f367e7226d0 (unknown)
>  @ 0x7f3681266913 
> _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_
>  @ 0x7f3681266af0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f36812882d0 
> mesos::internal::master::FullFrameworkWriter::operator()()
>  @ 0x7f36812889d0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f368121aef0 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f3681241be3 
> _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApprovers_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_
>  @ 0x7f3681242760 
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
>  @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv
>  @ 0x7f368215f60e process::http::OK::OK()
>  @ 0x7f3681219061 
> _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApprovers_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_
>  @ 0x7f36812212c0 
> _ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApprovers_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_
>  @ 0x7f36812215ac 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchINS1_4http8ResponseENS1_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNSA_7RequestERKNS1_5OwnedINSD_15ObjectApprovers_SI_SN_SS_RSI_RSN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSW_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteIS1G_EEOSQ_OSI_OSN_S3_E_IS1J_SQ_SI_SN_St12_PlaceholderILi1EEclEOS3_
>  @ 0x7f36821f3541 process::ProcessBase::consume()
>  @ 0x7f3682209fbc 

[jira] [Commented] (MESOS-9180) tasks get stuck in TASK_KILLING on the default executor

2018-08-23 Thread Kirill Plyashkevich (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590356#comment-16590356
 ] 

Kirill Plyashkevich commented on MESOS-9180:


Somewhat related to MESOS-8679, but in this case the killing is actually being 
retried.

> tasks get stuck in TASK_KILLING on the default executor
> ---
>
> Key: MESOS-9180
> URL: https://issues.apache.org/jira/browse/MESOS-9180
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.6.1
> Environment: Ubuntu 18.04, Ubuntu 16.04
>Reporter: Kirill Plyashkevich
>Priority: Critical
>
> During our load tests, tasks get stuck in the TASK_KILLING state.
> {quote}{noformat}
> I0823 16:30:20.367563 21608 executor.cpp:192] Version: 1.6.1
> I0823 16:30:20.439478 21684 default_executor.cpp:202] Received SUBSCRIBED 
> event
> I0823 16:30:20.441012 21684 default_executor.cpp:206] Subscribed executor on 
> XX.XXX.XX.XXX
> I0823 16:30:20.916216 21665 default_executor.cpp:202] Received LAUNCH_GROUP 
> event
> I0823 16:30:20.917373 21645 default_executor.cpp:426] Setting 
> 'MESOS_CONTAINER_IP' to: 172.26.10.222
> I0823 16:30:22.573794 21658 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:22.575518 21637 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:22.577137 21665 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:33.091509 21642 default_executor.cpp:661] Finished launching 
> tasks [ 
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka,
>  
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis,
>  
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery
>  ] in child containers [ 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.8e04f74f-cb8b-46b9-8758-340455a844c8, 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.fc60bf0f-5814-4ea9-a37f-89ebe3e2f5f7, 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.ab481072-c8ab-4a76-be8b-7f4431220e7b ]
> I0823 16:30:33.091567 21642 default_executor.cpp:685] Waiting on child 
> containers of tasks [ 
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka,
>  
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis,
>  
> test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery
>  ]
> I0823 16:30:33.096014 21647 default_executor.cpp:746] Waiting for child 
> container 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.8e04f74f-cb8b-46b9-8758-340455a844c8 of 
> task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka'
> I0823 16:30:33.096310 21647 default_executor.cpp:746] Waiting for child 
> container 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.fc60bf0f-5814-4ea9-a37f-89ebe3e2f5f7 of 
> task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis'
> I0823 16:30:33.096470 21647 default_executor.cpp:746] Waiting for child 
> container 
> 3680beff-96d2-4ebd-832c-9cbbddf8c507.ab481072-c8ab-4a76-be8b-7f4431220e7b of 
> task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery'
> I0823 16:30:33.521510 21648 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:33.522073 21652 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:33.523569 21679 default_executor.cpp:202] Received ACKNOWLEDGED 
> event
> I0823 16:30:38.593736 21668 checker_process.cpp:814] Output of the COMMAND 
> health check for task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis'
>  (stdout):
> 0
> PONG
> I0823 16:30:38.593777 21668 checker_process.cpp:817] Output of the COMMAND 
> health check for task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis'
>  (stderr):
> I0823 16:30:38.610167 21650 checker_process.cpp:814] Output of the COMMAND 
> health check for task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka'
>  (stdout):
> I0823 16:30:38.610194 21650 checker_process.cpp:817] Output of the COMMAND 
> health check for task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka'
>  (stderr):
> I0823 16:30:38.700561 21681 checker_process.cpp:814] Output of the COMMAND 
> health check for task 
> 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery'
>  (stdout):
> I0823 16:30:38.700598 21681 checker_process.cpp:817] Output of the COMMAND 
> health check for task 
> 

[jira] [Created] (MESOS-9180) tasks get stuck in TASK_KILLING on the default executor

2018-08-23 Thread Kirill Plyashkevich (JIRA)
Kirill Plyashkevich created MESOS-9180:
--

 Summary: tasks get stuck in TASK_KILLING on the default executor
 Key: MESOS-9180
 URL: https://issues.apache.org/jira/browse/MESOS-9180
 Project: Mesos
  Issue Type: Bug
  Components: executor
Affects Versions: 1.6.1
 Environment: Ubuntu 18.04, Ubuntu 16.04
Reporter: Kirill Plyashkevich


During our load tests, tasks get stuck in the TASK_KILLING state.
{quote}{noformat}
I0823 16:30:20.367563 21608 executor.cpp:192] Version: 1.6.1
I0823 16:30:20.439478 21684 default_executor.cpp:202] Received SUBSCRIBED event
I0823 16:30:20.441012 21684 default_executor.cpp:206] Subscribed executor on 
XX.XXX.XX.XXX
I0823 16:30:20.916216 21665 default_executor.cpp:202] Received LAUNCH_GROUP 
event
I0823 16:30:20.917373 21645 default_executor.cpp:426] Setting 
'MESOS_CONTAINER_IP' to: 172.26.10.222
I0823 16:30:22.573794 21658 default_executor.cpp:202] Received ACKNOWLEDGED 
event
I0823 16:30:22.575518 21637 default_executor.cpp:202] Received ACKNOWLEDGED 
event
I0823 16:30:22.577137 21665 default_executor.cpp:202] Received ACKNOWLEDGED 
event
I0823 16:30:33.091509 21642 default_executor.cpp:661] Finished launching tasks 
[ 
test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka,
 
test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis,
 
test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery
 ] in child containers [ 
3680beff-96d2-4ebd-832c-9cbbddf8c507.8e04f74f-cb8b-46b9-8758-340455a844c8, 
3680beff-96d2-4ebd-832c-9cbbddf8c507.fc60bf0f-5814-4ea9-a37f-89ebe3e2f5f7, 
3680beff-96d2-4ebd-832c-9cbbddf8c507.ab481072-c8ab-4a76-be8b-7f4431220e7b ]
I0823 16:30:33.091567 21642 default_executor.cpp:685] Waiting on child 
containers of tasks [ 
test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka,
 
test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis,
 
test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery
 ]
I0823 16:30:33.096014 21647 default_executor.cpp:746] Waiting for child 
container 
3680beff-96d2-4ebd-832c-9cbbddf8c507.8e04f74f-cb8b-46b9-8758-340455a844c8 of 
task 
'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka'
I0823 16:30:33.096310 21647 default_executor.cpp:746] Waiting for child 
container 
3680beff-96d2-4ebd-832c-9cbbddf8c507.fc60bf0f-5814-4ea9-a37f-89ebe3e2f5f7 of 
task 
'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis'
I0823 16:30:33.096470 21647 default_executor.cpp:746] Waiting for child 
container 
3680beff-96d2-4ebd-832c-9cbbddf8c507.ab481072-c8ab-4a76-be8b-7f4431220e7b of 
task 
'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery'
I0823 16:30:33.521510 21648 default_executor.cpp:202] Received ACKNOWLEDGED 
event
I0823 16:30:33.522073 21652 default_executor.cpp:202] Received ACKNOWLEDGED 
event
I0823 16:30:33.523569 21679 default_executor.cpp:202] Received ACKNOWLEDGED 
event
I0823 16:30:38.593736 21668 checker_process.cpp:814] Output of the COMMAND 
health check for task 
'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis'
 (stdout):
0
PONG
I0823 16:30:38.593777 21668 checker_process.cpp:817] Output of the COMMAND 
health check for task 
'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis'
 (stderr):
I0823 16:30:38.610167 21650 checker_process.cpp:814] Output of the COMMAND 
health check for task 
'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka'
 (stdout):
I0823 16:30:38.610194 21650 checker_process.cpp:817] Output of the COMMAND 
health check for task 
'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka'
 (stderr):
I0823 16:30:38.700561 21681 checker_process.cpp:814] Output of the COMMAND 
health check for task 
'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery'
 (stdout):
I0823 16:30:38.700598 21681 checker_process.cpp:817] Output of the COMMAND 
health check for task 
'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery'
 (stderr):
I0823 16:30:42.786908 21649 checker_process.cpp:971] COMMAND health check for 
task 
'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis'
 returned: 0
I0823 16:30:42.787267 21649 default_executor.cpp:1375] Received task health 
update for task 
'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis',
 task is healthy
I0823 16:30:45.156363 21658 default_executor.cpp:202] Received 

[jira] [Commented] (MESOS-9174) Unexpected containers transition from RUNNING to DESTROYING during recovery

2018-08-23 Thread Stephan Erb (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-9174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589939#comment-16589939
 ] 

Stephan Erb commented on MESOS-9174:


I have run a few more experiments:

*Broken setup*: Containers get terminated on agent restarts

* Default setup with new systemd: 
** systemd 237-3~bpo9+1 with options {{Delegate=true}} and 
{{KillMode=control-group}}
** Mesos 1.6.1 with option {{--systemd_enable_support}} 

*Working setups*: Containers survive agent restarts

* Default setup with old systemd:
** systemd 232-25+deb9u4 with options {{Delegate=true}} and 
{{KillMode=control-group}}
** Mesos 1.6.1 with option {{--systemd_enable_support}} 

* New systemd with disabled cgroup interference
** systemd 237-3~bpo9+1 with options {{Delegate=true}} and {{KillMode=process}}
** Mesos 1.6.1 with option {{--no-systemd_enable_support}}

For now, we will ensure that we run an older systemd version across our 
fleets as a workaround.
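
For reference, a sketch of the second working combination above as a systemd
drop-in for the agent unit; the unit name and override path are assumptions
about a typical install, not a prescribed configuration:

{noformat}
# /etc/systemd/system/mesos-slave.service.d/override.conf  (assumed unit name/path)
[Service]
Delegate=true
KillMode=process
{noformat}

together with starting the agent with {{--no-systemd_enable_support}}, so that
neither systemd nor the agent's systemd integration touches the container
cgroups on restart.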

> Unexpected containers transition from RUNNING to DESTROYING during recovery
> ---
>
> Key: MESOS-9174
> URL: https://issues.apache.org/jira/browse/MESOS-9174
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.5.0, 1.6.1
>Reporter: Stephan Erb
>Priority: Major
> Attachments: mesos-agent.log, mesos-executor-stderr.log
>
>
> I am trying to hunt down a weird issue where sometimes restarting a Mesos 
> agent takes down all Mesos containers. The containers die without an apparent 
> cause:
> {code}
> I0821 13:35:01.486346 61392 linux_launcher.cpp:360] Recovered container 
> 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.627367 61362 provisioner.cpp:451] Recovered container 
> 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.701448 61375 containerizer.cpp:2835] Container 
> 02da7be0-271e-449f-9554-dc776adb29a9 has exited
> I0821 13:35:03.701453 61375 containerizer.cpp:2382] Destroying container 
> 02da7be0-271e-449f-9554-dc776adb29a9 in RUNNING state
> I0821 13:35:03.701457 61375 containerizer.cpp:2996] Transitioning the state 
> of container 02da7be0-271e-449f-9554-dc776adb29a9 from RUNNING to DESTROYING
> {code}
> From the perspective of the executor, there is nothing relevant in the logs. 
> Everything just stops, as if the container were terminated externally 
> without the executor being notified first. For further details, please see 
> the attached agent log and one (example) executor log file.
> I am aware that this is a long shot, but does anyone have an idea of what I 
> should be looking at to narrow down the issue?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-9179) ./support/python3/mesos-gtest-runner.py --help crashes

2018-08-23 Thread Armand Grillet (JIRA)
Armand Grillet created MESOS-9179:
-

 Summary: ./support/python3/mesos-gtest-runner.py --help crashes
 Key: MESOS-9179
 URL: https://issues.apache.org/jira/browse/MESOS-9179
 Project: Mesos
  Issue Type: Bug
Reporter: Armand Grillet
Assignee: Armand Grillet


{noformat}
$ ./support/python3/mesos-gtest-runner.py --help
Traceback (most recent call last):
  File "./support/python3/mesos-gtest-runner.py", line 196, in 
EXECUTABLE, OPTIONS = parse_arguments()
  File "./support/python3/mesos-gtest-runner.py", line 108, in parse_arguments
.format(default_=DEFAULT_NUM_JOBS))
  File "/usr/lib64/python3.5/argparse.py", line 1335, in add_argument
raise ValueError('%r is not callable' % (type_func,))
ValueError: 'int' is not callable
{noformat}
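
Judging from the traceback, the likely culprit is an optparse-style string
being passed as {{type}}; argparse requires a callable and rejects anything
else at {{add_argument()}} time. A minimal reproduction sketch (the option
name below is made up for illustration):

{code}
import argparse

parser = argparse.ArgumentParser()

try:
    # Passing the *string* 'int' (optparse style) is rejected immediately.
    parser.add_argument('--jobs', type='int', default=1)
except ValueError as e:
    print(e)  # 'int' is not callable

# Passing the builtin callable works as intended.
parser.add_argument('--jobs', type=int, default=1)
print(parser.parse_args(['--jobs', '8']).jobs)  # 8
{code}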



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)