[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591106#comment-16591106 ]

Qian Zhang commented on MESOS-8568:
-----------------------------------

RR: https://reviews.apache.org/r/68495/

> Command checks should always call `WAIT_NESTED_CONTAINER` before
> `REMOVE_NESTED_CONTAINER`
> ----------------------------------------------------------------
>
>                 Key: MESOS-8568
>                 URL: https://issues.apache.org/jira/browse/MESOS-8568
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Andrei Budnik
>            Assignee: Qian Zhang
>            Priority: Blocker
>              Labels: default-executor, health-check, mesosphere
>
> After a successful launch of a nested container via
> `LAUNCH_NESTED_CONTAINER_SESSION`, the checker library calls
> [waitNestedContainer|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
> for the container. The checker library
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
> `REMOVE_NESTED_CONTAINER` to remove the previous nested container before
> launching a nested container for a subsequent check. Hence, the
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that
> the nested container has terminated and can be removed/cleaned up.
> In case of a launch failure, the library [doesn't
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
> `WAIT_NESTED_CONTAINER`. Despite the failure, the container might still have
> been launched, and the subsequent attempt to remove it without calling
> `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal
> Server Error' (Nested container has not terminated yet) while removing the
> nested container
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
> used for the COMMAND check for task
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before
> `REMOVE_NESTED_CONTAINER`.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
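The call ordering the issue asks for can be sketched as follows. This is an illustrative Python model, not Mesos source; `launch_nested_container_session`, `wait_nested_container`, and `remove_nested_container` are hypothetical stand-ins for the agent API calls of the same names.

```python
# Sketch of the fix described above: make the wait unconditional, so
# removal only ever happens after the nested container terminates.
# All names are hypothetical stand-ins for the agent API calls.

def run_check(agent, container_id):
    try:
        agent.launch_nested_container_session(container_id)
    except RuntimeError:
        # Even if the launch "failed", the container may still have
        # been created, so we must NOT skip the wait below.
        pass

    # Always wait before removing: the wait returns only once the
    # container has terminated, so the subsequent remove cannot hit
    # "Nested container has not terminated yet".
    agent.wait_nested_container(container_id)
    agent.remove_nested_container(container_id)
```

With this ordering the '500 Internal Server Error (Nested container has not terminated yet)' path is avoided even when the launch fails.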
[jira] [Comment Edited] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591103#comment-16591103 ]

Qian Zhang edited comment on MESOS-8568 at 8/24/18 3:20 AM:
------------------------------------------------------------

[~vinodkone] Yeah, I noticed that as well. When the I/O switchboard server process is launched, it just [waits on a promise|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L1181], and that promise will only be set when an `ATTACH_CONTAINER_OUTPUT` call is made or a `SIGTERM` is sent. In this case, `ATTACH_CONTAINER_OUTPUT` will never be made since the check container failed to launch, so we have to wait 5s for the `SIGTERM`. I am not quite sure which cases the [SIGTERM & 5s timeout|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L810:L818] is for; maybe [~jieyu] and [~klueska] have more info?

was (Author: qianzhang):

[~vinodkone] Yeah, I noticed that as well. When the I/O switchboard server process is launched, it just [waits on a promise|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L1181], and that promise will only be set when an `ATTACH_CONTAINER_OUTPUT` call is made or a `SIGTERM` is sent. In this case, `ATTACH_CONTAINER_OUTPUT` will never be made since the check container failed to launch, so we have to wait 5s for the `SIGTERM`. I am not quite sure which cases the [5s timeout|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L810:L818] is for; maybe [~jieyu] and [~klueska] have more info?
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591103#comment-16591103 ]

Qian Zhang commented on MESOS-8568:
-----------------------------------

[~vinodkone] Yeah, I noticed that as well. When the I/O switchboard server process is launched, it just [waits on a promise|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L1181], and that promise will only be set when an `ATTACH_CONTAINER_OUTPUT` call is made or a `SIGTERM` is sent. In this case, `ATTACH_CONTAINER_OUTPUT` will never be made since the check container failed to launch, so we have to wait 5s for the `SIGTERM`. I am not quite sure which cases the [5s timeout|https://github.com/apache/mesos/blob/1.6.1/src/slave/containerizer/mesos/io/switchboard.cpp#L810:L818] is for; maybe [~jieyu] and [~klueska] have more info?
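The switchboard behavior described in the comment can be modeled roughly like this. It is an illustrative Python sketch using a `threading.Event` in place of the libprocess promise; all names are hypothetical, not the switchboard's actual interface.

```python
# Rough model of the I/O switchboard shutdown described above: the
# server blocks on a promise that is fulfilled either by an
# ATTACH_CONTAINER_OUTPUT request or by a SIGTERM. If the check
# container fails to launch, no attach ever arrives, so the server
# sits idle until the (delayed) SIGTERM lands.

import threading

class SwitchboardServer:
    def __init__(self):
        self._done = threading.Event()  # stands in for the promise

    def attach_container_output(self):
        self._done.set()  # an attach fulfills the promise immediately

    def on_sigterm(self):
        self._done.set()  # the SIGTERM fallback also fulfills it

    def wait_for_shutdown(self, timeout=None):
        # Returns True once the promise has been fulfilled.
        return self._done.wait(timeout)
```

In the failed-launch case only the `on_sigterm` path ever runs, so cleanup takes the full grace period before the signal is sent (5s in the scenario above).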
[jira] [Created] (MESOS-9181) Fix the comment in JNI libraries regarding weak reference and GC
Vinod Kone created MESOS-9181:
---------------------------------

             Summary: Fix the comment in JNI libraries regarding weak reference and GC
                 Key: MESOS-9181
                 URL: https://issues.apache.org/jira/browse/MESOS-9181
             Project: Mesos
          Issue Type: Documentation
            Reporter: Vinod Kone

Our JNI libraries for MesosSchedulerDriver, v0Mesos and v1Mesos all use weak global references to the underlying Java objects, but they incorrectly state that this will prevent the JVM from GC'ing them. We need to fix these comments.

e.g., [https://github.com/apache/mesos/blob/master/src/java/jni/org_apache_mesos_v1_scheduler_V1Mesos.cpp#L213]

See the JNI spec for details: [https://docs.oracle.com/javase/7/docs/technotes/guides/jni/spec/functions.html#weak]
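For illustration, the semantics being corrected here — a weak reference does not keep its referent alive — can be demonstrated with Python's `weakref` module as an analogue of JNI's `NewWeakGlobalRef`. This is not Mesos or JNI code, just a sketch of the same concept.

```python
# A weak reference does NOT prevent its referent from being
# collected; it is merely cleared when the referent goes away.
# (Python's weakref is used here as an analogue of a JNI weak
# global reference.)

import gc
import weakref

class Driver:
    """Stands in for the Java object the JNI library references."""

obj = Driver()
wref = weakref.ref(obj)
assert wref() is obj   # alive while a strong reference exists

del obj                # drop the last strong reference
gc.collect()
assert wref() is None  # the weak reference did not keep it alive
```

In JNI terms: a weak global reference stays valid only as long as something else keeps the Java object strongly reachable, which is exactly why the comments in question need fixing.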
[jira] [Comment Edited] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks
[ https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16589312#comment-16589312 ]

Andrei Budnik edited comment on MESOS-9131 at 8/23/18 6:17 PM:
---------------------------------------------------------------

[Test draft implementation|https://github.com/abudnik/mesos/commit/cf6e8cbc9aff4cdd350c1f13a2a37a3b5bce656e]
[Fix draft implementation|https://github.com/abudnik/mesos/commit/65690c8674902cb3ca55a8dddb4e370447856b0f]

was (Author: abudnik):

[Test draft implementation|https://github.com/abudnik/mesos/commit/cf6e8cbc9aff4cdd350c1f13a2a37a3b5bce656e]
[Fix draft implementation|https://github.com/abudnik/mesos/commit/a7b6a7d23e4a190e2d3215c02094c03a7cf72d3a]

> Health checks launching nested containers while a container is being
> destroyed lead to unkillable tasks
> --------------------------------------------------------------------
>
>                 Key: MESOS-9131
>                 URL: https://issues.apache.org/jira/browse/MESOS-9131
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization
>    Affects Versions: 1.5.1
>            Reporter: Jan Schlicht
>            Assignee: Qian Zhang
>            Priority: Blocker
>              Labels: container-stuck
>
> A container might get stuck in {{DESTROYING}} state if there's a command
> health check that starts new nested containers while its parent container is
> getting destroyed.
> Here are some logs with unrelated lines removed. The
> `REMOVE_NESTED_CONTAINER`/`LAUNCH_NESTED_CONTAINER_SESSION` calls keep looping
> afterwards.
> {noformat}
> 2018-04-16 12:37:54: I0416 12:37:54.235877 3863 containerizer.cpp:2807]
> Container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 has
> exited
> 2018-04-16 12:37:54: I0416 12:37:54.235914 3863 containerizer.cpp:2354]
> Destroying container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 in
> RUNNING state
> 2018-04-16 12:37:54: I0416 12:37:54.235932 3863 containerizer.cpp:2968]
> Transitioning the state of container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133
> from RUNNING to DESTROYING
> 2018-04-16 12:37:54: I0416 12:37:54.236100 3852 linux_launcher.cpp:514]
> Asked to destroy container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.237671 3852 linux_launcher.cpp:560]
> Using freezer to destroy cgroup
> mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.240327 3852 cgroups.cpp:3060] Freezing
> cgroup
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.244179 3852 cgroups.cpp:1415]
> Successfully froze cgroup
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> after 3.814144ms
> 2018-04-16 12:37:54: I0416 12:37:54.250550 3853 cgroups.cpp:3078] Thawing
> cgroup
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.256599 3853 cgroups.cpp:1444]
> Successfully thawed cgroup
> /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> after 5.977856ms
> ...
> 2018-04-16 12:37:54: I0416 12:37:54.371117 3837 http.cpp:3502] Processing
> LAUNCH_NESTED_CONTAINER_SESSION call for container
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd'
> 2018-04-16 12:37:54: W0416 12:37:54.371692 3842 http.cpp:2758] Failed to
> launch container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd:
> Parent container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is
> in 'DESTROYING' state
> 2018-04-16 12:37:54: W0416 12:37:54.371826 3840 containerizer.cpp:2337]
> Attempted to destroy unknown container
> db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd
> ...
> 2018-04-16 12:37:55: I0416 12:37:55.504456 3856 http.cpp:3078] Processing
> REMOVE_NESTED_CONTAINER call for container
> 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-f3a1238c-7f0f-4db3-bda4-c0ea951d46b6'
> ...
> 2018-04-16 12:37:55: I0416 12:37:55.556367 3857 http.cpp:3502] Processing
> LAUNCH_NESTED_CONTAINER_SESSION call for container
>
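The failing interleaving in the logs above can be reduced to a toy model. This is an illustrative Python sketch with hypothetical names, not Mesos source: once the parent enters `DESTROYING`, every `LAUNCH_NESTED_CONTAINER_SESSION` is rejected, and a checker that blindly keeps issuing remove/launch pairs loops forever.

```python
# Toy model of the race in the logs above. All names illustrative.

class ParentContainer:
    def __init__(self):
        self.state = "RUNNING"

    def destroy(self):
        # Transition mirroring "RUNNING to DESTROYING" in the log.
        self.state = "DESTROYING"

    def launch_nested(self, child_id):
        # Mirrors the agent rejecting the launch while the parent
        # is being torn down.
        if self.state == "DESTROYING":
            raise RuntimeError(
                f"Parent container is in '{self.state}' state")
        return child_id
```

A checker that retries `launch_nested` without observing the parent's terminal state would see this rejection on every iteration, matching the repeating `REMOVE_NESTED_CONTAINER`/`LAUNCH_NESTED_CONTAINER_SESSION` pattern above.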
[jira] [Commented] (MESOS-9177) Mesos master segfaults when responding to /state requests.
[ https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590554#comment-16590554 ]

Benno Evers commented on MESOS-9177:
------------------------------------

https://reviews.apache.org/r/68484/

> Mesos master segfaults when responding to /state requests.
> ----------------------------------------------------------
>
>                 Key: MESOS-9177
>                 URL: https://issues.apache.org/jira/browse/MESOS-9177
>             Project: Mesos
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.0
>            Reporter: Alexander Rukletsov
>            Assignee: Benno Evers
>            Priority: Blocker
>              Labels: mesosphere
>
> {noformat}
> *** SIGSEGV (@0x8) received by PID 66991 (TID 0x7f36792b7700) from PID 8;
> stack trace: ***
> @ 0x7f367e7226d0 (unknown)
> @ 0x7f3681266913
> _ZZNK5mesos8internal6master19FullFrameworkWriterclEPN4JSON12ObjectWriterEENKUlPNS3_11ArrayWriterEE1_clES7_
> @ 0x7f3681266af0
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZNK5mesos8internal6master19FullFrameworkWriterclEPNSA_12ObjectWriterEEUlPNSA_11ArrayWriterEE1_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
> @ 0x7f36812882d0
> mesos::internal::master::FullFrameworkWriter::operator()()
> @ 0x7f36812889d0
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIN5mesos8internal6master19FullFrameworkWriterEvEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
> @ 0x7f368121aef0
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_ENKUlPNSA_12ObjectWriterEE_clESU_EUlPNSA_11ArrayWriterEE3_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
> @ 0x7f3681241be3
> _ZZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNS4_5OwnedINS_15ObjectApprovers_clES8_SD_ENKUlPN4JSON12ObjectWriterEE_clESH_
> @ 0x7f3681242760
> _ZNSt17_Function_handlerIFvPN9rapidjson6WriterINS0_19GenericStringBufferINS0_4UTF8IcEENS0_12CrtAllocatorEEES4_S4_S5_Lj0ZN4JSON8internal7jsonifyIZZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvENKUlRKN7process4http7RequestERKNSI_5OwnedINSD_15ObjectApprovers_clESM_SR_EUlPNSA_12ObjectWriterEE_vEESt8functionIS9_ERKT_NSB_6PreferEEUlS8_E_E9_M_invokeERKSt9_Any_dataS8_
> @ 0x7f36810a41bb _ZNO4JSON5ProxycvSsEv
> @ 0x7f368215f60e process::http::OK::OK()
> @ 0x7f3681219061
> _ZN7process20AsyncExecutorProcess7executeIZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS_4http7RequestERKNS_5OwnedINS2_15ObjectApprovers_S8_SD_Li0EEENSt9result_ofIFT_T0_T1_EE4typeERKSI_SJ_SK_
> @ 0x7f36812212c0
> _ZZN7process8dispatchINS_4http8ResponseENS_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNS1_7RequestERKNS_5OwnedINS4_15ObjectApprovers_S9_SE_SJ_RS9_RSE_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSQ_FSN_T1_T2_T3_EOT4_OT5_OT6_ENKUlSt10unique_ptrINS_7PromiseIS2_EESt14default_deleteIS17_EEOSH_OS9_OSE_PNS_11ProcessBaseEE_clES1A_S1B_S1C_S1D_S1F_
> @ 0x7f36812215ac
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchINS1_4http8ResponseENS1_20AsyncExecutorProcessERKZN5mesos8internal6master6Master4Http25processStateRequestsBatchEvEUlRKNSA_7RequestERKNS1_5OwnedINSD_15ObjectApprovers_SI_SN_SS_RSI_RSN_EENS1_6FutureIT_EERKNS1_3PIDIT0_EEMSZ_FSW_T1_T2_T3_EOT4_OT5_OT6_EUlSt10unique_ptrINS1_7PromiseISB_EESt14default_deleteIS1G_EEOSQ_OSI_OSN_S3_E_IS1J_SQ_SI_SN_St12_PlaceholderILi1EEclEOS3_
> @ 0x7f36821f3541 process::ProcessBase::consume()
> @ 0x7f3682209fbc process::ProcessManager::resume()
> @ 0x7f368220fa76
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> @ 0x7f367eefc2b0 (unknown)
> @ 0x7f367e71ae25 start_thread
> @ 0x7f367e444bad __clone
> {noformat}
[jira] [Commented] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`
[ https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590513#comment-16590513 ]

Vinod Kone commented on MESOS-8568:
-----------------------------------

Great repro! One orthogonal question though: it seems unfortunate that the IOSwitchboard takes 5s to complete its cleanup for a container that has failed to launch. IIRC, there was a 5s timeout in the IOSwitchboard for some unexpected corner cases, which is what we seem to be hitting here, but this is an *expected* case in some sense. Is there any way we can speed that up?
[jira] [Comment Edited] (MESOS-9177) Mesos master segfaults when responding to /state requests.
[ https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590457#comment-16590457 ]

Benno Evers edited comment on MESOS-9177 at 8/23/18 4:24 PM:
-------------------------------------------------------------

I'm now able to reliably reproduce the segfault on mesos binaries built against boost 1.53 using the following script:

{code}
#!/bin/bash

WORKDIR1=`mktemp`
WORKDIR2=`mktemp`
MESOS_BINDIR=/home/bevers/src/mesos/worktrees/master/build/src/

trap "exit" INT TERM
trap 'kill $(jobs -p)' EXIT

rm -rf $WORKDIR1 $WORKDIR2

$MESOS_BINDIR/mesos-master --work_dir=$WORKDIR1 --ip=127.0.0.1 --port=2323 &
sleep 1

$MESOS_BINDIR/mesos-agent --work_dir=$WORKDIR2 --master=127.0.0.1:2323 --no-systemd_enable_support --launcher_dir=$MESOS_BINDIR &
sleep 1

$MESOS_BINDIR/mesos-execute --name="silly_task" --master=127.0.0.1:2323 --command="echo hello"
sleep 1

for i in `seq 1 80`; do wget 127.0.0.1:2323/state & done
sleep 1
{code}

was (Author: bennoe):

I'm now able to reliably reproduce the segfault on mesos binaries built against boost 1.53 using the following script:

{code}
#!/bin/bash

WORKDIR1=`mktemp`
WORKDIR2=`mktemp`
MESOS_BINDIR=/home/bevers/src/mesos/worktrees/master/build/src/

trap "exit" INT TERM
trap 'kill $(jobs -p)' EXIT

rm -rf $WORKDIR1 $WORKDIR2

$MESOS_BINDIR/mesos-master --work_dir=$WORKDIR1 --ip=127.0.0.1 --port=2323 --no-authenticate_frameworks --no-authenticate_http_frameworks --no-authenticate_agents --authorizers="local" &
sleep 1

$MESOS_BINDIR/mesos-agent --work_dir=$WORKDIR2 --master=127.0.0.1:2323 --no-systemd_enable_support --launcher_dir=$MESOS_BINDIR &
sleep 1

$MESOS_BINDIR/mesos-execute --name="silly_task" --master=127.0.0.1:2323 --command="echo hello"
sleep 1

for i in `seq 1 80`; do wget 127.0.0.1:2323/state & done
sleep 1
{code}
[jira] [Commented] (MESOS-9177) Mesos master segfaults when responding to /state requests.
[ https://issues.apache.org/jira/browse/MESOS-9177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16590457#comment-16590457 ]

Benno Evers commented on MESOS-9177:
------------------------------------

I'm now able to reliably reproduce the segfault on mesos binaries built against boost 1.53 using the following script:

{code}
#!/bin/bash

WORKDIR1=`mktemp`
WORKDIR2=`mktemp`
MESOS_BINDIR=/home/bevers/src/mesos/worktrees/master/build/src/

trap "exit" INT TERM
trap 'kill $(jobs -p)' EXIT

rm -rf $WORKDIR1 $WORKDIR2

$MESOS_BINDIR/mesos-master --work_dir=$WORKDIR1 --ip=127.0.0.1 --port=2323 --no-authenticate_frameworks --no-authenticate_http_frameworks --no-authenticate_agents --authorizers="local" &
sleep 1

$MESOS_BINDIR/mesos-agent --work_dir=$WORKDIR2 --master=127.0.0.1:2323 --no-systemd_enable_support --launcher_dir=$MESOS_BINDIR &
sleep 1

$MESOS_BINDIR/mesos-execute --name="silly_task" --master=127.0.0.1:2323 --command="echo hello"
sleep 1

for i in `seq 1 80`; do wget 127.0.0.1:2323/state & done
sleep 1
{code}
[jira] [Commented] (MESOS-9180) tasks get stuck in TASK_KILLING on the default executor
[ https://issues.apache.org/jira/browse/MESOS-9180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16590356#comment-16590356 ] Kirill Plyashkevich commented on MESOS-9180: somewhat related to MESOS-8679, but in this case killing is actually being retried.
[jira] [Created] (MESOS-9180) tasks get stuck in TASK_KILLING on the default executor
Kirill Plyashkevich created MESOS-9180:
--
Summary: tasks get stuck in TASK_KILLING on the default executor
Key: MESOS-9180
URL: https://issues.apache.org/jira/browse/MESOS-9180
Project: Mesos
Issue Type: Bug
Components: executor
Affects Versions: 1.6.1
Environment: Ubuntu 18.04, Ubuntu 16.04
Reporter: Kirill Plyashkevich

During our load tests, tasks get stuck in the TASK_KILLING state:
{quote}{noformat}
I0823 16:30:20.367563 21608 executor.cpp:192] Version: 1.6.1
I0823 16:30:20.439478 21684 default_executor.cpp:202] Received SUBSCRIBED event
I0823 16:30:20.441012 21684 default_executor.cpp:206] Subscribed executor on XX.XXX.XX.XXX
I0823 16:30:20.916216 21665 default_executor.cpp:202] Received LAUNCH_GROUP event
I0823 16:30:20.917373 21645 default_executor.cpp:426] Setting 'MESOS_CONTAINER_IP' to: 172.26.10.222
I0823 16:30:22.573794 21658 default_executor.cpp:202] Received ACKNOWLEDGED event
I0823 16:30:22.575518 21637 default_executor.cpp:202] Received ACKNOWLEDGED event
I0823 16:30:22.577137 21665 default_executor.cpp:202] Received ACKNOWLEDGED event
I0823 16:30:33.091509 21642 default_executor.cpp:661] Finished launching tasks [ test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka, test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis, test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery ] in child containers [ 3680beff-96d2-4ebd-832c-9cbbddf8c507.8e04f74f-cb8b-46b9-8758-340455a844c8, 3680beff-96d2-4ebd-832c-9cbbddf8c507.fc60bf0f-5814-4ea9-a37f-89ebe3e2f5f7, 3680beff-96d2-4ebd-832c-9cbbddf8c507.ab481072-c8ab-4a76-be8b-7f4431220e7b ]
I0823 16:30:33.091567 21642 default_executor.cpp:685] Waiting on child containers of tasks [ test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka, test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis, test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery ]
I0823 16:30:33.096014 21647 default_executor.cpp:746] Waiting for child container 3680beff-96d2-4ebd-832c-9cbbddf8c507.8e04f74f-cb8b-46b9-8758-340455a844c8 of task 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka'
I0823 16:30:33.096310 21647 default_executor.cpp:746] Waiting for child container 3680beff-96d2-4ebd-832c-9cbbddf8c507.fc60bf0f-5814-4ea9-a37f-89ebe3e2f5f7 of task 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis'
I0823 16:30:33.096470 21647 default_executor.cpp:746] Waiting for child container 3680beff-96d2-4ebd-832c-9cbbddf8c507.ab481072-c8ab-4a76-be8b-7f4431220e7b of task 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery'
I0823 16:30:33.521510 21648 default_executor.cpp:202] Received ACKNOWLEDGED event
I0823 16:30:33.522073 21652 default_executor.cpp:202] Received ACKNOWLEDGED event
I0823 16:30:33.523569 21679 default_executor.cpp:202] Received ACKNOWLEDGED event
I0823 16:30:38.593736 21668 checker_process.cpp:814] Output of the COMMAND health check for task 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis' (stdout):
0
PONG
I0823 16:30:38.593777 21668 checker_process.cpp:817] Output of the COMMAND health check for task 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis' (stderr):
I0823 16:30:38.610167 21650 checker_process.cpp:814] Output of the COMMAND health check for task 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka' (stdout):
I0823 16:30:38.610194 21650 checker_process.cpp:817] Output of the COMMAND health check for task 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.akka' (stderr):
I0823 16:30:38.700561 21681 checker_process.cpp:814] Output of the COMMAND health check for task 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery' (stdout):
I0823 16:30:38.700598 21681 checker_process.cpp:817] Output of the COMMAND health check for task 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.delivery' (stderr):
I0823 16:30:42.786908 21649 checker_process.cpp:971] COMMAND health check for task 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis' returned: 0
I0823 16:30:42.787267 21649 default_executor.cpp:1375] Received task health update for task 'test_cb88dd0c-a6e0-11e8-888f-fb74b926ae8c.instance-08d37bd7-a6e1-11e8-9e12-0242e3789894.redis', task is healthy
I0823 16:30:45.156363 21658 default_executor.cpp:202] Received
[jira] [Commented] (MESOS-9174) Unexpected containers transition from RUNNING to DESTROYING during recovery
[ https://issues.apache.org/jira/browse/MESOS-9174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16589939#comment-16589939 ] Stephan Erb commented on MESOS-9174:

I have run a few more experiments:

*Broken setup*: containers get terminated on agent restarts
* Default setup with new systemd:
** systemd 237-3~bpo9+1 with options {{Delegate=true}} and {{KillMode=control-group}}
** Mesos 1.6.1 with option {{--systemd_enable_support}}

*Working setups*: containers survive agent restarts
* Default setup with old systemd:
** systemd 232-25+deb9u4 with options {{Delegate=true}} and {{KillMode=control-group}}
** Mesos 1.6.1 with option {{--systemd_enable_support}}
* New systemd with disabled cgroup interference:
** systemd 237-3~bpo9+1 with options {{Delegate=true}} and {{KillMode=process}}
** Mesos 1.6.1 with option {{--no-systemd_enable_support}}

For now, we will keep running the older systemd version across our fleet as a workaround.

> Unexpected containers transition from RUNNING to DESTROYING during recovery
> ---
>
> Key: MESOS-9174
> URL: https://issues.apache.org/jira/browse/MESOS-9174
> Project: Mesos
> Issue Type: Bug
> Components: containerization
> Affects Versions: 1.5.0, 1.6.1
> Reporter: Stephan Erb
> Priority: Major
> Attachments: mesos-agent.log, mesos-executor-stderr.log
>
> I am trying to hunt down a weird issue where sometimes restarting a Mesos agent takes down all Mesos containers. The containers die without an apparent cause:
> {code}
> I0821 13:35:01.486346 61392 linux_launcher.cpp:360] Recovered container 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.627367 61362 provisioner.cpp:451] Recovered container 02da7be0-271e-449f-9554-dc776adb29a9
> I0821 13:35:03.701448 61375 containerizer.cpp:2835] Container 02da7be0-271e-449f-9554-dc776adb29a9 has exited
> I0821 13:35:03.701453 61375 containerizer.cpp:2382] Destroying container 02da7be0-271e-449f-9554-dc776adb29a9 in RUNNING state
> I0821 13:35:03.701457 61375 containerizer.cpp:2996] Transitioning the state of container 02da7be0-271e-449f-9554-dc776adb29a9 from RUNNING to DESTROYING
> {code}
> From the perspective of the executor, there is nothing relevant in the logs. Everything just stops, as if the container were terminated externally without notifying the executor first. For further details, please see the attached agent log and one (example) executor log file.
> I am aware that this is a long shot, but does anyone have an idea what I should be looking at to narrow down the issue?
-- This message was sent by Atlassian JIRA (v7.6.3#76005)
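The working configuration described in the comment can be expressed as a systemd drop-in for the agent unit. This is a minimal sketch only, assuming the agent runs as a unit named {{mesos-slave.service}} (the unit name and file path are illustrative, not taken from the report):

```ini
# /etc/systemd/system/mesos-slave.service.d/override.conf (hypothetical path)
[Service]
# Delegate the agent's cgroup subtree to the agent itself, so systemd
# does not rearrange the container cgroups it finds there.
Delegate=true
# On stop/restart, kill only the agent's main process instead of every
# process in the control group, leaving task containers running.
KillMode=process
```

Per the comment, this variant was paired with the agent flag {{--no-systemd_enable_support}}; the broken combination was the newer systemd (237-3~bpo9+1) with {{KillMode=control-group}}, where a restart took the container cgroups down with the agent.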
[jira] [Created] (MESOS-9179) ./support/python3/mesos-gtest-runner.py --help crashes
Armand Grillet created MESOS-9179:
-
Summary: ./support/python3/mesos-gtest-runner.py --help crashes
Key: MESOS-9179
URL: https://issues.apache.org/jira/browse/MESOS-9179
Project: Mesos
Issue Type: Bug
Reporter: Armand Grillet
Assignee: Armand Grillet

{noformat}
$ ./support/python3/mesos-gtest-runner.py --help
Traceback (most recent call last):
  File "./support/python3/mesos-gtest-runner.py", line 196, in <module>
    EXECUTABLE, OPTIONS = parse_arguments()
  File "./support/python3/mesos-gtest-runner.py", line 108, in parse_arguments
    .format(default_=DEFAULT_NUM_JOBS))
  File "/usr/lib64/python3.5/argparse.py", line 1335, in add_argument
    raise ValueError('%r is not callable' % (type_func,))
ValueError: 'int' is not callable
{noformat}
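The {{ValueError: 'int' is not callable}} in this traceback is argparse's standard complaint when {{type}} is passed the *string* {{'int'}} rather than the builtin {{int}}: {{add_argument}} requires {{type}} to be callable. A minimal reproduction and fix, independent of the gtest runner itself (the {{--jobs}} option name is a hypothetical stand-in, not the script's actual flag):

```python
import argparse

# Broken: the string 'int' is not callable, so argparse raises
# ValueError at add_argument() time, matching the traceback above.
broken = argparse.ArgumentParser()
try:
    broken.add_argument('--jobs', type='int', default=1)
except ValueError as e:
    print(e)  # 'int' is not callable

# Fixed: pass the builtin int; argparse calls it on the raw string value.
fixed = argparse.ArgumentParser()
fixed.add_argument('--jobs', type=int, default=1,
                   help='number of parallel jobs (default: {default_})'
                        .format(default_=1))
args = fixed.parse_args(['--jobs', '4'])
print(args.jobs)  # 4
```

The check happens when the argument is added, not when it is parsed, which is why the script crashes even on `--help`.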