[jira] [Commented] (MESOS-7975) The command/default/docker executor can incorrectly send a TASK_FINISHED update even when the task is killed
[ https://issues.apache.org/jira/browse/MESOS-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16192306#comment-16192306 ] Qian Zhang commented on MESOS-7975: --- [~vinodkone] Sure, done. > The command/default/docker executor can incorrectly send a TASK_FINISHED > update even when the task is killed > > > Key: MESOS-7975 > URL: https://issues.apache.org/jira/browse/MESOS-7975 > Project: Mesos > Issue Type: Bug >Reporter: Anand Mazumdar >Assignee: Qian Zhang >Priority: Critical > Labels: mesosphere > > Currently, when a task is killed, the default/command/docker executor > incorrectly sends a {{TASK_FINISHED}} status update instead of > {{TASK_KILLED}}. This is due to an unfortunate ordering of the conditional > checks: the success check runs before the {{killed}} check, so a killed task > that exits with a zero status code is reported as finished. > {code} > if (WSUCCEEDED(status)) { > taskState = TASK_FINISHED; > } else if (killed) { > // Send TASK_KILLED if the task was killed as a result of > // kill() or shutdown(). > taskState = TASK_KILLED; > } else { > taskState = TASK_FAILED; > } > {code} > We should modify the code to correctly send {{TASK_KILLED}} status updates > when a task is killed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
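For clarity, here is a minimal self-contained sketch of the reordered check, using the standard {{WIFEXITED}}/{{WEXITSTATUS}} macros in place of stout's {{WSUCCEEDED}}; it illustrates the idea of the fix, not the actual Mesos patch:
{code}
#include <sys/wait.h>

enum TaskState { TASK_FINISHED, TASK_KILLED, TASK_FAILED };

// Check `killed` before the exit status: a task that exits with a
// zero status code after kill()/shutdown() must still be reported
// as TASK_KILLED, not TASK_FINISHED.
TaskState taskStateFor(int status, bool killed)
{
  if (killed) {
    return TASK_KILLED;
  } else if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
    return TASK_FINISHED;
  } else {
    return TASK_FAILED;
  }
}
{code}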
[jira] [Comment Edited] (MESOS-7975) The command/default/docker executor can incorrectly send a TASK_FINISHED update even when the task is killed
[ https://issues.apache.org/jira/browse/MESOS-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16166060#comment-16166060 ] Qian Zhang edited comment on MESOS-7975 at 10/5/17 12:44 AM: - RR: https://reviews.apache.org/r/62685/ https://reviews.apache.org/r/62326/ https://reviews.apache.org/r/62327/ https://reviews.apache.org/r/62774/ https://reviews.apache.org/r/62775/ was (Author: qianzhang): RR: https://reviews.apache.org/r/62685/ https://reviews.apache.org/r/62326/ https://reviews.apache.org/r/62327/ > The command/default/docker executor can incorrectly send a TASK_FINISHED > update even when the task is killed > > > Key: MESOS-7975 > URL: https://issues.apache.org/jira/browse/MESOS-7975 > Project: Mesos > Issue Type: Bug >Reporter: Anand Mazumdar >Assignee: Qian Zhang >Priority: Critical > Labels: mesosphere > > Currently, when a task is killed, the default/command/docker executor > incorrectly sends a {{TASK_FINISHED}} status update instead of > {{TASK_KILLED}}. This is due to an unfortunate ordering of the conditional > checks: the success check runs before the {{killed}} check, so a killed task > that exits with a zero status code is reported as finished. > {code} > if (WSUCCEEDED(status)) { > taskState = TASK_FINISHED; > } else if (killed) { > // Send TASK_KILLED if the task was killed as a result of > // kill() or shutdown(). > taskState = TASK_KILLED; > } else { > taskState = TASK_FAILED; > } > {code} > We should modify the code to correctly send {{TASK_KILLED}} status updates > when a task is killed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7975) The command/default/docker executor can incorrectly send a TASK_FINISHED update even when the task is killed
[ https://issues.apache.org/jira/browse/MESOS-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Qian Zhang updated MESOS-7975: -- Story Points: 3 > The command/default/docker executor can incorrectly send a TASK_FINISHED > update even when the task is killed > > > Key: MESOS-7975 > URL: https://issues.apache.org/jira/browse/MESOS-7975 > Project: Mesos > Issue Type: Bug >Reporter: Anand Mazumdar >Assignee: Qian Zhang >Priority: Critical > Labels: mesosphere > > Currently, when a task is killed, the default/command/docker executor > incorrectly sends a {{TASK_FINISHED}} status update instead of > {{TASK_KILLED}}. This is due to an unfortunate ordering of the conditional > checks: the success check runs before the {{killed}} check, so a killed task > that exits with a zero status code is reported as finished. > {code} > if (WSUCCEEDED(status)) { > taskState = TASK_FINISHED; > } else if (killed) { > // Send TASK_KILLED if the task was killed as a result of > // kill() or shutdown(). > taskState = TASK_KILLED; > } else { > taskState = TASK_FAILED; > } > {code} > We should modify the code to correctly send {{TASK_KILLED}} status updates > when a task is killed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8052) "protoc" not found when running "make -j4 check" directly in stout
[ https://issues.apache.org/jira/browse/MESOS-8052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-8052: --- Shepherd: Benjamin Bannier > "protoc" not found when running "make -j4 check" directly in stout > -- > > Key: MESOS-8052 > URL: https://issues.apache.org/jira/browse/MESOS-8052 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao > Labels: compile-error > Fix For: 1.4.1 > > > If we run {{make -j4 check}} without running {{make}} first, we will get the > following error message: > {noformat} > 3rdparty/protobuf-3.3.0/src/protoc -I../tests --cpp_out=. > ../tests/protobuf_tests.proto > /bin/bash: 3rdparty/protobuf-3.3.0/src/protoc: No such file or directory > Makefile:1934: recipe for target 'protobuf_tests.pb.cc' failed > make: *** [protobuf_tests.pb.cc] Error 127 > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-8051) Killing TASK_GROUP fail to kill some tasks
[ https://issues.apache.org/jira/browse/MESOS-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-8051: - Assignee: Qian Zhang [~qianzhang] Can you look into this? > Killing TASK_GROUP fail to kill some tasks > -- > > Key: MESOS-8051 > URL: https://issues.apache.org/jira/browse/MESOS-8051 > Project: Mesos > Issue Type: Bug > Components: agent, executor >Affects Versions: 1.4.0 >Reporter: A. Dukhovniy >Assignee: Qian Zhang >Priority: Critical > Attachments: dcos-mesos-master.log.gz, dcos-mesos-slave.log.gz, > screenshot-1.png > > > When starting the following pod definition via marathon: > {code:java} > { > "id": "/simple-pod", > "scaling": { > "kind": "fixed", > "instances": 3 > }, > "environment": { > "PING": "PONG" > }, > "containers": [ > { > "name": "ct1", > "resources": { > "cpus": 0.1, > "mem": 32 > }, > "image": { > "kind": "MESOS", > "id": "busybox" > }, > "exec": { > "command": { > "shell": "while true; do echo the current time is $(date) > > ./test-v1/clock; sleep 1; done" > } > }, > "volumeMounts": [ > { > "name": "v1", > "mountPath": "test-v1" > } > ] > }, > { > "name": "ct2", > "resources": { > "cpus": 0.1, > "mem": 32 > }, > "exec": { > "command": { > "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep > 1; done" > } > }, > "volumeMounts": [ > { > "name": "v1", > "mountPath": "etc" > }, > { > "name": "v2", > "mountPath": "docker" > } > ] > } > ], > "networks": [ > { > "mode": "host" > } > ], > "volumes": [ > { > "name": "v1" > }, > { > "name": "v2", > "host": "/var/lib/docker" > } > ] > } > {code} > mesos will successfully kill all {{ct2}} containers but fail to kill all/some > of the {{ct1}} containers. I've attached both master and agent logs. The > interesting part starts after marathon issues 6 kills: > {code:java} > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.209966 4746 master.cpp:5297] Processing > KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853d > bf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) > at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210033 4746 master.cpp:5371] Telling > agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( > 10.0.1.207) to kill task > simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1 of framework > bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at > scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 > .229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210471 4748 master.cpp:5297] Processing > KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853d > bf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) > at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210518 4748 master.cpp:5371] Telling > agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( > 10.0.1.207) to kill task > simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2 of framework > bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at > scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 > .229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210602 4748 
master.cpp:5297] Processing > KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853d > bf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) > at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210639 4748 master.cpp:5371] Telling > agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( > 10.0.1.207) to kill task > simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework > bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at > scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 > .229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210932 4753 master.cpp:52
[jira] [Updated] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2
[ https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-7130: -- Story Points: 2 > port_mapping isolator: executor hangs when running on EC2 > - > > Key: MESOS-7130 > URL: https://issues.apache.org/jira/browse/MESOS-7130 > Project: Mesos > Issue Type: Bug > Components: executor >Reporter: Pierre Cheynier >Assignee: Jie Yu > > Hi, > I'm experiencing a weird issue: I'm using a CI to do testing on > infrastructure automation. > I recently activated the {{network/port_mapping}} isolator. > I'm able to make the changes work and pass the test for bare-metal servers > and virtualbox VMs using this configuration. > But when I try on EC2 (on which my CI pipeline relies) it systematically fails > to run any container. > It appears that the sandbox is created and the port_mapping isolator seems to > be OK according to the logs in stdout and stderr and the {{tc}} output: > {noformat} > + mount --make-rslave /run/netns > + test -f /proc/sys/net/ipv6/conf/all/disable_ipv6 > + echo 1 > + ip link set lo address 02:44:20:bb:42:cf mtu 9001 up > + ethtool -K eth0 rx off > (...) > + tc filter show dev eth0 parent :0 > + tc filter show dev lo parent :0 > I0215 16:01:13.941375 1 exec.cpp:161] Version: 1.0.2 > {noformat} > Then the executor never comes back to the REGISTERED state and hangs > indefinitely. {{GLOG_v=3}} doesn't help here. > My skills in this area are limited, but after loading the symbols and attaching > gdb to the mesos-executor process, I'm able to print this stack: > {noformat} > #0 0x7feffc1386d5 in pthread_cond_wait@@GLIBC_2.3.2 () from > /usr/lib64/libpthread.so.0 > #1 0x7feffbed69ec in > std::condition_variable::wait(std::unique_lock<std::mutex>&) () from > /usr/lib64/libstdc++.so.6 > #2 0x7ff0003dd8ec in void synchronized_wait<std::condition_variable, > std::mutex>(std::condition_variable*, std::mutex*) () from > /usr/lib64/libmesos-1.0.2.so > #3 0x7ff0017d595d in Gate::arrive(long) () from > /usr/lib64/libmesos-1.0.2.so > #4 0x7ff0017c00ed in process::ProcessManager::wait(process::UPID const&) > () from /usr/lib64/libmesos-1.0.2.so > #5 0x7ff0017c5c05 in process::wait(process::UPID const&, Duration > const&) () from /usr/lib64/libmesos-1.0.2.so > #6 0x004ab26f in process::wait(process::ProcessBase const*, Duration > const&) () > #7 0x004a3903 in main () > {noformat} > I concluded that the underlying shell script launched by the isolator, or the > task itself, is just blocked, but I don't understand why. > Here is a process tree showing that no task is running but the executor is: > {noformat} > root 28420 0.8 3.0 1061420 124940 ? 
Ssl 17:56 0:25 > /usr/sbin/mesos-slave --advertise_ip=127.0.0.1 > --attributes=platform:centos;platform_major_version:7;type:base > --cgroups_enable_cfs --cgroups_hierarchy=/sys/fs/cgroup > --cgroups_net_cls_primary_handle=0xC370 > --container_logger=org_apache_mesos_LogrotateContainerLogger > --containerizers=mesos,docker > --credential=file:///etc/mesos-chef/slave-credential > --default_container_info={"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]} > --default_role=default --docker_registry=/usr/share/mesos/users > --docker_store_dir=/var/opt/mesos/store/docker > --egress_unique_flow_per_container --enforce_container_disk_quota > --ephemeral_ports_per_container=128 > --executor_environment_variables={"PATH":"/bin:/usr/bin:/usr/sbin","CRITEO_DC":"par","CRITEO_ENV":"prod"} > --image_providers=docker --image_provisioner_backend=copy > --isolation=cgroups/cpu,cgroups/mem,cgroups/net_cls,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime,network/cni,network/port_mapping > --logging_level=INFO > --master=zk://mesos:test@localhost.localdomain:2181/mesos > --modules=file:///etc/mesos-chef/slave-modules.json --port=5051 > --recover=reconnect > --resources=ports:[31000-32000];ephemeral_ports:[32768-57344] --strict > --work_dir=/var/opt/mesos > root 28484 0.0 2.3 433676 95016 ?Ssl 17:56 0:00 \_ > mesos-logrotate-logger --help=false > --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/stdout > --logrotate_options=rotate 5 --logrotate_path=logrotate --max_size=10MB > root 28485 0.0 2.3 499212 94724 ?Ssl 17:56 0:00 \_ > mesos-logrotate-logger --help=false > --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/st
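The stack in the report above is the libprocess wait idiom: {{process::wait}} parks the main thread on a gate backed by a condition variable. A self-contained sketch of that pattern (names simplified; this is not the real libprocess source) shows why the process sits idle in {{pthread_cond_wait}} instead of burning CPU:
{code}
#include <condition_variable>
#include <mutex>

// Simplified gate: arrive() blocks until open() is called. If the
// event that should call open() never happens (e.g. the launched
// command never reaches the expected state), the waiter stays parked
// in pthread_cond_wait forever, which matches the observed hang.
struct Gate
{
  std::mutex mutex;
  std::condition_variable condition;
  bool opened = false;

  void arrive()
  {
    std::unique_lock<std::mutex> lock(mutex);
    condition.wait(lock, [this] { return opened; });
  }

  void open()
  {
    std::lock_guard<std::mutex> lock(mutex);
    opened = true;
    condition.notify_all();
  }
};
{code}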
[jira] [Updated] (MESOS-8052) "protoc" not found when running "make -j4 check" directly in stout
[ https://issues.apache.org/jira/browse/MESOS-8052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-8052: --- Description: If we run {{make -j4 check}} without running {{make}} first, we will get the following error message: {noformat} 3rdparty/protobuf-3.3.0/src/protoc -I../tests --cpp_out=. ../tests/protobuf_tests.proto /bin/bash: 3rdparty/protobuf-3.3.0/src/protoc: No such file or directory Makefile:1934: recipe for target 'protobuf_tests.pb.cc' failed make: *** [protobuf_tests.pb.cc] Error 127 {noformat} was: +underlined text+If we run {{make tests}} without running {{make}} first, {{tests/protobuf_tests.proto}} would not be compiled, and thus the generated files would be missing: {noformat} g++ -DPACKAGE_NAME=\"stout\" -DPACKAGE_TARNAME=\"stout\" -DPACKAGE_VERSION=\"0.1.0\" -DPACKAGE_STRING=\"stout\ 0.1.0\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"stout\" -DVERSION=\"0.1.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_PTHREAD_PRIO_INHERIT=1 -DHAVE_PTHREAD=1 -DHAVE_LIBDL=1 -DHAVE_FTS_H=1 -DHAVE_APR_POOLS_H=1 -DHAVE_LIBAPR_1=1 -DHAVE_SVN_DELTA_H=1 -DHAVE_LIBSVN_DELTA_1=1 -DHAVE_SVN_VERSION_H=1 -DHAVE_LIBSVN_SUBR_1=1 -DHAVE_CXX11=1 -I. -I.. -I../include -isystem 3rdparty/boost-1.53.0 -I3rdparty/elfio-3.2 -I3rdparty/glog-0.3.3/src -I3rdparty/googletest-release-1.8.0/googlemock/include -I3rdparty/googletest-release-1.8.0/googletest/include -DPICOJSON_USE_INT64 -D__STDC_FORMAT_MACROS -I3rdparty/picojson-1.3.0 -I3rdparty/protobuf-3.3.0/src -I/usr/include/subversion-1 -I/usr/include/apr-1 -I/usr/include/apr-1.0 -Wall -Wsign-compare -Wformat-security -fstack-protector-strong -fPIC -fPIE -g1 -O0 -Wno-unused-local-typedefs -std=c++11 -MT stout_tests-protobuf_tests.o -MD -MP -MF .deps/stout_tests-protobuf_tests.Tpo -c -o stout_tests-protobuf_tests.o `test -f 'tests/protobuf_tests.cpp' || echo '../'`tests/protobuf_tests.cpp ../tests/protobuf_tests.cpp:28:31: fatal error: protobuf_tests.pb.h: No such file or directory compilation terminated. Makefile:1278: recipe for target 'stout_tests-protobuf_tests.o' failed make[1]: *** [stout_tests-protobuf_tests.o] Error 1 {noformat} > "protoc" not found when running "make -j4 check" directly in stout > -- > > Key: MESOS-8052 > URL: https://issues.apache.org/jira/browse/MESOS-8052 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao > Labels: compile-error > Fix For: 1.4.1 > > > If we run {{make -j4 check}} without running {{make}} first, we will get the > following error message: > {noformat} > 3rdparty/protobuf-3.3.0/src/protoc -I../tests --cpp_out=. > ../tests/protobuf_tests.proto > /bin/bash: 3rdparty/protobuf-3.3.0/src/protoc: No such file or directory > Makefile:1934: recipe for target 'protobuf_tests.pb.cc' failed > make: *** [protobuf_tests.pb.cc] Error 127 > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8052) "protoc" not found when running "make -j4 check" directly in stout
[ https://issues.apache.org/jira/browse/MESOS-8052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-8052: --- Summary: "protoc" not found when running "make -j4 check" directly in stout (was: "protobuf_tests.pb.h" not found when running "make tests" directly in stout) > "protoc" not found when running "make -j4 check" directly in stout > -- > > Key: MESOS-8052 > URL: https://issues.apache.org/jira/browse/MESOS-8052 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao > Labels: compile-error > Fix For: 1.4.1 > > > +underlined text+If we run {{make tests}} without running {{make}} first, > {{tests/protobuf_tests.proto}} would not be compiled, and thus the generated > files would be missing: > {noformat} > g++ -DPACKAGE_NAME=\"stout\" -DPACKAGE_TARNAME=\"stout\" > -DPACKAGE_VERSION=\"0.1.0\" -DPACKAGE_STRING=\"stout\ 0.1.0\" > -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"stout\" > -DVERSION=\"0.1.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 > -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 > -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 > -DLT_OBJDIR=\".libs/\" -DHAVE_PTHREAD_PRIO_INHERIT=1 -DHAVE_PTHREAD=1 > -DHAVE_LIBDL=1 -DHAVE_FTS_H=1 -DHAVE_APR_POOLS_H=1 -DHAVE_LIBAPR_1=1 > -DHAVE_SVN_DELTA_H=1 -DHAVE_LIBSVN_DELTA_1=1 -DHAVE_SVN_VERSION_H=1 > -DHAVE_LIBSVN_SUBR_1=1 -DHAVE_CXX11=1 -I. -I.. -I../include -isystem > 3rdparty/boost-1.53.0 -I3rdparty/elfio-3.2 -I3rdparty/glog-0.3.3/src > -I3rdparty/googletest-release-1.8.0/googlemock/include > -I3rdparty/googletest-release-1.8.0/googletest/include -DPICOJSON_USE_INT64 > -D__STDC_FORMAT_MACROS -I3rdparty/picojson-1.3.0 > -I3rdparty/protobuf-3.3.0/src -I/usr/include/subversion-1 > -I/usr/include/apr-1 -I/usr/include/apr-1.0 -Wall -Wsign-compare > -Wformat-security -fstack-protector-strong -fPIC -fPIE -g1 -O0 > -Wno-unused-local-typedefs -std=c++11 -MT stout_tests-protobuf_tests.o -MD > -MP -MF .deps/stout_tests-protobuf_tests.Tpo -c -o > stout_tests-protobuf_tests.o `test -f 'tests/protobuf_tests.cpp' || echo > '../'`tests/protobuf_tests.cpp > ../tests/protobuf_tests.cpp:28:31: fatal error: protobuf_tests.pb.h: No such > file or directory > compilation terminated. > Makefile:1278: recipe for target 'stout_tests-protobuf_tests.o' failed > make[1]: *** [stout_tests-protobuf_tests.o] Error 1 > {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8052) "protobuf_tests.pb.h" not found when running "make tests" directly in stout
[ https://issues.apache.org/jira/browse/MESOS-8052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chun-Hung Hsiao updated MESOS-8052: --- Description: +underlined text+If we run {{make tests}} without running {{make}} first, {{tests/protobuf_tests.proto}} would not be compiled, and thus the generated files would be missing: {noformat} g++ -DPACKAGE_NAME=\"stout\" -DPACKAGE_TARNAME=\"stout\" -DPACKAGE_VERSION=\"0.1.0\" -DPACKAGE_STRING=\"stout\ 0.1.0\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"stout\" -DVERSION=\"0.1.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_PTHREAD_PRIO_INHERIT=1 -DHAVE_PTHREAD=1 -DHAVE_LIBDL=1 -DHAVE_FTS_H=1 -DHAVE_APR_POOLS_H=1 -DHAVE_LIBAPR_1=1 -DHAVE_SVN_DELTA_H=1 -DHAVE_LIBSVN_DELTA_1=1 -DHAVE_SVN_VERSION_H=1 -DHAVE_LIBSVN_SUBR_1=1 -DHAVE_CXX11=1 -I. -I.. -I../include -isystem 3rdparty/boost-1.53.0 -I3rdparty/elfio-3.2 -I3rdparty/glog-0.3.3/src -I3rdparty/googletest-release-1.8.0/googlemock/include -I3rdparty/googletest-release-1.8.0/googletest/include -DPICOJSON_USE_INT64 -D__STDC_FORMAT_MACROS -I3rdparty/picojson-1.3.0 -I3rdparty/protobuf-3.3.0/src -I/usr/include/subversion-1 -I/usr/include/apr-1 -I/usr/include/apr-1.0 -Wall -Wsign-compare -Wformat-security -fstack-protector-strong -fPIC -fPIE -g1 -O0 -Wno-unused-local-typedefs -std=c++11 -MT stout_tests-protobuf_tests.o -MD -MP -MF .deps/stout_tests-protobuf_tests.Tpo -c -o stout_tests-protobuf_tests.o `test -f 'tests/protobuf_tests.cpp' || echo '../'`tests/protobuf_tests.cpp ../tests/protobuf_tests.cpp:28:31: fatal error: protobuf_tests.pb.h: No such file or directory compilation terminated. Makefile:1278: recipe for target 'stout_tests-protobuf_tests.o' failed make[1]: *** [stout_tests-protobuf_tests.o] Error 1 {noformat} was: If we run {{make tests}} without running {{make}} first, {{tests/protobuf_tests.proto}} would not be compiled, and thus the generated files would be missing: {noformat} g++ -DPACKAGE_NAME=\"stout\" -DPACKAGE_TARNAME=\"stout\" -DPACKAGE_VERSION=\"0.1.0\" -DPACKAGE_STRING=\"stout\ 0.1.0\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"stout\" -DVERSION=\"0.1.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_PTHREAD_PRIO_INHERIT=1 -DHAVE_PTHREAD=1 -DHAVE_LIBDL=1 -DHAVE_FTS_H=1 -DHAVE_APR_POOLS_H=1 -DHAVE_LIBAPR_1=1 -DHAVE_SVN_DELTA_H=1 -DHAVE_LIBSVN_DELTA_1=1 -DHAVE_SVN_VERSION_H=1 -DHAVE_LIBSVN_SUBR_1=1 -DHAVE_CXX11=1 -I. -I.. 
-I../include -isystem 3rdparty/boost-1.53.0 -I3rdparty/elfio-3.2 -I3rdparty/glog-0.3.3/src -I3rdparty/googletest-release-1.8.0/googlemock/include -I3rdparty/googletest-release-1.8.0/googletest/include -DPICOJSON_USE_INT64 -D__STDC_FORMAT_MACROS -I3rdparty/picojson-1.3.0 -I3rdparty/protobuf-3.3.0/src -I/usr/include/subversion-1 -I/usr/include/apr-1 -I/usr/include/apr-1.0 -Wall -Wsign-compare -Wformat-security -fstack-protector-strong -fPIC -fPIE -g1 -O0 -Wno-unused-local-typedefs -std=c++11 -MT stout_tests-protobuf_tests.o -MD -MP -MF .deps/stout_tests-protobuf_tests.Tpo -c -o stout_tests-protobuf_tests.o `test -f 'tests/protobuf_tests.cpp' || echo '../'`tests/protobuf_tests.cpp ../tests/protobuf_tests.cpp:28:31: fatal error: protobuf_tests.pb.h: No such file or directory compilation terminated. Makefile:1278: recipe for target 'stout_tests-protobuf_tests.o' failed make[1]: *** [stout_tests-protobuf_tests.o] Error 1 {noformat} > "protobuf_tests.pb.h" not found when running "make tests" directly in stout > --- > > Key: MESOS-8052 > URL: https://issues.apache.org/jira/browse/MESOS-8052 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: Chun-Hung Hsiao >Assignee: Chun-Hung Hsiao > Labels: compile-error > Fix For: 1.4.1 > > > +underlined text+If we run {{make tests}} without running {{make}} first, > {{tests/protobuf_tests.proto}} would not be compiled, and thus the generated > files would be missing: > {noformat} > g++ -DPACKAGE_NAME=\"stout\" -DPACKAGE_TARNAME=\"stout\" > -DPACKAGE_VERSION=\"0.1.0\" -DPACKAGE_STRING=\"stout\ 0.1.0\" > -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"stout\" > -DVERSION=\"0.1.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 > -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 > -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 > -DLT_OBJDIR=\".libs/\" -DHAVE_PTHREAD_PRIO_INH
[jira] [Commented] (MESOS-6240) Allow executor/agent communication over non-TCP/IP stream socket.
[ https://issues.apache.org/jira/browse/MESOS-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191759#comment-16191759 ] Aaron Wood commented on MESOS-6240: --- +1 to what [~zhitao] said! > Allow executor/agent communication over non-TCP/IP stream socket. > - > > Key: MESOS-6240 > URL: https://issues.apache.org/jira/browse/MESOS-6240 > Project: Mesos > Issue Type: Improvement > Components: containerization > Environment: Linux and Windows >Reporter: Avinash Sridharan >Assignee: Benjamin Hindman >Priority: Critical > Labels: mesosphere > > Currently, the executor-agent communication happens specifically over TCP > sockets. This works fine in most cases, but specifically for the > `MesosContainerizer` when containers are running on CNI networks, this mode > of communication starts imposing constraints on the CNI network, since there > now has to be connectivity between the CNI network (on which the executor is > running) and the agent. Introducing paths from a CNI network to the > underlying agent, at best, creates headaches for operators and at worst > introduces serious security holes in the network, since it breaks the > isolation between the container CNI network and the host network (on which > the agent is running). > In order to simplify/strengthen deployment of Mesos containers on CNI > networks, we therefore need to move away from using TCP/IP sockets for > executor/agent communication. Since the executor and agent are guaranteed to > run on the same host, the above problems can be resolved if, for the > `MesosContainerizer`, we use UNIX domain sockets or named pipes instead of > TCP/IP sockets for the executor/agent communication. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (MESOS-8052) "protobuf_tests.pb.h" not found when running "make tests" directly in stout
Chun-Hung Hsiao created MESOS-8052: -- Summary: "protobuf_tests.pb.h" not found when running "make tests" directly in stout Key: MESOS-8052 URL: https://issues.apache.org/jira/browse/MESOS-8052 Project: Mesos Issue Type: Bug Components: stout Reporter: Chun-Hung Hsiao Assignee: Chun-Hung Hsiao Fix For: 1.4.1 If we run {{make tests}} without running {{make}} first, {{tests/protobuf_tests.proto}} would not be compiled, and thus the generated files would be missing: {noformat} g++ -DPACKAGE_NAME=\"stout\" -DPACKAGE_TARNAME=\"stout\" -DPACKAGE_VERSION=\"0.1.0\" -DPACKAGE_STRING=\"stout\ 0.1.0\" -DPACKAGE_BUGREPORT=\"\" -DPACKAGE_URL=\"\" -DPACKAGE=\"stout\" -DVERSION=\"0.1.0\" -DSTDC_HEADERS=1 -DHAVE_SYS_TYPES_H=1 -DHAVE_SYS_STAT_H=1 -DHAVE_STDLIB_H=1 -DHAVE_STRING_H=1 -DHAVE_MEMORY_H=1 -DHAVE_STRINGS_H=1 -DHAVE_INTTYPES_H=1 -DHAVE_STDINT_H=1 -DHAVE_UNISTD_H=1 -DHAVE_DLFCN_H=1 -DLT_OBJDIR=\".libs/\" -DHAVE_PTHREAD_PRIO_INHERIT=1 -DHAVE_PTHREAD=1 -DHAVE_LIBDL=1 -DHAVE_FTS_H=1 -DHAVE_APR_POOLS_H=1 -DHAVE_LIBAPR_1=1 -DHAVE_SVN_DELTA_H=1 -DHAVE_LIBSVN_DELTA_1=1 -DHAVE_SVN_VERSION_H=1 -DHAVE_LIBSVN_SUBR_1=1 -DHAVE_CXX11=1 -I. -I.. -I../include -isystem 3rdparty/boost-1.53.0 -I3rdparty/elfio-3.2 -I3rdparty/glog-0.3.3/src -I3rdparty/googletest-release-1.8.0/googlemock/include -I3rdparty/googletest-release-1.8.0/googletest/include -DPICOJSON_USE_INT64 -D__STDC_FORMAT_MACROS -I3rdparty/picojson-1.3.0 -I3rdparty/protobuf-3.3.0/src -I/usr/include/subversion-1 -I/usr/include/apr-1 -I/usr/include/apr-1.0 -Wall -Wsign-compare -Wformat-security -fstack-protector-strong -fPIC -fPIE -g1 -O0 -Wno-unused-local-typedefs -std=c++11 -MT stout_tests-protobuf_tests.o -MD -MP -MF .deps/stout_tests-protobuf_tests.Tpo -c -o stout_tests-protobuf_tests.o `test -f 'tests/protobuf_tests.cpp' || echo '../'`tests/protobuf_tests.cpp ../tests/protobuf_tests.cpp:28:31: fatal error: protobuf_tests.pb.h: No such file or directory compilation terminated. Makefile:1278: recipe for target 'stout_tests-protobuf_tests.o' failed make[1]: *** [stout_tests-protobuf_tests.o] Error 1 {noformat} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-6240) Allow executor/agent communication over non-TCP/IP stream socket.
[ https://issues.apache.org/jira/browse/MESOS-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191610#comment-16191610 ] Zhitao Li commented on MESOS-6240: -- +1. Moving the executor-to-agent API from TCP to a domain socket would also reduce the agent's potential security exposure. Is there a design doc for this work? > Allow executor/agent communication over non-TCP/IP stream socket. > - > > Key: MESOS-6240 > URL: https://issues.apache.org/jira/browse/MESOS-6240 > Project: Mesos > Issue Type: Improvement > Components: containerization > Environment: Linux and Windows >Reporter: Avinash Sridharan >Assignee: Benjamin Hindman >Priority: Critical > Labels: mesosphere > > Currently, the executor-agent communication happens specifically over TCP > sockets. This works fine in most cases, but specifically for the > `MesosContainerizer` when containers are running on CNI networks, this mode > of communication starts imposing constraints on the CNI network, since there > now has to be connectivity between the CNI network (on which the executor is > running) and the agent. Introducing paths from a CNI network to the > underlying agent, at best, creates headaches for operators and at worst > introduces serious security holes in the network, since it breaks the > isolation between the container CNI network and the host network (on which > the agent is running). > In order to simplify/strengthen deployment of Mesos containers on CNI > networks, we therefore need to move away from using TCP/IP sockets for > executor/agent communication. Since the executor and agent are guaranteed to > run on the same host, the above problems can be resolved if, for the > `MesosContainerizer`, we use UNIX domain sockets or named pipes instead of > TCP/IP sockets for the executor/agent communication. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Assigned] (MESOS-6240) Allow executor/agent communication over non-TCP/IP stream socket.
[ https://issues.apache.org/jira/browse/MESOS-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan reassigned MESOS-6240: Assignee: Benjamin Hindman > Allow executor/agent communication over non-TCP/IP stream socket. > - > > Key: MESOS-6240 > URL: https://issues.apache.org/jira/browse/MESOS-6240 > Project: Mesos > Issue Type: Improvement > Components: containerization > Environment: Linux and Windows >Reporter: Avinash Sridharan >Assignee: Benjamin Hindman >Priority: Critical > Labels: mesosphere > > Currently, the executor-agent communication happens specifically over TCP > sockets. This works fine in most cases, but specifically for the > `MesosContainerizer` when containers are running on CNI networks, this mode > of communication starts imposing constraints on the CNI network, since there > now has to be connectivity between the CNI network (on which the executor is > running) and the agent. Introducing paths from a CNI network to the > underlying agent, at best, creates headaches for operators and at worst > introduces serious security holes in the network, since it breaks the > isolation between the container CNI network and the host network (on which > the agent is running). > In order to simplify/strengthen deployment of Mesos containers on CNI > networks, we therefore need to move away from using TCP/IP sockets for > executor/agent communication. Since the executor and agent are guaranteed to > run on the same host, the above problems can be resolved if, for the > `MesosContainerizer`, we use UNIX domain sockets or named pipes instead of > TCP/IP sockets for the executor/agent communication. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-6240) Allow executor/agent communication over non-TCP/IP stream socket.
[ https://issues.apache.org/jira/browse/MESOS-6240?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan updated MESOS-6240: - Target Version/s: 1.5.0 > Allow executor/agent communication over non-TCP/IP stream socket. > - > > Key: MESOS-6240 > URL: https://issues.apache.org/jira/browse/MESOS-6240 > Project: Mesos > Issue Type: Improvement > Components: containerization > Environment: Linux and Windows >Reporter: Avinash Sridharan >Assignee: Benjamin Hindman >Priority: Critical > Labels: mesosphere > > Currently, the executor-agent communication happens specifically over TCP > sockets. This works fine in most cases, but specifically for the > `MesosContainerizer` when containers are running on CNI networks, this mode > of communication starts imposing constraints on the CNI network, since there > now has to be connectivity between the CNI network (on which the executor is > running) and the agent. Introducing paths from a CNI network to the > underlying agent, at best, creates headaches for operators and at worst > introduces serious security holes in the network, since it breaks the > isolation between the container CNI network and the host network (on which > the agent is running). > In order to simplify/strengthen deployment of Mesos containers on CNI > networks, we therefore need to move away from using TCP/IP sockets for > executor/agent communication. Since the executor and agent are guaranteed to > run on the same host, the above problems can be resolved if, for the > `MesosContainerizer`, we use UNIX domain sockets or named pipes instead of > TCP/IP sockets for the executor/agent communication. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
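To make the proposal concrete, here is a minimal sketch of an agent-side listener bound to a filesystem path rather than a TCP port; the path and error handling are illustrative only, not the eventual Mesos API:
{code}
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>

// Sketch: bind a stream socket to a filesystem path instead of a TCP
// port. A containerized executor can reach it via a bind-mounted path,
// with no route between the CNI network and the host network.
int listenOnUnixSocket(const char* path)
{
  int fd = ::socket(AF_UNIX, SOCK_STREAM, 0);
  if (fd < 0) { perror("socket"); return -1; }

  sockaddr_un addr;
  std::memset(&addr, 0, sizeof(addr));
  addr.sun_family = AF_UNIX;
  std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

  ::unlink(path);  // Remove a stale socket file from a previous run.

  if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) < 0 ||
      ::listen(fd, SOMAXCONN) < 0) {
    perror("bind/listen");
    ::close(fd);
    return -1;
  }
  return fd;
}
{code}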
[jira] [Updated] (MESOS-7951) Extend the KillPolicy
[ https://issues.apache.org/jira/browse/MESOS-7951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-7951: -- Sprint: Mesosphere Sprint 63, Mesosphere Sprint 64 (was: Mesosphere Sprint 63, Mesosphere Sprint 64, Mesosphere Sprint 65) > Extend the KillPolicy > - > > Key: MESOS-7951 > URL: https://issues.apache.org/jira/browse/MESOS-7951 > Project: Mesos > Issue Type: Improvement > Components: agent, executor, HTTP API >Reporter: Greg Mann >Assignee: Greg Mann > Labels: mesosphere > > After introducing the {{KillPolicy}} in MESOS-4909, some interactions with > framework developers have led to the suggestion of a couple possible > improvements to this interface. Namely, > * Allowing the framework to specify a command to be run to initiate > termination, rather than a signal to be sent, would allow some developers to > avoid wrapping their application in a signal handler. This is useful because > a signal handler wrapper modifies the application's process tree, which may > make introspection and debugging more difficult in the case of well-known > services with standard debugging procedures. > * In the case of terminations which do begin with a signal, it would be > useful to allow the framework to specify the signal to be sent, rather than > assuming SIGTERM. PostgreSQL, for example, permits several shutdown types, > each initiated with a [different > signal|https://www.postgresql.org/docs/9.3/static/server-shutdown.html]. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7951) Extend the KillPolicy
[ https://issues.apache.org/jira/browse/MESOS-7951?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-7951: -- Sprint: Mesosphere Sprint 63, Mesosphere Sprint 64, Mesosphere Sprint 66 (was: Mesosphere Sprint 63, Mesosphere Sprint 64) > Extend the KillPolicy > - > > Key: MESOS-7951 > URL: https://issues.apache.org/jira/browse/MESOS-7951 > Project: Mesos > Issue Type: Improvement > Components: agent, executor, HTTP API >Reporter: Greg Mann >Assignee: Greg Mann > Labels: mesosphere > > After introducing the {{KillPolicy}} in MESOS-4909, some interactions with > framework developers have led to the suggestion of a couple possible > improvements to this interface. Namely, > * Allowing the framework to specify a command to be run to initiate > termination, rather than a signal to be sent, would allow some developers to > avoid wrapping their application in a signal handler. This is useful because > a signal handler wrapper modifies the application's process tree, which may > make introspection and debugging more difficult in the case of well-known > services with standard debugging procedures. > * In the case of terminations which do begin with a signal, it would be > useful to allow the framework to specify the signal to be sent, rather than > assuming SIGTERM. PostgreSQL, for example, permits several shutdown types, > each initiated with a [different > signal|https://www.postgresql.org/docs/9.3/static/server-shutdown.html]. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
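To illustrate the second suggested improvement, here is a sketch of a kill path that takes the signal as a parameter instead of hardcoding SIGTERM; the function name is a hypothetical illustration, not the proposed API:
{code}
#include <csignal>
#include <sys/types.h>

// Illustrative only: initiate termination with a framework-chosen
// signal (e.g. SIGINT for a PostgreSQL "fast shutdown") instead of
// always sending SIGTERM. Escalation to SIGKILL after the kill grace
// period would remain as it is today.
bool initiateKill(pid_t pid, int signal = SIGTERM)
{
  return ::kill(pid, signal) == 0;
}
{code}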
[jira] [Commented] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2
[ https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191565#comment-16191565 ] Vinod Kone commented on MESOS-7130: --- Story points? > port_mapping isolator: executor hangs when running on EC2 > - > > Key: MESOS-7130 > URL: https://issues.apache.org/jira/browse/MESOS-7130 > Project: Mesos > Issue Type: Bug > Components: executor >Reporter: Pierre Cheynier >Assignee: Jie Yu > > Hi, > I'm experiencing a weird issue: I'm using a CI to do testing on > infrastructure automation. > I recently activated the {{network/port_mapping}} isolator. > I'm able to make the changes work and pass the test for bare-metal servers > and virtualbox VMs using this configuration. > But when I try on EC2 (on which my CI pipeline relies) it systematically fails > to run any container. > It appears that the sandbox is created and the port_mapping isolator seems to > be OK according to the logs in stdout and stderr and the {{tc}} output: > {noformat} > + mount --make-rslave /run/netns > + test -f /proc/sys/net/ipv6/conf/all/disable_ipv6 > + echo 1 > + ip link set lo address 02:44:20:bb:42:cf mtu 9001 up > + ethtool -K eth0 rx off > (...) > + tc filter show dev eth0 parent :0 > + tc filter show dev lo parent :0 > I0215 16:01:13.941375 1 exec.cpp:161] Version: 1.0.2 > {noformat} > Then the executor never comes back to the REGISTERED state and hangs > indefinitely. {{GLOG_v=3}} doesn't help here. > My skills in this area are limited, but after loading the symbols and attaching > gdb to the mesos-executor process, I'm able to print this stack: > {noformat} > #0 0x7feffc1386d5 in pthread_cond_wait@@GLIBC_2.3.2 () from > /usr/lib64/libpthread.so.0 > #1 0x7feffbed69ec in > std::condition_variable::wait(std::unique_lock<std::mutex>&) () from > /usr/lib64/libstdc++.so.6 > #2 0x7ff0003dd8ec in void synchronized_wait<std::condition_variable, > std::mutex>(std::condition_variable*, std::mutex*) () from > /usr/lib64/libmesos-1.0.2.so > #3 0x7ff0017d595d in Gate::arrive(long) () from > /usr/lib64/libmesos-1.0.2.so > #4 0x7ff0017c00ed in process::ProcessManager::wait(process::UPID const&) > () from /usr/lib64/libmesos-1.0.2.so > #5 0x7ff0017c5c05 in process::wait(process::UPID const&, Duration > const&) () from /usr/lib64/libmesos-1.0.2.so > #6 0x004ab26f in process::wait(process::ProcessBase const*, Duration > const&) () > #7 0x004a3903 in main () > {noformat} > I concluded that the underlying shell script launched by the isolator, or the > task itself, is just blocked, but I don't understand why. > Here is a process tree showing that no task is running but the executor is: > {noformat} > root 28420 0.8 3.0 1061420 124940 ? 
Ssl 17:56 0:25 > /usr/sbin/mesos-slave --advertise_ip=127.0.0.1 > --attributes=platform:centos;platform_major_version:7;type:base > --cgroups_enable_cfs --cgroups_hierarchy=/sys/fs/cgroup > --cgroups_net_cls_primary_handle=0xC370 > --container_logger=org_apache_mesos_LogrotateContainerLogger > --containerizers=mesos,docker > --credential=file:///etc/mesos-chef/slave-credential > --default_container_info={"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]} > --default_role=default --docker_registry=/usr/share/mesos/users > --docker_store_dir=/var/opt/mesos/store/docker > --egress_unique_flow_per_container --enforce_container_disk_quota > --ephemeral_ports_per_container=128 > --executor_environment_variables={"PATH":"/bin:/usr/bin:/usr/sbin","CRITEO_DC":"par","CRITEO_ENV":"prod"} > --image_providers=docker --image_provisioner_backend=copy > --isolation=cgroups/cpu,cgroups/mem,cgroups/net_cls,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime,network/cni,network/port_mapping > --logging_level=INFO > --master=zk://mesos:test@localhost.localdomain:2181/mesos > --modules=file:///etc/mesos-chef/slave-modules.json --port=5051 > --recover=reconnect > --resources=ports:[31000-32000];ephemeral_ports:[32768-57344] --strict > --work_dir=/var/opt/mesos > root 28484 0.0 2.3 433676 95016 ?Ssl 17:56 0:00 \_ > mesos-logrotate-logger --help=false > --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/stdout > --logrotate_options=rotate 5 --logrotate_path=logrotate --max_size=10MB > root 28485 0.0 2.3 499212 94724 ?Ssl 17:56 0:00 \_ > mesos-logrotate-logger --help=false > --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be
[jira] [Commented] (MESOS-7975) The command/default/docker executor can incorrectly send a TASK_FINISHED update even when the task is killed
[ https://issues.apache.org/jira/browse/MESOS-7975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191564#comment-16191564 ] Vinod Kone commented on MESOS-7975: --- [~qianzhang] Can you add story points for this? > The command/default/docker executor can incorrectly send a TASK_FINISHED > update even when the task is killed > > > Key: MESOS-7975 > URL: https://issues.apache.org/jira/browse/MESOS-7975 > Project: Mesos > Issue Type: Bug >Reporter: Anand Mazumdar >Assignee: Qian Zhang >Priority: Critical > Labels: mesosphere > > Currently, when a task is killed, the default/command/docker executor > incorrectly sends a {{TASK_FINISHED}} status update instead of > {{TASK_KILLED}}. This is due to an unfortunate ordering of the conditional > checks: the success check runs before the {{killed}} check, so a killed task > that exits with a zero status code is reported as finished. > {code} > if (WSUCCEEDED(status)) { > taskState = TASK_FINISHED; > } else if (killed) { > // Send TASK_KILLED if the task was killed as a result of > // kill() or shutdown(). > taskState = TASK_KILLED; > } else { > taskState = TASK_FAILED; > } > {code} > We should modify the code to correctly send {{TASK_KILLED}} status updates > when a task is killed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8051) Killing TASK_GROUP fail to kill some tasks
[ https://issues.apache.org/jira/browse/MESOS-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Dukhovniy updated MESOS-8051: Summary: Killing TASK_GROUP fail to kill some tasks (was: Killing TASK_GROUP fails to kill some tasks) > Killing TASK_GROUP fail to kill some tasks > -- > > Key: MESOS-8051 > URL: https://issues.apache.org/jira/browse/MESOS-8051 > Project: Mesos > Issue Type: Bug > Components: agent, executor >Affects Versions: 1.4.0 >Reporter: A. Dukhovniy >Priority: Critical > Attachments: dcos-mesos-master.log.gz, dcos-mesos-slave.log.gz, > screenshot-1.png > > > When starting the following pod definition via marathon: > {code:java} > { > "id": "/simple-pod", > "scaling": { > "kind": "fixed", > "instances": 3 > }, > "environment": { > "PING": "PONG" > }, > "containers": [ > { > "name": "ct1", > "resources": { > "cpus": 0.1, > "mem": 32 > }, > "image": { > "kind": "MESOS", > "id": "busybox" > }, > "exec": { > "command": { > "shell": "while true; do echo the current time is $(date) > > ./test-v1/clock; sleep 1; done" > } > }, > "volumeMounts": [ > { > "name": "v1", > "mountPath": "test-v1" > } > ] > }, > { > "name": "ct2", > "resources": { > "cpus": 0.1, > "mem": 32 > }, > "exec": { > "command": { > "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep > 1; done" > } > }, > "volumeMounts": [ > { > "name": "v1", > "mountPath": "etc" > }, > { > "name": "v2", > "mountPath": "docker" > } > ] > } > ], > "networks": [ > { > "mode": "host" > } > ], > "volumes": [ > { > "name": "v1" > }, > { > "name": "v2", > "host": "/var/lib/docker" > } > ] > } > {code} > mesos will successfully kill all {{ct2}} containers but fail to kill all/some > of the {{ct1}} containers. I've attached both master and agent logs. The > interesting part starts after marathon issues 6 kills: > {code:java} > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.209966 4746 master.cpp:5297] Processing > KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853d > bf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) > at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210033 4746 master.cpp:5371] Telling > agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( > 10.0.1.207) to kill task > simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1 of framework > bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at > scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 > .229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210471 4748 master.cpp:5297] Processing > KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853d > bf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) > at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210518 4748 master.cpp:5371] Telling > agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( > 10.0.1.207) to kill task > simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2 of framework > bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at > scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 > .229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 
14:58:25.210602 4748 master.cpp:5297] Processing > KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853d > bf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) > at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210639 4748 master.cpp:5371] Telling > agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( > 10.0.1.207) to kill task > simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework > bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at > scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 > .229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210932 4753 mast
[jira] [Commented] (MESOS-8051) Killing TASK_GROUP fails to kill some tasks
[ https://issues.apache.org/jira/browse/MESOS-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191542#comment-16191542 ] A. Dukhovniy commented on MESOS-8051: - It also has nothing to do with the fact that the {{ct1}} container has a docker image: in another test I removed it and the result is the same, one of the containers will fail to stop. > Killing TASK_GROUP fails to kill some tasks > --- > > Key: MESOS-8051 > URL: https://issues.apache.org/jira/browse/MESOS-8051 > Project: Mesos > Issue Type: Bug > Components: agent, executor >Affects Versions: 1.4.0 >Reporter: A. Dukhovniy >Priority: Critical > Attachments: dcos-mesos-master.log.gz, dcos-mesos-slave.log.gz, > screenshot-1.png > > > When starting the following pod definition via marathon: > {code:java} > { > "id": "/simple-pod", > "scaling": { > "kind": "fixed", > "instances": 3 > }, > "environment": { > "PING": "PONG" > }, > "containers": [ > { > "name": "ct1", > "resources": { > "cpus": 0.1, > "mem": 32 > }, > "image": { > "kind": "MESOS", > "id": "busybox" > }, > "exec": { > "command": { > "shell": "while true; do echo the current time is $(date) > > ./test-v1/clock; sleep 1; done" > } > }, > "volumeMounts": [ > { > "name": "v1", > "mountPath": "test-v1" > } > ] > }, > { > "name": "ct2", > "resources": { > "cpus": 0.1, > "mem": 32 > }, > "exec": { > "command": { > "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep > 1; done" > } > }, > "volumeMounts": [ > { > "name": "v1", > "mountPath": "etc" > }, > { > "name": "v2", > "mountPath": "docker" > } > ] > } > ], > "networks": [ > { > "mode": "host" > } > ], > "volumes": [ > { > "name": "v1" > }, > { > "name": "v2", > "host": "/var/lib/docker" > } > ] > } > {code} > mesos will successfully kill all {{ct2}} containers but fail to kill all/some > of the {{ct1}} containers. I've attached both master and agent logs. 
The > interesting part starts after marathon issues 6 kills: > {code:java} > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.209966 4746 master.cpp:5297] Processing > KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853d > bf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) > at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210033 4746 master.cpp:5371] Telling > agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( > 10.0.1.207) to kill task > simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1 of framework > bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at > scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 > .229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210471 4748 master.cpp:5297] Processing > KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853d > bf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) > at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210518 4748 master.cpp:5371] Telling > agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( > 10.0.1.207) to kill task > simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2 of framework > bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at > scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 > .229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210602 4748 master.cpp:5297] Processing > KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853d > bf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) > at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210639 4748 master.cpp:5371] Telling > agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( > 10.0.1.207) to kill task > simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework > bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at > scheduler-c61c493c-728f-4bd9-be60-7373574749af@1
[jira] [Updated] (MESOS-8047) SubprocessTest.Status does not always receive a signal
[ https://issues.apache.org/jira/browse/MESOS-8047?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-8047: --- Labels: flaky-test (was: ) > SubprocessTest.Status does not always receive a signal > -- > > Key: MESOS-8047 > URL: https://issues.apache.org/jira/browse/MESOS-8047 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers > Labels: flaky-test > > This one seems to be different from MESOS-1705 and MESOS-1738. It might be > that previous test runs leave a mesos process running in the background, but > I didn't investigate very deeply: > {code} > [ RUN ] SubprocessTest.Status > /home/bevers/src/mesos/worktrees/master/3rdparty/libprocess/src/tests/subprocess_tests.cpp:281: > Failure > Expecting WIFSIGNALED(s.get().status()()->get()) but > WIFEXITED(s.get().status()()->get()) is true and > WEXITSTATUS(s.get().status()()->get()) is 0 > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
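For context on that assertion, here is a self-contained sketch of how a wait status is produced and inspected with the {{WIFSIGNALED}}/{{WIFEXITED}} macros; the flaky failure corresponds to the child exiting normally with status 0 instead of dying by the signal. This illustrates the macros, not the test itself:
{code}
#include <cassert>
#include <csignal>
#include <sys/wait.h>
#include <unistd.h>

int main()
{
  pid_t pid = fork();
  if (pid == 0) {
    pause();   // Child: block until a signal arrives.
    _exit(0);  // Unreachable: no handler is installed, so SIGTERM kills us.
  }

  kill(pid, SIGTERM);

  int status;
  waitpid(pid, &status, 0);

  // The test's expectation: terminated by a signal, not a clean exit.
  assert(WIFSIGNALED(status) && WTERMSIG(status) == SIGTERM);
  assert(!WIFEXITED(status));
  return 0;
}
{code}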
[jira] [Updated] (MESOS-7589) CommandExecutorCheckTest.CommandCheckDeliveredAndReconciled is flaky
[ https://issues.apache.org/jira/browse/MESOS-7589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-7589: --- Labels: flaky-test mesosphere (was: mesosphere) > CommandExecutorCheckTest.CommandCheckDeliveredAndReconciled is flaky > > > Key: MESOS-7589 > URL: https://issues.apache.org/jira/browse/MESOS-7589 > Project: Mesos > Issue Type: Bug >Reporter: Neil Conway > Labels: flaky-test, mesosphere > Attachments: command_check_fail.txt > > > See attached test log; observed on ASF CI. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-7971) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky
[ https://issues.apache.org/jira/browse/MESOS-7971?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-7971: --- Labels: flaky-test mesosphere (was: ) > PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove test is flaky > - > > Key: MESOS-7971 > URL: https://issues.apache.org/jira/browse/MESOS-7971 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.4.0 >Reporter: Vinod Kone > Labels: flaky-test, mesosphere > > Saw this when testing 1.4.0-rc5 > {code} > [ RUN ] PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove > I0912 05:40:27.335222 30860 cluster.cpp:162] Creating default 'local' > authorizer > I0912 05:40:27.338429 30867 master.cpp:442] Master > 2bd1e8eb-e314-4181-9ed3-d397ec1dbede (6aa774430302) started on > 172.17.0.3:54639 > I0912 05:40:27.338472 30867 master.cpp:444] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="50ms" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/hH0YXe/credentials" > --filter_gpu_resources="true" --framework_sorter="drf" --help="false" > --hostname_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" > --max_agent_ping_timeouts="5" --max_completed_frameworks="50" > --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" --roles="role1" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/hH0YXe/master" > --zk_session_timeout="10secs" > I0912 05:40:27.338778 30867 master.cpp:494] Master only allowing > authenticated frameworks to register > I0912 05:40:27.338788 30867 master.cpp:508] Master only allowing > authenticated agents to register > I0912 05:40:27.338793 30867 master.cpp:521] Master only allowing > authenticated HTTP frameworks to register > I0912 05:40:27.338799 30867 credentials.hpp:37] Loading credentials for > authentication from '/tmp/hH0YXe/credentials' > I0912 05:40:27.353009 30867 master.cpp:566] Using default 'crammd5' > authenticator > I0912 05:40:27.353183 30867 http.cpp:1026] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I0912 05:40:27.353364 30867 http.cpp:1026] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I0912 05:40:27.353482 30867 http.cpp:1026] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I0912 05:40:27.353588 30867 master.cpp:646] Authorization enabled > W0912 05:40:27.353605 30867 master.cpp:709] The '--roles' flag is deprecated. > This flag will be removed in the future. 
See the Mesos 0.27 upgrade notes for > more information > I0912 05:40:27.353742 30868 hierarchical.cpp:171] Initialized hierarchical > allocator process > I0912 05:40:27.353775 30872 whitelist_watcher.cpp:77] No whitelist given > I0912 05:40:27.356655 30873 master.cpp:2163] Elected as the leading master! > I0912 05:40:27.356675 30873 master.cpp:1702] Recovering from registrar > I0912 05:40:27.356868 30874 registrar.cpp:347] Recovering registrar > I0912 05:40:27.357390 30874 registrar.cpp:391] Successfully fetched the > registry (0B) in 494080ns > I0912 05:40:27.357483 30874 registrar.cpp:495] Applied 1 operations in > 31911ns; attempting to update the registry > I0912 05:40:27.357919 30874 registrar.cpp:552] Successfully updated the > registry in 391936ns > I0912 05:40:27.358018 30874 registrar.cpp:424] Successfully recovered > registrar > I0912 05:40:27.358413 30868 master.cpp:1801] Recovered 0 agents from the > registry (129B); allowing 10mins for agents to re-register > I0912 05:40:27.358482 30867 hierarchical.cpp:209] Skipping recovery of > hierarchical allocator: nothing to recover > W0912 05:40:27.364050 30860 process.cpp:3196] Attempted to spawn already > running process files@172.17.0.3:54639 > I0912 05:40:27.365372 30860 containerizer.cpp:246] Using isolation: > posix/cpu,posix/mem,filesystem/posix,network/cni,environment_secret > W0912 05:40:27.365909 30860 backend.cpp:76] Failed to create 'aufs' backend: > AufsBacke
[jira] [Assigned] (MESOS-7739) RegisterSlaveValidationTest.DropInvalidReregistration is flaky
[ https://issues.apache.org/jira/browse/MESOS-7739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov reassigned MESOS-7739: -- Assignee: (was: Neil Conway) > RegisterSlaveValidationTest.DropInvalidReregistration is flaky > -- > > Key: MESOS-7739 > URL: https://issues.apache.org/jira/browse/MESOS-7739 > Project: Mesos > Issue Type: Bug >Reporter: Vinod Kone > Labels: flaky-test, mesosphere-oncall > > Observed this on ASF CI. > Seems a bit different from MESOS-7441. > {code} > [ RUN ] RegisterSlaveValidationTest.DropInvalidReregistration > I0629 05:23:17.367363 2252 cluster.cpp:162] Creating default 'local' > authorizer > I0629 05:23:17.370198 2276 master.cpp:436] Master > 25091bef-3845-4bb6-ae23-e18ac0f4d174 (b3c104d65da7) started on > 172.17.0.3:42034 > I0629 05:23:17.370234 2276 master.cpp:438] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" - > -allocator="HierarchicalDRF" --authenticate_agents="true" > --authenticate_frameworks="true" --authenticate_http_frameworks="true" > --authenticate_http_readonly="true" --au > thenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/V0UvSM/credentials" > --framework_sorter="drf" --help="false" --hostn > ame_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" > --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-1.3.1/_inst/share/mesos/webui" > --work_dir="/tmp/V0UvSM/master" --zk_session_timeout="10secs" > I0629 05:23:17.370513 2276 master.cpp:488] Master only allowing > authenticated frameworks to register > I0629 05:23:17.370525 2276 master.cpp:502] Master only allowing > authenticated agents to register > I0629 05:23:17.370534 2276 master.cpp:515] Master only allowing > authenticated HTTP frameworks to register > I0629 05:23:17.370543 2276 credentials.hpp:37] Loading credentials for > authentication from '/tmp/V0UvSM/credentials' > I0629 05:23:17.370806 2276 master.cpp:560] Using default 'crammd5' > authenticator > I0629 05:23:17.370929 2276 http.cpp:975] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I0629 05:23:17.371073 2276 http.cpp:975] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I0629 05:23:17.371193 2276 http.cpp:975] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I0629 05:23:17.371318 2276 master.cpp:640] Authorization enabled > I0629 05:23:17.371455 2272 hierarchical.cpp:158] Initialized hierarchical > allocator process > I0629 05:23:17.371477 2290 whitelist_watcher.cpp:77] No whitelist given > I0629 05:23:17.373731 2277 master.cpp:2161] Elected as the leading master! 
> I0629 05:23:17.373760 2277 master.cpp:1700] Recovering from registrar > I0629 05:23:17.373891 2280 registrar.cpp:345] Recovering registrar > I0629 05:23:17.374527 2280 registrar.cpp:389] Successfully fetched the > registry (0B) in 593152ns > I0629 05:23:17.374625 2280 registrar.cpp:493] Applied 1 operations in > 19216ns; attempting to update the registry > I0629 05:23:17.375228 2280 registrar.cpp:550] Successfully updated the > registry in 555008ns > I0629 05:23:17.375336 2280 registrar.cpp:422] Successfully recovered > registrar > I0629 05:23:17.375826 2282 hierarchical.cpp:185] Skipping recovery of > hierarchical allocator: nothing to recover > I0629 05:23:17.375850 2290 master.cpp:1799] Recovered 0 agents from the > registry (129B); allowing 10mins for agents to re-register > I0629 05:23:17.380674 2252 containerizer.cpp:221] Using isolation: > posix/cpu,posix/mem,filesystem/posix,network/cni > W0629 05:23:17.381237 2252 backend.cpp:76] Failed to create 'aufs' backend: > AufsBackend requires root privileges > W0629 05:23:17.381350 2252 backend.cpp:76] Failed to create 'bind' backend: > BindBackend requires root privileges > I0629 05:23:17.381384 2252 provisioner.cpp:249] Using default backend 'copy' > I0629 05:23:17.383884 2252 cluster.cpp:448] Creating default 'local' > authorizer > I0629 05:23:17.385763 2281 slave.cpp:231] Mesos
[jira] [Updated] (MESOS-7739) RegisterSlaveValidationTest.DropInvalidReregistration is flaky.
[ https://issues.apache.org/jira/browse/MESOS-7739?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-7739: --- Summary: RegisterSlaveValidationTest.DropInvalidReregistration is flaky. (was: RegisterSlaveValidationTest.DropInvalidReregistration is flaky) > RegisterSlaveValidationTest.DropInvalidReregistration is flaky. > --- > > Key: MESOS-7739 > URL: https://issues.apache.org/jira/browse/MESOS-7739 > Project: Mesos > Issue Type: Bug >Reporter: Vinod Kone > Labels: flaky-test, mesosphere-oncall > > Observed this on ASF CI. > Seems a bit different from MESOS-7441. > {code} > [ RUN ] RegisterSlaveValidationTest.DropInvalidReregistration > I0629 05:23:17.367363 2252 cluster.cpp:162] Creating default 'local' > authorizer > I0629 05:23:17.370198 2276 master.cpp:436] Master > 25091bef-3845-4bb6-ae23-e18ac0f4d174 (b3c104d65da7) started on > 172.17.0.3:42034 > I0629 05:23:17.370234 2276 master.cpp:438] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" - > -allocator="HierarchicalDRF" --authenticate_agents="true" > --authenticate_frameworks="true" --authenticate_http_frameworks="true" > --authenticate_http_readonly="true" --au > thenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/V0UvSM/credentials" > --framework_sorter="drf" --help="false" --hostn > ame_lookup="true" --http_authenticators="basic" > --http_framework_authenticators="basic" --initialize_driver_logging="true" > --log_auto_initialize="true" --logbufsecs="0" > --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --port="5050" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/mesos/mesos-1.3.1/_inst/share/mesos/webui" > --work_dir="/tmp/V0UvSM/master" --zk_session_timeout="10secs" > I0629 05:23:17.370513 2276 master.cpp:488] Master only allowing > authenticated frameworks to register > I0629 05:23:17.370525 2276 master.cpp:502] Master only allowing > authenticated agents to register > I0629 05:23:17.370534 2276 master.cpp:515] Master only allowing > authenticated HTTP frameworks to register > I0629 05:23:17.370543 2276 credentials.hpp:37] Loading credentials for > authentication from '/tmp/V0UvSM/credentials' > I0629 05:23:17.370806 2276 master.cpp:560] Using default 'crammd5' > authenticator > I0629 05:23:17.370929 2276 http.cpp:975] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I0629 05:23:17.371073 2276 http.cpp:975] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I0629 05:23:17.371193 2276 http.cpp:975] Creating default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I0629 05:23:17.371318 2276 master.cpp:640] Authorization enabled > I0629 05:23:17.371455 2272 hierarchical.cpp:158] Initialized hierarchical > allocator process > I0629 05:23:17.371477 2290 whitelist_watcher.cpp:77] No whitelist given > I0629 05:23:17.373731 2277 master.cpp:2161] Elected as the leading master! 
> I0629 05:23:17.373760 2277 master.cpp:1700] Recovering from registrar > I0629 05:23:17.373891 2280 registrar.cpp:345] Recovering registrar > I0629 05:23:17.374527 2280 registrar.cpp:389] Successfully fetched the > registry (0B) in 593152ns > I0629 05:23:17.374625 2280 registrar.cpp:493] Applied 1 operations in > 19216ns; attempting to update the registry > I0629 05:23:17.375228 2280 registrar.cpp:550] Successfully updated the > registry in 555008ns > I0629 05:23:17.375336 2280 registrar.cpp:422] Successfully recovered > registrar > I0629 05:23:17.375826 2282 hierarchical.cpp:185] Skipping recovery of > hierarchical allocator: nothing to recover > I0629 05:23:17.375850 2290 master.cpp:1799] Recovered 0 agents from the > registry (129B); allowing 10mins for agents to re-register > I0629 05:23:17.380674 2252 containerizer.cpp:221] Using isolation: > posix/cpu,posix/mem,filesystem/posix,network/cni > W0629 05:23:17.381237 2252 backend.cpp:76] Failed to create 'aufs' backend: > AufsBackend requires root privileges > W0629 05:23:17.381350 2252 backend.cpp:76] Failed to create 'bind' backend: > BindBackend requires root privileges > I0629 05:23:17.381384 2252 provisioner.cpp:249] Using default backend 'copy' > I0629 05:23:17.383884 2252
[jira] [Updated] (MESOS-8051) Killing TASK_GROUP fails to kill some tasks
[ https://issues.apache.org/jira/browse/MESOS-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Dukhovniy updated MESOS-8051: Description: When starting following pod definition via marathon: {code:java} { "id": "/simple-pod", "scaling": { "kind": "fixed", "instances": 3 }, "environment": { "PING": "PONG" }, "containers": [ { "name": "ct1", "resources": { "cpus": 0.1, "mem": 32 }, "image": { "kind": "MESOS", "id": "busybox" }, "exec": { "command": { "shell": "while true; do echo the current time is $(date) > ./test-v1/clock; sleep 1; done" } }, "volumeMounts": [ { "name": "v1", "mountPath": "test-v1" } ] }, { "name": "ct2", "resources": { "cpus": 0.1, "mem": 32 }, "exec": { "command": { "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep 1; done" } }, "volumeMounts": [ { "name": "v1", "mountPath": "etc" }, { "name": "v2", "mountPath": "docker" } ] } ], "networks": [ { "mode": "host" } ], "volumes": [ { "name": "v1" }, { "name": "v2", "host": "/var/lib/docker" } ] } {code} mesos will successfully kill all {{ct2}} containers but fail to kill all/some of the {{ct1}} containers. I've attached both master and agent logs. The interesting part starts after marathon issues 6 kills: {code:java} Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.209966 4746 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853d bf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210033 4746 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( 10.0.1.207) to kill task simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 .229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210471 4748 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853d bf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210518 4748 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( 10.0.1.207) to kill task simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 .229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210602 4748 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853d bf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210639 4748 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( 10.0.1.207) to kill task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at 
scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 .229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210932 4753 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853d bf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210968 4753 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( 10.0.1.207) to kill task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 .229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211210 4747 master.cpp:5297] Processing
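For context on the log trail above: each "Processing KILL call" line corresponds to one kill the framework issued through the scheduler API, one per task in the pod. Below is a rough sketch of that step against the C++ scheduler driver API; the helper name and the task-ID plumbing are illustrative and are not Marathon's actual code.
{code}
#include <string>
#include <vector>

#include <mesos/scheduler.hpp>

// Sketch: how a framework asks the master to kill each task in a
// task group. The master logs "Processing KILL call", forwards the
// kill to the agent, and the agent relays it to the executor -- the
// step that misbehaves for the 'ct1' containers in this ticket.
void killPodTasks(mesos::SchedulerDriver* driver,
                  const std::vector<std::string>& taskIds)
{
  for (const std::string& id : taskIds) {
    mesos::TaskID taskId;
    taskId.set_value(id);  // e.g. "simple-pod.instance-<uuid>.ct1"

    driver->killTask(taskId);
  }
}
{code}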
[jira] [Updated] (MESOS-7082) ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is flaky.
[ https://issues.apache.org/jira/browse/MESOS-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-7082: --- Summary: ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is flaky. (was: ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is flaky) > ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 is > flaky. > - > > Key: MESOS-7082 > URL: https://issues.apache.org/jira/browse/MESOS-7082 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.2.0 > Environment: ubuntu 16.04 with/without SSL > Fedora 23 >Reporter: Anand Mazumdar >Priority: Critical > Labels: flaky, flaky-test, mesosphere > > Showed up on our internal CI > {noformat} > 07:00:17 [ RUN ] > ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0 > 07:00:17 I0207 07:00:17.775459 2952 cluster.cpp:160] Creating default > 'local' authorizer > 07:00:17 I0207 07:00:17.776511 2970 master.cpp:383] Master > fa1554c4-572a-4b89-8994-a89460f588d3 (ip-10-153-254-29.ec2.internal) started > on 10.153.254.29:38570 > 07:00:17 I0207 07:00:17.776538 2970 master.cpp:385] Flags at startup: > --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="1secs" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/ZROfJk/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --max_unreachable_tasks_per_framework="1000" --quiet="false" > --recovery_agent_removal_limit="100%" --registry="in_memory" > --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" > --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" > --registry_store_timeout="100secs" --registry_strict="false" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/ZROfJk/master" > --zk_session_timeout="10secs" > 07:00:17 I0207 07:00:17.776674 2970 master.cpp:435] Master only allowing > authenticated frameworks to register > 07:00:17 I0207 07:00:17.776687 2970 master.cpp:449] Master only allowing > authenticated agents to register > 07:00:17 I0207 07:00:17.776695 2970 master.cpp:462] Master only allowing > authenticated HTTP frameworks to register > 07:00:17 I0207 07:00:17.776703 2970 credentials.hpp:37] Loading credentials > for authentication from '/tmp/ZROfJk/credentials' > 07:00:17 I0207 07:00:17.776779 2970 master.cpp:507] Using default 'crammd5' > authenticator > 07:00:17 I0207 07:00:17.776841 2970 http.cpp:919] Using default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > 07:00:17 I0207 07:00:17.776919 2970 http.cpp:919] Using default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > 07:00:17 I0207 07:00:17.776970 2970 http.cpp:919] Using default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > 07:00:17 I0207 07:00:17.777009 2970 master.cpp:587] Authorization enabled > 07:00:17 I0207 07:00:17.777122 2975 hierarchical.cpp:161] Initialized > 
hierarchical allocator process > 07:00:17 I0207 07:00:17.777138 2974 whitelist_watcher.cpp:77] No whitelist > given > 07:00:17 I0207 07:00:17.04 2976 master.cpp:2123] Elected as the leading > master! > 07:00:17 I0207 07:00:17.26 2976 master.cpp:1645] Recovering from > registrar > 07:00:17 I0207 07:00:17.84 2975 registrar.cpp:329] Recovering registrar > 07:00:17 I0207 07:00:17.777989 2973 registrar.cpp:362] Successfully fetched > the registry (0B) in 176384ns > 07:00:17 I0207 07:00:17.778023 2973 registrar.cpp:461] Applied 1 operations > in 7573ns; attempting to update the registry > 07:00:17 I0207 07:00:17.778249 2976 registrar.cpp:506] Successfully updated > the registry in 210944ns > 07:00:17 I0207 07:00:17.778290 2976 registrar.cpp:392] Successfully > recovered registrar > 07:00:17 I0207 07:00:17.778373 2976 master.cpp:1761] Recovered 0 agents from > the registry (172B); allowing 10mins for agents to re-register > 07:00:17 I0207 07:00:17.778394 2974 hierarchical.cpp:188] Skipping recovery > of hierarchical allocator: nothing to recover > 07:00:17 I0207 07:00:17.869381 2952 containerizer.cpp:220] Using isol
[jira] [Commented] (MESOS-8051) Killing TASK_GROUP fails to kill some tasks
[ https://issues.apache.org/jira/browse/MESOS-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191502#comment-16191502 ] A. Dukhovniy commented on MESOS-8051: - Here are the logs for one of the failing tasks from master: {code:java} 40268:Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210602 4748 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 40269:Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210639 4748 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 40287:Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.331063 4747 master.cpp:6841] Status update TASK_KILLING (UUID: 23c6e28b-4370-4da3-981c-13a121b145c0) for task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 from agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) 40288:Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.331110 4747 master.cpp:6903] Forwarding status update TASK_KILLING (UUID: 23c6e28b-4370-4da3-981c-13a121b145c0) for task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 40289:Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.331193 4747 master.cpp:8928] Updating the state of task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (latest state: TASK_KILLING, status update state: TASK_KILLING) 40297:Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.341003 4750 master.cpp:5479] Processing ACKNOWLEDGE call 23c6e28b-4370-4da3-981c-13a121b145c0 for task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 on agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 40337:Oct 04 14:58:35 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:35.229382 4746 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 40338:Oct 04 14:58:35 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:35.229418 4746 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) to kill task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 40372:Oct 04 14:58:55 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:55.168781 4752 master.cpp:6841] Status update TASK_FAILED (UUID: 
57b5c03e-517c-4dc2-8592-c24e5c875fde) for task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 from agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 (10.0.1.207) {code} It takes ~30s, marathon issues 2 kills in the meantime and eventually {{TASK_FAILED}} is received. > Killing TASK_GROUP fails to kill some tasks > --- > > Key: MESOS-8051 > URL: https://issues.apache.org/jira/browse/MESOS-8051 > Project: Mesos > Issue Type: Bug > Components: agent, executor >Affects Versions: 1.4.0 >Reporter: A. Dukhovniy >Priority: Critical > Attachments: dcos-mesos-master.log.gz, dcos-mesos-slave.log.gz, > screenshot-1.png > > > When starting following pod definition via marathon: > {code:java} > { > "id": "/simple-pod", > "scaling": { > "kind": "fixed", > "instances": 3 > }, > "environment": { > "PING": "PONG" > }, > "containers": [ > { > "name": "ct1", > "resources": { > "cpus": 0.1, > "mem": 32 > }, > "image": { > "kind": "MESOS", > "id": "busybox" > }, > "exec": { > "command": { > "shell": "whi
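The timeline in this comment (KILL at 14:58:25, TASK_KILLING acknowledged, a second KILL at 14:58:35, TASK_FAILED only at 14:58:55) shows the scheduler-side kill-retry pattern at work. A self-contained sketch of that pattern is below; the 10-second interval matches the gap between the two KILL calls in the log, while the function and parameter names are illustrative, not Marathon's or Mesos' actual code.
{code}
#include <chrono>
#include <functional>
#include <iostream>
#include <thread>

// Issue KILL, wait for a terminal status update, and re-issue KILL if
// none arrives in time. In the log above the task only terminates
// ~30s later, and with TASK_FAILED rather than the expected TASK_KILLED.
void killWithRetry(const std::function<void()>& sendKill,
                   const std::function<bool()>& receivedTerminalUpdate,
                   std::chrono::seconds retryInterval)
{
  while (!receivedTerminalUpdate()) {
    sendKill();  // master logs "Processing KILL call for task ..."
    std::this_thread::sleep_for(retryInterval);
  }
}

int main()
{
  int kills = 0;
  killWithRetry(
      [&] { std::cout << "KILL #" << ++kills << std::endl; },
      [&] { return kills >= 2; },  // simulate a terminal update after 2 kills
      std::chrono::seconds(1));    // 10s between KILLs in the real log
  return 0;
}
{code}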
[jira] [Updated] (MESOS-8051) Killing TASK_GROUP fails to kill some tasks
[ https://issues.apache.org/jira/browse/MESOS-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Dukhovniy updated MESOS-8051: Attachment: screenshot-1.png > Killing TASK_GROUP fails to kill some tasks > --- > > Key: MESOS-8051 > URL: https://issues.apache.org/jira/browse/MESOS-8051 > Project: Mesos > Issue Type: Bug > Components: agent, executor >Affects Versions: 1.4.0 >Reporter: A. Dukhovniy >Priority: Critical > Attachments: dcos-mesos-master.log.gz, dcos-mesos-slave.log.gz, > screenshot-1.png > > > When starting following pod definition via marathon: > {code:java} > { > "id": "/simple-pod", > "scaling": { > "kind": "fixed", > "instances": 3 > }, > "environment": { > "PING": "PONG" > }, > "containers": [ > { > "name": "ct1", > "resources": { > "cpus": 0.1, > "mem": 32 > }, > "image": { > "kind": "MESOS", > "id": "busybox" > }, > "exec": { > "command": { > "shell": "while true; do echo the current time is $(date) > > ./test-v1/clock; sleep 1; done" > } > }, > "volumeMounts": [ > { > "name": "v1", > "mountPath": "test-v1" > } > ] > }, > { > "name": "ct2", > "resources": { > "cpus": 0.1, > "mem": 32 > }, > "exec": { > "command": { > "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep > 1; done" > } > }, > "volumeMounts": [ > { > "name": "v1", > "mountPath": "etc" > }, > { > "name": "v2", > "mountPath": "docker" > } > ] > } > ], > "networks": [ > { > "mode": "host" > } > ], > "volumes": [ > { > "name": "v1" > }, > { > "name": "v2", > "host": "/var/lib/docker" > } > ] > } > {code} > mesos will successfully kill all {{ct2}} containers but fail to kill all/some > of the {{ct1}} containers. I've attached both master and agent logs. The > interesting part starts after marathon issues 6 kills: > {code:java} > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.209966 4746 master.cpp:5297] Processing > KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853d > bf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) > at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210033 4746 master.cpp:5371] Telling > agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( > 10.0.1.207) to kill task > simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1 of framework > bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at > scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 > .229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210471 4748 master.cpp:5297] Processing > KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853d > bf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) > at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210518 4748 master.cpp:5371] Telling > agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( > 10.0.1.207) to kill task > simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2 of framework > bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at > scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 > .229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210602 4748 master.cpp:5297] Processing > KILL call for task 
'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853d > bf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) > at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210639 4748 master.cpp:5371] Telling > agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( > 10.0.1.207) to kill task > simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework > bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at > scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 > .229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210932 4753 master.cpp:5297] Processing > KILL call for task 'simple-pod.instance-3c0ffca
[jira] [Updated] (MESOS-6086) PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove is flaky.
[ https://issues.apache.org/jira/browse/MESOS-6086?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-6086: --- Labels: flaky-test tech-debt (was: tech-debt) > PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove is flaky. > - > > Key: MESOS-6086 > URL: https://issues.apache.org/jira/browse/MESOS-6086 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Mahler >Assignee: Neil Conway > Labels: flaky-test, tech-debt > > Observed this when running on a CentOS 7 machine. > Good Run: > {noformat} > [ RUN ] PersistentVolumeEndpointsTest.EndpointCreateThenOfferRemove > I0824 14:24:15.585021 19320 cluster.cpp:157] Creating default 'local' > authorizer > I0824 14:24:15.590765 19320 replica.cpp:776] Replica recovered with log > positions 0 -> 0 with 1 holes and 0 unlearned > I0824 14:24:15.593570 19370 recover.cpp:451] Starting replica recovery > I0824 14:24:15.594476 19370 recover.cpp:477] Replica is in EMPTY status > I0824 14:24:15.597961 19352 replica.cpp:673] Replica in EMPTY status received > a broadcasted recover request from __req_res__(1)@10.0.49.2:38017 > I0824 14:24:15.599189 19351 recover.cpp:197] Received a recover response from > a replica in EMPTY status > I0824 14:24:15.600607 19364 recover.cpp:568] Updating replica status to > STARTING > I0824 14:24:15.601824 19336 replica.cpp:320] Persisted replica status to > STARTING > I0824 14:24:15.602224 19351 recover.cpp:477] Replica is in STARTING status > I0824 14:24:15.603526 19373 replica.cpp:673] Replica in STARTING status > received a broadcasted recover request from __req_res__(2)@10.0.49.2:38017 > I0824 14:24:15.603824 19375 recover.cpp:197] Received a recover response from > a replica in STARTING status > I0824 14:24:15.604395 19380 recover.cpp:568] Updating replica status to VOTING > I0824 14:24:15.605470 19334 replica.cpp:320] Persisted replica status to > VOTING > I0824 14:24:15.605612 19375 recover.cpp:582] Successfully joined the Paxos > group > I0824 14:24:15.607223 19367 master.cpp:379] Master > dff6317e-46bf-4bf1-8a56-3fcdfb3df5e5 (core-dev) started on 10.0.49.2:38017 > I0824 14:24:15.607286 19367 master.cpp:381] Flags at startup: --acls="" > --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" > --allocation_interval="50ms" --allocator="HierarchicalDRF" > --authenticate_agents="true" --authenticate_frameworks="true" > --authenticate_http_frameworks="true" --authenticate_http_readonly="true" > --authenticate_http_readwrite="true" --authenticators="crammd5" > --authorizers="local" --credentials="/tmp/DZsoQK/credentials" > --framework_sorter="drf" --help="false" --hostname_lookup="true" > --http_authenticators="basic" --http_framework_authenticators="basic" > --initialize_driver_logging="true" --log_auto_initialize="true" > --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" > --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" > --quiet="false" --recovery_agent_removal_limit="100%" > --registry="replicated_log" --registry_fetch_timeout="1mins" > --registry_store_timeout="100secs" --registry_strict="true" --roles="role1" > --root_submissions="true" --user_sorter="drf" --version="false" > --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/DZsoQK/master" > --zk_session_timeout="10secs" > I0824 14:24:15.609459 19367 master.cpp:431] Master only allowing > authenticated frameworks to register > I0824 14:24:15.609486 19367 master.cpp:445] Master only allowing > authenticated agents to register > I0824 
14:24:15.609566 19367 master.cpp:458] Master only allowing > authenticated HTTP frameworks to register > I0824 14:24:15.609591 19367 credentials.hpp:37] Loading credentials for > authentication from '/tmp/DZsoQK/credentials' > I0824 14:24:15.610335 19367 master.cpp:503] Using default 'crammd5' > authenticator > I0824 14:24:15.610589 19367 authenticator.cpp:519] Initializing server SASL > I0824 14:24:15.611868 19367 http.cpp:883] Using default 'basic' HTTP > authenticator for realm 'mesos-master-readonly' > I0824 14:24:15.612370 19367 http.cpp:883] Using default 'basic' HTTP > authenticator for realm 'mesos-master-readwrite' > I0824 14:24:15.612555 19367 http.cpp:883] Using default 'basic' HTTP > authenticator for realm 'mesos-master-scheduler' > I0824 14:24:15.612905 19367 master.cpp:583] Authorization enabled > W0824 14:24:15.612949 19367 master.cpp:646] The '--roles' flag is deprecated. > This flag will be removed in the future. See the Mesos 0.27 upgrade notes for > more information > I0824 14:24:15.624155 19356 master.cpp:1855] Elected as the leading master! > I0824 14:24:15.624238 19356 master.cpp:1551] Recovering from registrar > I0824 14:24:15.626255 19336 log.cpp:553] Attempting
[jira] [Updated] (MESOS-8051) Killing TASK_GROUP fails to kill some tasks
[ https://issues.apache.org/jira/browse/MESOS-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Dukhovniy updated MESOS-8051: Attachment: dcos-mesos-master.log.gz dcos-mesos-slave.log.gz Master and agent logs > Killing TASK_GROUP fails to kill some tasks > --- > > Key: MESOS-8051 > URL: https://issues.apache.org/jira/browse/MESOS-8051 > Project: Mesos > Issue Type: Bug > Components: agent, executor >Affects Versions: 1.4.0 >Reporter: A. Dukhovniy >Priority: Critical > Attachments: dcos-mesos-master.log.gz, dcos-mesos-slave.log.gz > > > When starting following pod definition via marathon: > {code:java} > { > "id": "/simple-pod", > "scaling": { > "kind": "fixed", > "instances": 3 > }, > "environment": { > "PING": "PONG" > }, > "containers": [ > { > "name": "ct1", > "resources": { > "cpus": 0.1, > "mem": 32 > }, > "image": { > "kind": "MESOS", > "id": "busybox" > }, > "exec": { > "command": { > "shell": "while true; do echo the current time is $(date) > > ./test-v1/clock; sleep 1; done" > } > }, > "volumeMounts": [ > { > "name": "v1", > "mountPath": "test-v1" > } > ] > }, > { > "name": "ct2", > "resources": { > "cpus": 0.1, > "mem": 32 > }, > "exec": { > "command": { > "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep > 1; done" > } > }, > "volumeMounts": [ > { > "name": "v1", > "mountPath": "etc" > }, > { > "name": "v2", > "mountPath": "docker" > } > ] > } > ], > "networks": [ > { > "mode": "host" > } > ], > "volumes": [ > { > "name": "v1" > }, > { > "name": "v2", > "host": "/var/lib/docker" > } > ] > } > {code} > mesos will successfully kill all {{ct2}} containers but fail to kill all/some > of the {{ct1}} containers. I've attached both master and agent logs. The > interesting part starts after marathon issues 6 kills: > {code:java} > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.209966 4746 master.cpp:5297] Processing > KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853d > bf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) > at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210033 4746 master.cpp:5371] Telling > agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( > 10.0.1.207) to kill task > simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1 of framework > bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at > scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 > .229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210471 4748 master.cpp:5297] Processing > KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853d > bf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) > at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210518 4748 master.cpp:5371] Telling > agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( > 10.0.1.207) to kill task > simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2 of framework > bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at > scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 > .229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210602 4748 master.cpp:5297] Processing 
> KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853d > bf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) > at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210639 4748 master.cpp:5371] Telling > agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( > 10.0.1.207) to kill task > simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework > bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at > scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 > .229:15101 > Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal > mesos-master[4708]: I1004 14:58:25.210932 4753 master.cpp:5297] Processing
[jira] [Updated] (MESOS-8005) Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky
[ https://issues.apache.org/jira/browse/MESOS-8005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-8005: --- Labels: flaky-test mesosphere (was: ) > Mesos.SlaveTest.ShutdownUnregisteredExecutor is flaky > - > > Key: MESOS-8005 > URL: https://issues.apache.org/jira/browse/MESOS-8005 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers > Labels: flaky-test, mesosphere > Attachments: jenkins.log.gz > > > Executed on Ubuntu 17.04 w/ SSL enabled: > {code} > ../../src/tests/cluster.cpp:580 > Value of: containers->empty() > Actual: false > Expected: true > Failed to destroy containers: { 86d690bc-4248-4d26-bdc7-28901d8cf2ab } > {code} -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-8051) Killing TASK_GROUP fails to kill some tasks
[ https://issues.apache.org/jira/browse/MESOS-8051?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] A. Dukhovniy updated MESOS-8051: Description: When starting following pod definition via marathon: {code:java} { "id": "/simple-pod", "scaling": { "kind": "fixed", "instances": 3 }, "environment": { "PING": "PONG" }, "containers": [ { "name": "ct1", "resources": { "cpus": 0.1, "mem": 32 }, "image": { "kind": "MESOS", "id": "busybox" }, "exec": { "command": { "shell": "while true; do echo the current time is $(date) > ./test-v1/clock; sleep 1; done" } }, "volumeMounts": [ { "name": "v1", "mountPath": "test-v1" } ] }, { "name": "ct2", "resources": { "cpus": 0.1, "mem": 32 }, "exec": { "command": { "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep 1; done" } }, "volumeMounts": [ { "name": "v1", "mountPath": "etc" }, { "name": "v2", "mountPath": "docker" } ] } ], "networks": [ { "mode": "host" } ], "volumes": [ { "name": "v1" }, { "name": "v2", "host": "/var/lib/docker" } ] } {code} mesos will successfully kill all {{ct2}} containers but fail to kill all/some of the {{ct1}} containers. I've attached both master and agent logs. The interesting part starts after marathon issues 6 kills: {code:java} Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.209966 4746 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853d bf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210033 4746 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( 10.0.1.207) to kill task simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 .229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210471 4748 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853d bf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210518 4748 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( 10.0.1.207) to kill task simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 .229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210602 4748 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853d bf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210639 4748 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( 10.0.1.207) to kill task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at 
scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 .229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210932 4753 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853d bf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210968 4753 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( 10.0.1.207) to kill task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 .229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.211210 4747 master.cpp:5297] Processing
[jira] [Created] (MESOS-8051) Killing TASK_GROUP fails to kill some tasks
A. Dukhovniy created MESOS-8051: --- Summary: Killing TASK_GROUP fails to kill some tasks Key: MESOS-8051 URL: https://issues.apache.org/jira/browse/MESOS-8051 Project: Mesos Issue Type: Bug Components: agent, executor Affects Versions: 1.4.0 Reporter: A. Dukhovniy Priority: Critical When starting following pod definition via marathon: {code:java} { "id": "/simple-pod", "scaling": { "kind": "fixed", "instances": 3 }, "environment": { "PING": "PONG" }, "containers": [ { "name": "ct1", "resources": { "cpus": 0.1, "mem": 32 }, "image": { "kind": "MESOS", "id": "busybox" }, "exec": { "command": { "shell": "while true; do echo the current time is $(date) > ./test-v1/clock; sleep 1; done" } }, "volumeMounts": [ { "name": "v1", "mountPath": "test-v1" } ] }, { "name": "ct2", "resources": { "cpus": 0.1, "mem": 32 }, "exec": { "command": { "shell": "while true; do echo -n $PING ' '; cat ./etc/clock; sleep 1; done" } }, "volumeMounts": [ { "name": "v1", "mountPath": "etc" }, { "name": "v2", "mountPath": "docker" } ] } ], "networks": [ { "mode": "host" } ], "volumes": [ { "name": "v1" }, { "name": "v2", "host": "/var/lib/docker" } ] } {code} mesos will successfully kill all {{ct2}} containers but fail to kill all/some of the {{ct1}} containers. I've attached both master and agent logs. The interesting part starts after marathon issues 6 kills: {code:java} Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.209966 4746 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853d bf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210033 4746 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( 10.0.1.207) to kill task simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 .229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210471 4748 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853d bf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210518 4748 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( 10.0.1.207) to kill task simple-pod.instance-3c1098e5-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 .229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210602 4748 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853d bf20.ct1' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210639 4748 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( 10.0.1.207) to kill task 
simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct1 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5 .229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210932 4753 master.cpp:5297] Processing KILL call for task 'simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853d bf20.ct2' of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (marathon) at scheduler-c61c493c-728f-4bd9-be60-7373574749af@10.0.5.229:15101 Oct 04 14:58:25 ip-10-0-5-229.eu-central-1.compute.internal mesos-master[4708]: I1004 14:58:25.210968 4753 master.cpp:5371] Telling agent bae11d5d-20c2-4d66-9ec3-773d1d717e58-S1 at slave(1)@10.0.1.207:5051 ( 10.0.1.207) to kill task simple-pod.instance-3c0ffca4-a914-11e7-bcd5-e63c853dbf20.ct2 of framework bae11d5d-20c2-4d66-9ec3-773d1d717e58-0001 (m
[jira] [Updated] (MESOS-7986) ExecutorHttpApiTest.ValidJsonButInvalidProtobuf and ExecutorHttpApiTest.NoContentType fail in parallel test execution
[ https://issues.apache.org/jira/browse/MESOS-7986?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-7986: --- Labels: flaky-test mesosphere (was: mesosphere) > ExecutorHttpApiTest.ValidJsonButInvalidProtobuf and > ExecutorHttpApiTest.NoContentType fail in parallel test execution > - > > Key: MESOS-7986 > URL: https://issues.apache.org/jira/browse/MESOS-7986 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.5.0 >Reporter: Benjamin Bannier > Labels: flaky-test, mesosphere > Attachments: test.log > > > When running cmake-built Mesos test in parallel, I reliably encounter failing > {{ExecutorHttpApiTest.ValidJsonButInvalidProtobuf}} or > {{ExecutorHttpApiTest.NoContentType}}, > {noformat} > $ ../support/mesos-gtest-runner.py ./src/mesos-tests -j10 > [ RUN ] ExecutorHttpApiTest.ValidJsonButInvalidProtobuf > ../src/tests/executor_http_api_tests.cpp:197: Failure > Value of: (response).get().status > Actual: "401 Unauthorized" > Expected: BadRequest().status > Which is: "400 Bad Request" > [ FAILED ] ExecutorHttpApiTest.ValidJsonButInvalidProtobuf (17 ms) > {noformat} > {noformat} > [ RUN ] ExecutorHttpApiTest.NoContentType > ../src/tests/executor_http_api_tests.cpp:158: Failure > Value of: (response).get().status > Actual: "401 Unauthorized" > Expected: BadRequest().status > Which is: "400 Bad Request" > [ FAILED ] ExecutorHttpApiTest.NoContentType (20 ms) > {noformat} > The machine has 16 physical cores. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (MESOS-3160) CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseRSS Flaky
[ https://issues.apache.org/jira/browse/MESOS-3160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-3160: --- Labels: cgroups flaky-test mesosphere (was: cgroups mesosphere) > CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseRSS Flaky > > > Key: MESOS-3160 > URL: https://issues.apache.org/jira/browse/MESOS-3160 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.24.0, 0.26.0 >Reporter: Paul Brett > Labels: cgroups, flaky-test, mesosphere > > Test will occasionally with: > [ RUN ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseUnlockedRSS > ../../src/tests/containerizer/cgroups_tests.cpp:1103: Failure > helper.increaseRSS(getpagesize()): Failed to sync with the subprocess > ../../src/tests/containerizer/cgroups_tests.cpp:1103: Failure > helper.increaseRSS(getpagesize()): The subprocess has not been spawned yet > [ FAILED ] CgroupsAnyHierarchyMemoryPressureTest.ROOT_IncreaseUnlockedRSS > (223 ms) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (MESOS-7504) Parent's mount namespace cannot be determined when launching a nested container.
[ https://issues.apache.org/jira/browse/MESOS-7504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191456#comment-16191456 ] Andrei Budnik commented on MESOS-7504: -- {{(launch).failure(): Cannot get target mount namespace from process 10991: Cannot get 'mnt' namespace for 2nd-level child process '11001': Failed to stat mnt namespace handle for pid 11001: No such file or directory}} > Parent's mount namespace cannot be determined when launching a nested > container. > > > Key: MESOS-7504 > URL: https://issues.apache.org/jira/browse/MESOS-7504 > Project: Mesos > Issue Type: Bug > Components: containerization >Affects Versions: 1.3.0 > Environment: Ubuntu 16.04 >Reporter: Alexander Rukletsov >Assignee: Andrei Budnik > Labels: containerizer, flaky-test, mesosphere > > I've observed this failure twice in different Linux environments. Here is an > example of such failure: > {noformat} > [ RUN ] > NestedMesosContainerizerTest.ROOT_CGROUPS_DestroyDebugContainerOnRecover > I0509 21:53:25.471657 17167 containerizer.cpp:221] Using isolation: > cgroups/cpu,filesystem/linux,namespaces/pid,network/cni,volume/image > I0509 21:53:25.475124 17167 linux_launcher.cpp:150] Using > /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher > I0509 21:53:25.475407 17167 provisioner.cpp:249] Using default backend > 'overlay' > I0509 21:53:25.481232 17186 containerizer.cpp:608] Recovering containerizer > I0509 21:53:25.482295 17186 provisioner.cpp:410] Provisioner recovery complete > I0509 21:53:25.482587 17187 containerizer.cpp:1001] Starting container > 21bc372c-0f2c-49f5-b8ab-8d32c232b95d for executor 'executor' of framework > I0509 21:53:25.482918 17189 cgroups.cpp:410] Creating cgroup at > '/sys/fs/cgroup/cpu,cpuacct/mesos_test_d989f526-efe0-4553-bf79-936ad66c3753/21bc372c-0f2c-49f5-b8ab-8d32c232b95d' > for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d > I0509 21:53:25.484103 17190 cpu.cpp:101] Updated 'cpu.shares' to 1024 (cpus > 1) for container 21bc372c-0f2c-49f5-b8ab-8d32c232b95d > I0509 21:53:25.484808 17186 containerizer.cpp:1524] Launching > 'mesos-containerizer' with flags '--help="false" > --launch_info="{"clone_namespaces":[131072,536870912],"command":{"shell":true,"value":"sleep > > 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}]},"pre_exec_commands":[{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/home\/ubuntu\/workspace\/mesos\/Mesos_CI-build\/FLAG\/SSL\/label\/mesos-ec2-ubuntu-16.04\/mesos\/build\/src\/mesos-containerizer"},{"shell":true,"value":"mount > -n -t proc proc \/proc -o > nosuid,noexec,nodev"}],"working_directory":"\/tmp\/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr"}" > --pipe_read="29" --pipe_write="32" > --runtime_directory="/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_sKhtj7/containers/21bc372c-0f2c-49f5-b8ab-8d32c232b95d" > --unshare_namespace_mnt="false"' > I0509 21:53:25.484978 17189 linux_launcher.cpp:429] Launching container > 21bc372c-0f2c-49f5-b8ab-8d32c232b95d and cloning with namespaces CLONE_NEWNS > | CLONE_NEWPID > I0509 21:53:25.513890 17186 containerizer.cpp:1623] Checkpointing container's > forked pid 1873 to > 
'/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_Rdjw6M/meta/slaves/frameworks/executors/executor/runs/21bc372c-0f2c-49f5-b8ab-8d32c232b95d/pids/forked.pid' > I0509 21:53:25.515878 17190 fetcher.cpp:353] Starting to fetch URIs for > container: 21bc372c-0f2c-49f5-b8ab-8d32c232b95d, directory: > /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_DestroyDebugContainerOnRecover_zlywyr > I0509 21:53:25.517715 17193 containerizer.cpp:1791] Starting nested container > 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35 > I0509 21:53:25.518569 17193 switchboard.cpp:545] Launching > 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" > --help="false" > --socket_address="/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b" > --stderr_from_fd="36" --stderr_to_fd="2" --stdin_to_fd="32" > --stdout_from_fd="33" --stdout_to_fd="1" --tty="false" > --wait_for_connection="true"' for container > 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522-622b15142e35 > I0509 21:53:25.521229 17193 switchboard.cpp:575] Created I/O switchboard > server (pid: 1881) listening on socket file > '/tmp/mesos-io-switchboard-ca463cf2-70ba-4121-a5c6-1a170ae40c1b' for > container > 21bc372c-0f2c-49f5-b8ab-8d32c232b95d.ea991d38-e1a5-44fe-a522
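The failure message quoted in the comment above comes from the agent resolving the parent container's mount namespace through procfs. The sketch below is a simplified stand-in for the namespace helpers in the Mesos codebase, not the actual implementation; it shows why the stat can fail with "No such file or directory" when the target process has already exited.
{code}
#include <sys/stat.h>
#include <sys/types.h>

#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

// The mnt namespace of a process is exposed as /proc/<pid>/ns/mnt.
// stat() fails with ENOENT once the process is gone -- which is what
// the error above reports for the 2nd-level child process.
int main(int argc, char** argv)
{
  pid_t pid = argc > 1 ? static_cast<pid_t>(std::stol(argv[1])) : 1;

  const std::string path = "/proc/" + std::to_string(pid) + "/ns/mnt";

  struct stat s;
  if (::stat(path.c_str(), &s) < 0) {
    std::cerr << "Failed to stat mnt namespace handle for pid " << pid
              << ": " << ::strerror(errno) << std::endl;
    return 1;
  }

  // The inode number identifies the namespace; two processes share a
  // mount namespace exactly when these inodes match.
  std::cout << "mnt namespace inode: " << s.st_ino << std::endl;
  return 0;
}
{code}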
[jira] [Commented] (MESOS-4812) Mesos fails to escape command health checks
[ https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16191117#comment-16191117 ] Andrei Budnik commented on MESOS-4812: -- I have closed [/r/62381|https://reviews.apache.org/r/62381/]; for details, see the comment in the discard reason. > Mesos fails to escape command health checks > --- > > Key: MESOS-4812 > URL: https://issues.apache.org/jira/browse/MESOS-4812 > Project: Mesos > Issue Type: Bug >Affects Versions: 0.25.0 >Reporter: Lukas Loesche >Assignee: Andrei Budnik > Labels: health-check, mesosphere, tech-debt > Attachments: health_task.gif > > > As described in https://github.com/mesosphere/marathon/issues/ > I would like to run a command health check > {noformat} > /bin/bash -c " {noformat} > The health check fails because Mesos, while running the command inside the double > quotes of a sh -c "", doesn't escape the double quotes in the command. > If I escape the double quotes myself, the command health check succeeds. But > this would mean that the user needs intimate knowledge of how Mesos executes > his commands, which can't be right. > I was told this is not a Marathon but a Mesos issue, so I am opening this JIRA. > I don't know if this only affects the command health check. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
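To make the quoting pitfall described in this issue concrete: interpolating a user command into a double-quoted sh -c "..." breaks as soon as the command itself contains double quotes, while POSIX single-quote escaping passes it through byte-for-byte. A small self-contained sketch follows; the helper name is illustrative and this is not the code path Mesos actually uses.
{code}
#include <iostream>
#include <string>

// POSIX-safe quoting: wrap the string in single quotes and replace
// every embedded single quote with '\'' (close, escaped quote, reopen).
std::string posixQuote(const std::string& command)
{
  std::string quoted = "'";
  for (char c : command) {
    if (c == '\'') {
      quoted += "'\\''";
    } else {
      quoted += c;
    }
  }
  quoted += "'";
  return quoted;
}

int main()
{
  const std::string check = "curl -f \"http://localhost:8080/health\"";

  // Naive: the embedded double quotes terminate the outer pair early,
  // so the shell sees a mangled command -- the failure reported above.
  std::cout << "sh -c \"" << check << "\"" << std::endl;

  // Safe: the command reaches the shell unchanged.
  std::cout << "sh -c " << posixQuote(check) << std::endl;
  return 0;
}
{code}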