[jira] [Commented] (MESOS-10239) Installing Mesos on Oracle Linux 8.3
[ https://issues.apache.org/jira/browse/MESOS-10239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605513#comment-17605513 ] Charles Natali commented on MESOS-10239: Hi [~Mar_zieh], You don't need Python to install Mesos, unless you use the Python bindings. If you're building from source, you can just pass {{--disable-python}} as described here: https://mesos.apache.org/documentation/latest/configuration/autotools/ Could you please detail the error you're getting? > Installing Mesos on Oracle Linux 8.3 > > > Key: MESOS-10239 > URL: https://issues.apache.org/jira/browse/MESOS-10239 > Project: Mesos > Issue Type: Task >Reporter: Marzieh >Priority: Major > > some new versions of Linux like Oracle Linux 8, Redhat 8 , does not support > Python2 any more,however Mesos need to Python2. So, there is no way to > install Mesos in these environments. > Would you please make Mesos updated to be installed in new Linux > distributions? -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (MESOS-10234) CVE-2021-44228 Log4j vulnerability for apache mesos
[ https://issues.apache.org/jira/browse/MESOS-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584433#comment-17584433 ] Charles Natali commented on MESOS-10234: Hi Sangita, if this is an issue for you, you can simply use whatever zookeeper version you want, you do not need to use the shipped one. We could update zookeeper separately, the shipped version is quite old and has some known bugs - [~qianzhang] what do you think? > CVE-2021-44228 Log4j vulnerability for apache mesos > --- > > Key: MESOS-10234 > URL: https://issues.apache.org/jira/browse/MESOS-10234 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.11.0 >Reporter: Sangita Nalkar >Priority: Critical > > Hi, > Wanted to know if CVE-2021-44228 Log4j vulnerability is affecting Apache > mesos. > We see that log4j v1.2.17 is used while building apache mesos from source. > Snippet from build logs: > std=c++11 -MT jvm/org/apache/libjava_la-log4j.lo -MD -MP -MF > jvm/org/apache/.deps/libjava_la-log4j.Tpo -c > ../../src/jvm/org/apache/log4j.cpp -fPIC -DPIC -o > jvm/org/apache/.libs/libjava_la-log4j.o > Thanks, > Sangita -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (MESOS-10237) Mesos-slave issue report
[ https://issues.apache.org/jira/browse/MESOS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512051#comment-17512051 ] Charles Natali commented on MESOS-10237: Hi [~feixiachao], Are you having a specific problem or just wondering about those error messages? Those errors are benign and can be ignored - they've actually been fixed in master: https://github.com/apache/mesos/commit/6bc5a5e114077f542f7258adffb78a54849ddf90 > Mesos-slave issue report > - > > Key: MESOS-10237 > URL: https://issues.apache.org/jira/browse/MESOS-10237 > Project: Mesos > Issue Type: Bug >Reporter: feixiachao >Priority: Major > > we encountered an issue about mesos-slave , the mesos.ERROR log shown as > below: > E0323 22:56:03.278918 2848 memory.cpp:502] Listening on OOM events failed > for container ff408971-b610-4f84-bbc3-81b0c6be9499: Event listener is > terminating > E0323 22:58:06.018554 2834 memory.cpp:502] Listening on OOM events failed > for container 3afa2056-1976-4857-9121-cfad0f0ba73e: Event listener is > terminating > E0323 23:12:05.261996 2816 memory.cpp:502] Listening on OOM events failed > for container 56912877-5733-4050-bce8-0cc179cc0bc8: Event listener is > terminating > Could any someone to help for this issue ? > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (MESOS-10234) CVE-2021-44228 Log4j vulnerability for apache mesos
[ https://issues.apache.org/jira/browse/MESOS-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492857#comment-17492857 ] Charles Natali commented on MESOS-10234: Hi, I cannot see an explicit dependency on log4j v1.2.17 - are you sure the build is not picking up your system's version? Then again I'm really not familiar with the java bindings. Note that the only log4j which is shipped with Mesos is part of the zookeeper version packaged: {noformat} ./build/3rdparty/zookeeper-3.4.8/lib/slf4j-log4j12-1.6.1.jar ./build/3rdparty/zookeeper-3.4.8/lib/log4j-1.2.16.LICENSE.txt ./build/3rdparty/zookeeper-3.4.8/lib/log4j-1.2.16.jar ./build/3rdparty/zookeeper-3.4.8/src/java/lib/log4j-1.2.16.LICENSE.txt ./build/3rdparty/zookeeper-3.4.8/src/contrib/loggraph/web/org/apache/zookeeper/graph/log4j.properties ./build/3rdparty/zookeeper-3.4.8/src/contrib/rest/conf/log4j.properties ./build/3rdparty/zookeeper-3.4.8/src/contrib/zooinspector/lib/log4j.properties ./build/3rdparty/zookeeper-3.4.8/conf/log4j.properties ./build/3rdparty/zookeeper-3.4.8/contrib/rest/lib/slf4j-log4j12-1.6.1.jar ./build/3rdparty/zookeeper-3.4.8/contrib/rest/lib/log4j-1.2.15.jar ./build/3rdparty/zookeeper-3.4.8/contrib/rest/conf/log4j.properties {noformat} I'm not sure if anyone uses the shipped version, but maybe we could update it, what do you think [~asekretenko]? Note that at work we experienced a zookeeper bug following a failover which IIRC caused some ephemeral nodes to not be deleted on the promoted leader, leading to inconsistencies in the Mesos registry - so updating could also solve this issue for whoever happens to use it. > CVE-2021-44228 Log4j vulnerability for apache mesos > --- > > Key: MESOS-10234 > URL: https://issues.apache.org/jira/browse/MESOS-10234 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.11.0 >Reporter: Sangita Nalkar >Priority: Critical > > Hi, > Wanted to know if CVE-2021-44228 Log4j vulnerability is affecting Apache > mesos. 
> We see that log4j v1.2.17 is used while building apache mesos from source. > Snippet from build logs: > std=c++11 -MT jvm/org/apache/libjava_la-log4j.lo -MD -MP -MF > jvm/org/apache/.deps/libjava_la-log4j.Tpo -c > ../../src/jvm/org/apache/log4j.cpp -fPIC -DPIC -o > jvm/org/apache/.libs/libjava_la-log4j.o > Thanks, > Sangita -- This message was sent by Atlassian Jira (v8.20.1#820001)
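A listing like the one in the comment above can be produced by walking the build tree and keeping every entry whose name contains "log4j". The following is only a minimal C++17 sketch of that idea; the function name {{findLog4jEntries}} is hypothetical and is not part of Mesos or its build system.

```cpp
#include <algorithm>
#include <cassert>
#include <filesystem>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

// Hypothetical helper: returns the sorted paths of all entries under `root`
// whose file name contains "log4j". A missing or unreadable root yields an
// empty result instead of throwing.
std::vector<std::string> findLog4jEntries(const fs::path& root) {
  std::vector<std::string> matches;
  std::error_code ec;

  for (fs::recursive_directory_iterator
           it(root, fs::directory_options::skip_permission_denied, ec),
           end;
       !ec && it != end;
       it.increment(ec)) {
    const std::string name = it->path().filename().string();
    if (name.find("log4j") != std::string::npos) {
      matches.push_back(it->path().string());
    }
  }

  std::sort(matches.begin(), matches.end());
  return matches;
}
```

Calling {{findLog4jEntries("./build/3rdparty")}} from the top of a source checkout would be one way to use it; the version number embedded in each jar name then shows which log4j release is bundled.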
[jira] [Commented] (MESOS-10234) CVE-2021-44228 Log4j vulnerability for apache mesos
[ https://issues.apache.org/jira/browse/MESOS-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17467229#comment-17467229 ] Charles Natali commented on MESOS-10234: Hi [~snalkar] Sorry for the delay, but Mesos has very few resources, and the holiday season doesn't help. I've had a quick look, and log4j only seems to be used for tests - Mesos is mostly written in C++, so it's not surprising. It's possible it's used in some of the included third-party dependencies, but I'd be surprised if it was exploitable. I'll have a more thorough look after the holidays. Cheers, > CVE-2021-44228 Log4j vulnerability for apache mesos > --- > > Key: MESOS-10234 > URL: https://issues.apache.org/jira/browse/MESOS-10234 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.11.0 >Reporter: Sangita Nalkar >Priority: Critical > > Hi, > Wanted to know if CVE-2021-44228 Log4j vulnerability is affecting Apache > mesos. > We see that log4j v1.2.17 is used while building apache mesos from source. > Snippet from build logs: > std=c++11 -MT jvm/org/apache/libjava_la-log4j.lo -MD -MP -MF > jvm/org/apache/.deps/libjava_la-log4j.Tpo -c > ../../src/jvm/org/apache/log4j.cpp -fPIC -DPIC -o > jvm/org/apache/.libs/libjava_la-log4j.o > Thanks, > Sangita -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Assigned] (MESOS-9657) Launching a command task twice can crash the agent
[ https://issues.apache.org/jira/browse/MESOS-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Natali reassigned MESOS-9657: - Fix Version/s: 1.12.0 Assignee: Charles Natali Resolution: Fixed > Launching a command task twice can crash the agent > -- > > Key: MESOS-9657 > URL: https://issues.apache.org/jira/browse/MESOS-9657 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Assignee: Charles Natali >Priority: Major > Fix For: 1.12.0 > > > When launching a command task, we verify that the framework has no existing > executor for that task: > {noformat} > // We are dealing with command task; a new command executor will be > // launched. > CHECK(executor == nullptr); > {noformat} > and afterwards an executor is created with the same executor id as the task > id: > {noformat} > // (slave.cpp) > // Either the master explicitly requests launching a new executor > // or we are in the legacy case of launching one if there wasn't > // one already. Either way, let's launch executor now. > if (executor == nullptr) { > Try added = framework->addExecutor(executorInfo); > [...] 
> {noformat} > This means that if we relaunch the task with the same task id before the > executor is removed, it will crash the agent: > {noformat} > F0315 16:39:32.822818 38112 slave.cpp:2865] Check failed: executor == nullptr > *** Check failure stack trace: *** > @ 0x7feb29a407af google::LogMessage::Flush() > @ 0x7feb29a43c3f google::LogMessageFatal::~LogMessageFatal() > @ 0x7feb28a5a886 mesos::internal::slave::Slave::__run() > @ 0x7feb28af4f0e > _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSK_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaISU_EERKSK_IbESG_SJ_SO_SS_SY_S11_EEvRKNS1_3PIDIT_EEMS13_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSE_OSH_OSM_OSQ_OSW_OSZ_S3_E_JSE_SH_SM_SQ_SW_SZ_St12_PlaceholderILi1EEclEOS3_ > @ 0x7feb2998a620 process::ProcessBase::consume() > @ 0x7feb29987675 process::ProcessManager::resume() > @ 0x7feb299a2d2b > _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvE3$_8E6_M_runEv > @ 0x7feb2632f523 (unknown) > @ 0x7feb25e40594 start_thread > @ 0x7feb25b73e6f __GI___clone > Aborted (core dumped) > {noformat} > Instead of crashing, the agent should just drop the task with an appropriate > error in this case. -- This message was sent by Atlassian Jira (v8.3.4#803005)
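The behaviour requested at the end of MESOS-9657 (drop the task with an error instead of aborting) can be illustrated with a small, self-contained C++ sketch. This is not the change that was committed to Mesos: {{Framework}}, {{Executor}} and {{launchCommandTask}} below are simplified, hypothetical stand-ins for the agent's real bookkeeping.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical, simplified stand-ins; these are NOT the real Mesos types.
struct Executor {
  std::string id;
};

struct Framework {
  std::map<std::string, Executor> executors;

  // Returns the executor with the given id, or nullptr if there is none.
  const Executor* getExecutor(const std::string& id) const {
    auto it = executors.find(id);
    return it == executors.end() ? nullptr : &it->second;
  }
};

// Launches a command task. Returns an error message if the task has to be
// dropped, or std::nullopt if a new command executor was created.
std::optional<std::string> launchCommandTask(
    Framework& framework, const std::string& taskId) {
  // A command task's executor reuses the task id as its executor id, so a
  // relaunch with the same task id finds the previous executor here.
  if (framework.getExecutor(taskId) != nullptr) {
    // A fatal CHECK at this point would abort the whole agent; returning
    // an error only drops this one task.
    return "Dropping task '" + taskId +
           "': an executor with the same id still exists";
  }

  framework.executors.emplace(taskId, Executor{taskId});
  assert(framework.getExecutor(taskId) != nullptr);
  return std::nullopt;
}
```

In this sketch the first launch of a task id succeeds, and a second launch with the same id before the executor is removed yields an error string that the caller could turn into a task status update.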
[jira] [Commented] (MESOS-10198) Mesos-master service is activating state
[ https://issues.apache.org/jira/browse/MESOS-10198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420368#comment-17420368 ] Charles Natali commented on MESOS-10198: [~kiranjshetty] I assume you've since moved on, so unless there is an update to this ticket soon, I will close. Cheers, > Mesos-master service is activating state > > > Key: MESOS-10198 > URL: https://issues.apache.org/jira/browse/MESOS-10198 > Project: Mesos > Issue Type: Task >Affects Versions: 1.9.0 >Reporter: Kiran J Shetty >Priority: Major > > mesos-master service showing activating state on all 3 master node and which > intern making marathon to restart frequently . in logs I can see below entry. > Mesos-master logs: > Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a864206a9 > mesos::internal::log::ReplicaProcess::ReplicaProcess() > Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a86420854 > mesos::internal::log::Replica::Replica() > Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a863b6a65 > mesos::internal::log::LogProcess::LogProcess() > Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a863b6e34 > mesos::log::Log::Log() > Nov 12 08:36:29 servername mesos-master[19867]: @ 0x561155a3ec72 main > Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a8207 > __libc_start_main > Nov 12 08:36:29 servername mesos-master[19867]: @ 0x561155a40d0a (unknown) > Nov 12 08:36:29 servername systemd[1]: mesos-master.service: main process > exited, code=killed, status=6/ABRT > Nov 12 08:36:29 servername systemd[1]: Unit mesos-master.service entered > failed state. > Nov 12 08:36:29 servername systemd[1]: mesos-master.service failed. > Nov 12 08:36:49 servername systemd[1]: mesos-master.service holdoff time > over, scheduling restart. > Nov 12 08:36:49 servername systemd[1]: Stopped Mesos Master. > Nov 12 08:36:49 servername systemd[1]: Started Mesos Master. 
> Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.633597 20024 > logging.cpp:201] INFO level logging started! > Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634446 20024 > main.cpp:243] Build: 2019-10-21 12:10:14 by centos > Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634460 20024 > main.cpp:244] Version: 1.9.0 > Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634466 20024 > main.cpp:247] Git tag: 1.9.0 > Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634470 20024 > main.cpp:251] Git SHA: 5e79a584e6ec3e9e2f96e8bf418411df9dafac2e > Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.636653 20024 > main.cpp:345] Using 'hierarchical' allocator > Nov 12 08:36:49 servername mesos-master[20037]: mesos-master: > ./db/skiplist.h:344: void leveldb::SkipList::Insert(const > Key&) [with Key = const char*; Comparator = > leveldb::MemTable::KeyComparator]: Assertion `x == __null || !Equal(key, > x->key)' failed. 
> Nov 12 08:36:49 servername mesos-master[20037]: *** Aborted at 1605150409 > (unix time) try "date -d @1605150409" if you are using GNU date *** > Nov 12 08:36:49 servername mesos-master[20037]: PC: @ 0x7fdee16ed387 > __GI_raise > Nov 12 08:36:49 servername mesos-master[20037]: *** SIGABRT (@0x4e38) > received by PID 20024 (TID 0x7fdee720ea00) from PID 20024; stack trace: *** > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee1fb2630 (unknown) > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16ed387 __GI_raise > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16eea78 __GI_abort > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16e61a6 > __assert_fail_base > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16e6252 > __GI___assert_fail > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5cf3dc2 > leveldb::SkipList<>::Insert() > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5cf3735 > leveldb::MemTable::Add() > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5d00168 > leveldb::WriteBatch::Iterate() > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5d00424 > leveldb::WriteBatchInternal::InsertInto() > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5ce8575 > leveldb::DBImpl::RecoverLogFile() > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5cec0fc > leveldb::DBImpl::Recover() > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5cec3fa > leveldb::DB::Open() > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5a0f877 > mesos::internal::log::LevelDBStorage::restore() > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5a817a2 > mesos::internal::log::ReplicaProcess::restore() > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5a846a9 > mesos::internal::log::ReplicaProcess::ReplicaProcess()
[jira] [Commented] (MESOS-10230) Please update JQuery from 3.2.1 to 3.5.0+
[ https://issues.apache.org/jira/browse/MESOS-10230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420364#comment-17420364 ] Charles Natali commented on MESOS-10230: [~apeters] Would you be able to look at this? I think [~pengels] might be referring to https://github.com/apache/mesos/blob/master/src/webui/assets/libs/jquery-3.2.1.min.js Note however that we are also using jQuery 1.10.1, which is also affected: https://github.com/apache/mesos/blob/master/site/source/assets/js/jquery-1.10.1.min.js and in mesos-site: https://github.com/apache/mesos-site/blob/asf-site/content/assets/js/jquery-1.10.1.min.js I am not at all familiar with web development, so even though I could probably update it, I wouldn't know how to check whether it broke anything. > Please update JQuery from 3.2.1 to 3.5.0+ > - > > Key: MESOS-10230 > URL: https://issues.apache.org/jira/browse/MESOS-10230 > Project: Mesos > Issue Type: Improvement > Components: security >Affects Versions: 1.11.0 >Reporter: p engels >Priority: Minor > > JQuery versions between 1.2 and 3.5.0 are vulnerable to multiple > cross-site-scripting vulnerabilities. More info can be found on JQuery's > website: > blog.jquery.com: [https://blog.jquery.com/2020/04/10/jquery-3-5-0-released/] > My organization's vulnerability scanner locates the out-of-date jquery at > this url (sanitized for security reasons): > [http://example.com:5050/assets/libs/jquery-3.2.1.min.js] > > Please remove the old version of JQuery and replace it with version 3.5.0 or > greater. If this is already planned for a future release, please comment on > this request with the version this will be fixed in. > > Keep up the good work, Apache community <3 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10228) My current problem is that after mesos-Agent added the option to support GPU, starting Docker through Marathon cannot succeed
[ https://issues.apache.org/jira/browse/MESOS-10228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420361#comment-17420361 ] Charles Natali commented on MESOS-10228: Hi [~barrylee], It's not clear to me if this is linked to the other issue you opened: https://issues.apache.org/jira/browse/MESOS-10227 Note that Marathon is a project distinct from Mesos, so you might want to report it with them (although I am not sure the project is still active). > My current problem is that after mesos-Agent added the option to support GPU, > starting Docker through Marathon cannot succeed > - > > Key: MESOS-10228 > URL: https://issues.apache.org/jira/browse/MESOS-10228 > Project: Mesos > Issue Type: Task > Components: agent, framework >Affects Versions: 1.11.0 >Reporter: barry lee >Priority: Major > Fix For: 1.11.0 > > Attachments: image-2021-08-19-19-22-51-456.png > > Original Estimate: 24h > Remaining Estimate: 24h > > My current problem is that after mesos-Agent added the option to support GPU, > starting Docker through Marathon cannot succeed. > mesos-agent \ > --master=zk://192.168.10.191:2181,192.168.10.192:2181,192.168.10.193:2181/mesos > \ > --log_dir=/var/log/mesos \ > --containerizers=docker,mesos \ > --executor_registration_timeout=5mins \ > --hostname=192.168.10.19 \ > --ip=192.168.10.19 \ > --port=5051 \ > --work_dir=/var/lib/mesos \ > --image_providers=docker \ > —executor_environment_variables="{}" \ > --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia" > > In the MESos-Agent GPU option, this is useful when there is no GPU node. > > !image-2021-08-19-19-22-51-456.png! -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10227) After mesos-agent starts, mesos-exeute fails to be executed using the GPU
[ https://issues.apache.org/jira/browse/MESOS-10227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420360#comment-17420360 ] Charles Natali commented on MESOS-10227: Hi [~barrylee], Sorry for the delay. Is this still a problem? The log you're providing is truncated, it would be useful to get: - the agent logs, when the task is started - the executor log > After mesos-agent starts, mesos-exeute fails to be executed using the GPU > - > > Key: MESOS-10227 > URL: https://issues.apache.org/jira/browse/MESOS-10227 > Project: Mesos > Issue Type: Task > Components: agent >Affects Versions: 1.11.0 > Environment: mesos-agent \ > --master=zk://192.168.10.191:2181,192.168.10.192:2181,192.168.10.193:2181/mesos > \ > --log_dir=/var/log/mesos --containerizers=docker,mesos \ > --executor_registration_timeout=5mins \ > --hostname=192.168.10.19 \ > --ip=192.168.10.19 \ > --port=5051 \ > --work_dir=/var/lib/mesos \ > --image_providers=docker \ > —executor_environment_variables="{}" \ > --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia" > > > mesos-execute \ > --master=zk://192.168.10.191:2181,192.168.10.192:2181,192.168.10.193:2181/mesos > \ > --name=gpu-test \ > --docker_image=nvidia/cuda \ > --command="nvidia-smi" \ > --framework_capabilities="GPU_RESOURCES" \ > --resources="gpus:1" > >Reporter: barry lee >Priority: Major > Fix For: 1.11.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > I0819 18:14:26.088129 9337 containerizer.cpp:3414] Transitioning the state of > container fab468e6-bcbd-499c-9c24-ccd572c8317b from PROVISIONING to > DESTROYING after 2.207289088secs > I0819 18:14:26.089609 9339 slave.cpp:7100] Executor 'gpu-test' of framework > d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027 has terminated with unknown status > I0819 18:14:26.091435 9339 slave.cpp:5981] Handling status update TASK_FAILED > (Status UUID: 0abd4e4b-59a6-4610-b624-05762ab9fc17) for task gpu-test of > framework d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027 
from @0.0.0.0:0 > E0819 18:14:26.092530 9346 slave.cpp:6357] Failed to update resources for > container fab468e6-bcbd-499c-9c24-ccd572c8317b of executor 'gpu-test' running > task gpu-test on status update for terminal task, destroying container: > Container not found > W0819 18:14:26.092737 9341 composing.cpp:614] Attempted to destroy unknown > container fab468e6-bcbd-499c-9c24-ccd572c8317b > I0819 18:14:26.092895 9331 task_status_update_manager.cpp:328] Received task > status update TASK_FAILED (Status UUID: 0abd4e4b-59a6-4610-b624-05762ab9fc17) > for task gpu-test of framework d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027 > I0819 18:14:26.093626 9333 slave.cpp:6527] Forwarding the update TASK_FAILED > (Status UUID: 0abd4e4b-59a6-4610-b624-05762ab9fc17) for task gpu-test of > framework d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027 to > master@192.168.10.192:5050 > I0819 18:14:26.102195 9342 slave.cpp:4310] Shutting down framework > d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027 > I0819 18:14:26.102257 9342 slave.cpp:7218] Cleaning up executor 'gpu-test' of > framework d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027 > I0819 18:14:26.102448 9332 gc.cpp:95] Scheduling > '/var/lib/mesos/slaves/d5cb56f3-1f2f-49e6-b63b-a401e445104d-S125/frameworks/d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027/executors/gpu-test/runs/fab468e6-bcbd-499c-9c24-ccd572c8317b' > for gc 6.988156days in the future > I0819 18:14:26.102600 9332 gc.cpp:95] Scheduling > '/var/lib/mesos/slaves/d5cb56f3-1f2f-49e6-b63b-a401e445104d-S125/frameworks/d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027/executors/gpu-test' > for gc 6.9881303111days in the future > I0819 18:14:26.102725 9342 slave.cpp:7347] Cleaning up framework > d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027 > I0819 18:14:26.102805 9335 task_status_update_manager.cpp:289] Closing task > status update streams for framework d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027 > I0819 18:14:26.102901 9342 gc.cpp:95] Scheduling > 
'/var/lib/mesos/slaves/d5cb56f3-1f2f-49e6-b63b-a401e445104d-S125/frameworks/d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027' > for gc 6.9881020741days in the future > I0819 18:14:34.385221 9334 http.cpp:1436] HTTP GET for > /files/browse?path=%2Fvar%2Flib%2Fmesos%2Fslaves%2Fd5cb56f3-1f2f-49e6-b63b-a401e445104d-S125=angular.callbacks._67 > from 192.168.110.142:11640 with User-Agent='Mozilla/5.0 (Windows NT 10.0; > Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 > Safari/537.36' > I0819 18:14:45.385519 9344 http.cpp:1436] HTTP GET for > /files/browse?path=%2Fvar%2Flib%2Fmesos%2Fslaves%2Fd5cb56f3-1f2f-49e6-b63b-a401e445104d-S125=angular.callbacks._6a > from 192.168.110.142:11690 with User-Agent='Mozilla/5.0 (Windows NT 10.0; > Win64; x64)
[jira] [Commented] (MESOS-10198) Mesos-master service is activating state
[ https://issues.apache.org/jira/browse/MESOS-10198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395290#comment-17395290 ] Charles Natali commented on MESOS-10198: Hi [~kiranjshetty], sorry for the delay, I know it's been a while. {noformat} Nov 12 08:36:49 servername mesos-master[20037]: mesos-master: ./db/skiplist.h:344: void leveldb::SkipList::Insert(const Key&) [with Key = const char*; Comparator = leveldb::MemTable::KeyComparator]: Assertion `x == __null || !Equal(key, x->key)' failed. {noformat} This points to a corruption of the on-disk leveldb database - it's been a long time, but do you remember: - whether this specific error was present in all the masters' logs? - whether the hosts crashed prior to that? - I guess it's too late, but it would have been interesting to see the log from the first time the masters crashed. Looking at our code, it's not clear to me what we could do to introduce a leveldb corruption - the only possibilities I can think of are a leveldb bug, or maybe in specific conditions some unrelated code ends up writing to the leveldb file descriptors, which could cause such a corruption. But having it occur across all masters seems very unlikely. > Mesos-master service is activating state > > > Key: MESOS-10198 > URL: https://issues.apache.org/jira/browse/MESOS-10198 > Project: Mesos > Issue Type: Task >Affects Versions: 1.9.0 >Reporter: Kiran J Shetty >Priority: Major > > mesos-master service showing activating state on all 3 master node and which > intern making marathon to restart frequently . in logs I can see below entry. 
> Mesos-master logs: > Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a864206a9 > mesos::internal::log::ReplicaProcess::ReplicaProcess() > Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a86420854 > mesos::internal::log::Replica::Replica() > Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a863b6a65 > mesos::internal::log::LogProcess::LogProcess() > Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a863b6e34 > mesos::log::Log::Log() > Nov 12 08:36:29 servername mesos-master[19867]: @ 0x561155a3ec72 main > Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a8207 > __libc_start_main > Nov 12 08:36:29 servername mesos-master[19867]: @ 0x561155a40d0a (unknown) > Nov 12 08:36:29 servername systemd[1]: mesos-master.service: main process > exited, code=killed, status=6/ABRT > Nov 12 08:36:29 servername systemd[1]: Unit mesos-master.service entered > failed state. > Nov 12 08:36:29 servername systemd[1]: mesos-master.service failed. > Nov 12 08:36:49 servername systemd[1]: mesos-master.service holdoff time > over, scheduling restart. > Nov 12 08:36:49 servername systemd[1]: Stopped Mesos Master. > Nov 12 08:36:49 servername systemd[1]: Started Mesos Master. > Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.633597 20024 > logging.cpp:201] INFO level logging started! 
> Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634446 20024 > main.cpp:243] Build: 2019-10-21 12:10:14 by centos > Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634460 20024 > main.cpp:244] Version: 1.9.0 > Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634466 20024 > main.cpp:247] Git tag: 1.9.0 > Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634470 20024 > main.cpp:251] Git SHA: 5e79a584e6ec3e9e2f96e8bf418411df9dafac2e > Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.636653 20024 > main.cpp:345] Using 'hierarchical' allocator > Nov 12 08:36:49 servername mesos-master[20037]: mesos-master: > ./db/skiplist.h:344: void leveldb::SkipList::Insert(const > Key&) [with Key = const char*; Comparator = > leveldb::MemTable::KeyComparator]: Assertion `x == __null || !Equal(key, > x->key)' failed. > Nov 12 08:36:49 servername mesos-master[20037]: *** Aborted at 1605150409 > (unix time) try "date -d @1605150409" if you are using GNU date *** > Nov 12 08:36:49 servername mesos-master[20037]: PC: @ 0x7fdee16ed387 > __GI_raise > Nov 12 08:36:49 servername mesos-master[20037]: *** SIGABRT (@0x4e38) > received by PID 20024 (TID 0x7fdee720ea00) from PID 20024; stack trace: *** > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee1fb2630 (unknown) > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16ed387 __GI_raise > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16eea78 __GI_abort > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16e61a6 > __assert_fail_base > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16e6252 > __GI___assert_fail > Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5cf3dc2 > leveldb::SkipList<>::Insert() > Nov 12 08:36:49 servername mesos-master[20037]: @
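The assertion quoted in the comment above, {{x == __null || !Equal(key, x->key)}}, fails when Insert() is handed a key equal to one already in the skip list, and the stack trace shows this happening while RecoverLogFile() replays a write batch into the memtable. The toy C++ model below only illustrates that invariant: {{std::set}} stands in for the skip list, plain strings stand in for leveldb's keys, and {{firstDuplicateKey}} is a hypothetical name, so none of this is leveldb's actual code.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Replays a sequence of keys into a toy in-memory table, in the order a
// recovery pass would insert them. Returns the index of the first key that
// is already present (the point where an assertion like the one in the log
// would fail), or -1 if every key is unique.
int firstDuplicateKey(const std::vector<std::string>& replayedKeys) {
  std::set<std::string> memtable;

  for (std::size_t i = 0; i < replayedKeys.size(); ++i) {
    // std::set::insert reports through `.second` whether the key was new.
    if (!memtable.insert(replayedKeys[i]).second) {
      return static_cast<int>(i);
    }
  }

  return -1;
}
```

In this model a well-formed log never repeats a key, so a non-negative result means the replayed data is inconsistent with what was originally written, which fits the reading of on-disk corruption given in the comment.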
[jira] [Commented] (MESOS-10200) cmake target "install" not available in 1.10.x branch
[ https://issues.apache.org/jira/browse/MESOS-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17391782#comment-17391782 ] Charles Natali commented on MESOS-10200: [~apeters] It's not quite clear to me, is it still a problem in master? > cmake target "install" not available in 1.10.x branch > - > > Key: MESOS-10200 > URL: https://issues.apache.org/jira/browse/MESOS-10200 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: 1.10.0 > Environment: OS: Mac OS X Catalina (10.15.7). >Reporter: PRUDHVI RAJ MULAGAPATI >Priority: Major > Attachments: 10198.html > > > I am trying to build mesos on Mac OS X 10.15.7 (Catalina) following the > official documentation. While in 1.10.x branch cmake target "install" is not > found. However I was able to build and install with 3.11.x and master > branches. Listed below are the available targets as shown by cmake --target > help. > > cmake --build . --target install > make: *** No rule to make target `install'. Stop. > > cmake --build . --target help > The following are some of the valid targets for this Makefile: > ... all (the default if no target is provided) > ... clean > ... depend > ... edit_cache > ... package > ... package_source > ... rebuild_cache > ... test > ... boost-1.65.0 > ... check > ... concurrentqueue-7b69a8f > ... csi_v0-0.2.0 > ... csi_v1-1.1.0 > ... dist > ... distcheck > ... elfio-3.2 > ... glog-0.4.0 > ... googletest-1.8.0 > ... grpc-1.10.0 > ... http_parser-2.6.2 > ... leveldb-1.19 > ... libarchive-3.3.2 > ... libev-4.22 > ... make_bin_include_dir > ... make_bin_java_dir > ... make_bin_jni_dir > ... make_bin_src_dir > ... nvml-352.79 > ... picojson-1.3.0 > ... protobuf-3.5.0 > ... rapidjson-1.1.0 > ... tests > ... zookeeper-3.4.8 > ... balloon-executor > ... balloon-framework > ... benchmarks > ... disk-full-framework > ... docker-no-executor-framework > ... dynamic-reservation-framework > ... example > ... examplemodule > ... fixed_resource_estimator > ... 
inverse-offer-framework > ... libprocess-tests > ... load-generator-framework > ... load_qos_controller > ... logrotate_container_logger > ... long-lived-executor > ... long-lived-framework > ... mesos > ... mesos-agent > ... mesos-cli > ... mesos-cni-port-mapper > ... mesos-containerizer > ... mesos-default-executor > ... mesos-docker-executor > ... mesos-execute > ... mesos-executor > ... mesos-fetcher > ... mesos-io-switchboard > ... mesos-local > ... mesos-log > ... mesos-logrotate-logger > ... mesos-master > ... mesos-protobufs > ... mesos-resolve > ... mesos-tcp-connect > ... mesos-tests > ... mesos-usage > ... no-executor-framework > ... operation-feedback-framework > ... persistent-volume-framework > ... process > ... stout-tests > ... test-csi-user-framework > ... test-executor > ... test-framework > ... test-helper > ... test-http-executor > ... test-http-framework > ... test-linkee > ... testallocator > ... testanonymous > ... testauthentication > ... testauthorizer > ... testcontainer_logger > ... testhook > ... testhttpauthenticator > ... testisolator > ... testmastercontender > ... testmasterdetector > ... testqos_controller > ... testresource_estimator > ... uri_disk_profile_adaptor -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10226) test suite hangs on ARM64
[ https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17391736#comment-17391736 ] Charles Natali commented on MESOS-10226: Hm, it's annoying - the gdb backtrace you posted shows that the regtest gets stuck in this test, but for some reason running this test on its own isn't enough to reproduce it. It's going to be very difficult to debug without being able to run them myself. > test suite hangs on ARM64 > - > > Key: MESOS-10226 > URL: https://issues.apache.org/jira/browse/MESOS-10226 > Project: Mesos > Issue Type: Bug >Reporter: Charles Natali >Assignee: Charles Natali >Priority: Major > Attachments: gdb-thread-apply-bt-all-29.07.2021-2.txt, > gdb-thread-apply-bt-all-29.07.2021.txt > > > Reported by [~mgrigorov]. > > {noformat} > [ RUN ] > NestedMesosContainerizerTest.ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace > sh: 1: hadoop: not found > Marked '/' as rslave > I0726 11:59:17.812630 32 exec.cpp:164] Version: 1.12.0 > I0726 11:59:17.827512 31 exec.cpp:237] Executor registered on agent > 9076f44b-846d-4f00-a2dc-11f694cc1900-S0 > I0726 11:59:17.830999 36 executor.cpp:190] Received SUBSCRIBED event > I0726 11:59:17.832351 36 executor.cpp:194] Subscribed executor on > martin-arm64 > I0726 11:59:17.832775 36 executor.cpp:190] Received LAUNCH event > I0726 11:59:17.834415 36 executor.cpp:722] Starting task > d1bbb266-bee7-4c9d-929f-16aa41f4e9cf > I0726 11:59:17.839910 36 executor.cpp:740] Forked command at 38 > Preparing rootfs at > '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791' > Changing root to > 
/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791 > Failed to execute 'sh': Exec format error > I0726 11:59:18.113488 33 executor.cpp:1041] Command exited with status 1 > (pid: 38) > ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:: > Failure > Mock function called more times than expected - returning directly. > Function call: statusUpdate(0xc28527f0, @0xa2cf3a60 136-byte > object <08-05 6C-B6 FF-FF 00-00 00-00 00-00 00-00 00-00 BE-A8 00-00 00-00 > 00-00 A8-F6 C0-B6 FF-FF 00-00 D0-04 05-94 FF-FF 00-00 A0-E6 04-94 FF-FF 00-00 > A0-F1 05-94 FF-FF 00-00 60-78 04-94 FF-FF 00-00 ... 00-00 00-00 00-00 00-00 > 20-BD 01-78 FF-FF 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 > 00-00 00-00 00-00 20-5D 87-61 A5-3F D8-41 00-00 00-00 02-00 00-00 00-00 00-00 > 03-00 00-00>) > Expected: to be called twice > Actual: called 3 times - over-saturated and active > I0726 11:59:19.117401 37 process.cpp:935] Stopped the socket accept > loop{noformat} > > I asked him to provide a gdb traceback and we can see the following: > > {noformat} > Thread 1 (Thread 0xa3bc2c60 (LWP 173475)): > #0 0xa518db20 in __libc_open64 (file=0xaaab00f342e0 > "/tmp/7VXP3w/pipe", oflag=) at > ../sysdeps/unix/sysv/linux/open64.c:48 > #1 0xa513adb0 in __GI__IO_file_open (fp=fp@entry=0xaaab00e439a0, > filename=, posix_mode=, prot=prot@entry=438, > read_write=8, is32not64=) at fileops.c:189 > #2 0xa513b0b0 in _IO_new_file_fopen (fp=fp@entry=0xaaab00e439a0, > filename=filename@entry=0xaaab00f342e0 "/tmp/7VXP3w/pipe", mode= out>, mode@entry=0xd762f3c8 "r", is32not64=is32not64@e > ntry=1) at fileops.c:281 > #3 0xa512e0dc in __fopen_internal (filename=0xaaab00f342e0 > "/tmp/7VXP3w/pipe", mode=0xd762f3c8 "r", is32=1) at iofopen.c:75 > #4 0xd54f5350 in os::read (path="/tmp/7VXP3w/pipe") at > 
../../3rdparty/stout/include/stout/os/read.hpp:136 > #5 0xd74f1c1c in > mesos::internal::tests::NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_Test::TestBody > (this=0xaaab00f88f50) at ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:1126 > {noformat} > > > Basically the test uses a named pipe to synchronize with the task being > started, and if the task fails to start - in this case because we're trying > to launch an x86 container on an arm64 host - the test will just hang reading > from the pipe. > I sent Martin a tentative fix for him to test, and I'll open an MR if > successful.
[jira] [Comment Edited] (MESOS-10226) test suite hangs on ARM64
[ https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390152#comment-17390152 ] Charles Natali edited comment on MESOS-10226 at 7/29/21, 8:44 PM: -- Hm, I can't reproduce it. I updated the test to run the arm64 alpine image so that it fails in the same way it should be failing for you, and it's not hanging, but failing: {noformat} # ./bin/mesos-tests.sh --gtest_filter=*ProvisionerDockerTest.*ROOT_INTERNET_CURL_SimpleCommand* [ RUN ] ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/0 sh: 1: hadoop: not found Marked '/' as rslave I0729 21:40:16.121507 434157 exec.cpp:164] Version: 1.12.0 I0729 21:40:16.136072 434156 exec.cpp:237] Executor registered on agent 48863f87-f283-42ab-bd93-f301fdfbd73b-S0 I0729 21:40:16.139089 434154 executor.cpp:190] Received SUBSCRIBED event I0729 21:40:16.139974 434154 executor.cpp:194] Subscribed executor on thinkpad I0729 21:40:16.140264 434154 executor.cpp:190] Received LAUNCH event I0729 21:40:16.141703 434154 executor.cpp:722] Starting task 1461a266-1ead-4bdf-9165-9c0f6c5938b8 I0729 21:40:16.147071 434154 executor.cpp:740] Forked command at 434163 Preparing rootfs at '/tmp/ContainerImage_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_0_GxQGxF/provisioner/containers/77c499a5-6d34-46aa-86a4-e993d53aa56a/backends/overlay/rootfses/629e6501-86d4-447e-bf17-412cd1cb6634' Changing root to /tmp/ContainerImage_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_0_GxQGxF/provisioner/containers/77c499a5-6d34-46aa-86a4-e993d53aa56a/backends/overlay/rootfses/629e6501-86d4-447e-bf17-412cd1cb6634 Failed to execute '/bin/ls': Exec format error I0729 21:40:16.321754 434155 executor.cpp:1041] Command exited with status 1 (pid: 434163) ../../src/tests/containerizer/provisioner_docker_tests.cpp:785: Failure Expected: TASK_FINISHED To be equal to: statusFinished->state() Which is: TASK_FAILED I0729 21:40:16.333557 434157 exec.cpp:478] Executor asked to shutdown I0729 21:40:16.334996 434158 executor.cpp:190] Received SHUTDOWN event I0729 21:40:16.335037 434158 executor.cpp:843] Shutting down [ FAILED ] ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/0, where GetParam() = "arm64v8/alpine" (5851 ms) {noformat} Could you try running {noformat} ./bin/mesos-tests.sh --gtest_filter=*ProvisionerDockerTest.*ROOT_INTERNET_CURL_SimpleCommand* --verbose {noformat} and see if it hangs, and post the result? Worst case we could just ignore the hang and update the test to use the arm64 image so it passes, but I'd like to understand why it hangs.
[jira] [Comment Edited] (MESOS-10226) test suite hangs on ARM64
[ https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390058#comment-17390058 ] Charles Natali edited comment on MESOS-10226 at 7/29/21, 6:09 PM: -- [~mgrigorov] Looking at the code corresponding to the backtrace, I don't think it should hang forever, but only for up to 10 minutes: {noformat} #13 0xb7ca1418 in AwaitAssertReady (expr=0xba1c1d58 "statusStarting", actual=..., duration=...) at ../../3rdparty/libprocess/include/process/gtest.hpp:126 #14 0xb97c588c in mesos::internal::tests::ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_Test::TestBody (this=0xcd4207a0) at ../../src/tests/containerizer/provisioner_docker_tests.cpp:782 {noformat} {noformat} AWAIT_READY_FOR(statusStarting, Minutes(10));{noformat} Are you sure it was stuck indefinitely and not just taking a long time? Also, it would help to have the output of running the tests with {{--verbose}}. > test suite hangs on ARM64 > - > > Key: MESOS-10226 > URL: https://issues.apache.org/jira/browse/MESOS-10226 > Project: Mesos > Issue Type: Bug >Reporter: Charles Natali >Assignee: Charles Natali >Priority: Major > Attachments: gdb-thread-apply-bt-all-29.07.2021.txt > > > Reported by [~mgrigorov]. 
> > {noformat} > [ RUN ] > NestedMesosContainerizerTest.ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace > sh: 1: hadoop: not found > Marked '/' as rslave > I0726 11:59:17.812630 32 exec.cpp:164] Version: 1.12.0 > I0726 11:59:17.827512 31 exec.cpp:237] Executor registered on agent > 9076f44b-846d-4f00-a2dc-11f694cc1900-S0 > I0726 11:59:17.830999 36 executor.cpp:190] Received SUBSCRIBED event > I0726 11:59:17.832351 36 executor.cpp:194] Subscribed executor on > martin-arm64 > I0726 11:59:17.832775 36 executor.cpp:190] Received LAUNCH event > I0726 11:59:17.834415 36 executor.cpp:722] Starting task > d1bbb266-bee7-4c9d-929f-16aa41f4e9cf > I0726 11:59:17.839910 36 executor.cpp:740] Forked command at 38 > Preparing rootfs at > '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791' > Changing root to > /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791 > Failed to execute 'sh': Exec format error > I0726 11:59:18.113488 33 executor.cpp:1041] Command exited with status 1 > (pid: 38) > ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:: > Failure > Mock function called more times than expected - returning directly. > Function call: statusUpdate(0xc28527f0, @0xa2cf3a60 136-byte > object <08-05 6C-B6 FF-FF 00-00 00-00 00-00 00-00 00-00 BE-A8 00-00 00-00 > 00-00 A8-F6 C0-B6 FF-FF 00-00 D0-04 05-94 FF-FF 00-00 A0-E6 04-94 FF-FF 00-00 > A0-F1 05-94 FF-FF 00-00 60-78 04-94 FF-FF 00-00 ... 
00-00 00-00 00-00 00-00 > 20-BD 01-78 FF-FF 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 > 00-00 00-00 00-00 20-5D 87-61 A5-3F D8-41 00-00 00-00 02-00 00-00 00-00 00-00 > 03-00 00-00>) > Expected: to be called twice > Actual: called 3 times - over-saturated and active > I0726 11:59:19.117401 37 process.cpp:935] Stopped the socket accept > loop{noformat} > > I asked him to provide a gdb traceback and we can see the following: > > {noformat} > Thread 1 (Thread 0xa3bc2c60 (LWP 173475)): > #0 0xa518db20 in __libc_open64 (file=0xaaab00f342e0 > "/tmp/7VXP3w/pipe", oflag=) at > ../sysdeps/unix/sysv/linux/open64.c:48 > #1 0xa513adb0 in __GI__IO_file_open (fp=fp@entry=0xaaab00e439a0, > filename=, posix_mode=, prot=prot@entry=438, > read_write=8, is32not64=) at fileops.c:189 > #2 0xa513b0b0 in _IO_new_file_fopen (fp=fp@entry=0xaaab00e439a0, > filename=filename@entry=0xaaab00f342e0
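For context on the bounded wait mentioned in the comment above: AWAIT_READY_FOR(statusStarting, Minutes(10)) gives up once the duration has elapsed instead of blocking indefinitely. A rough standard-library analogue is sketched here, assuming a plain std::future rather than a libprocess Future; awaitReadyFor is a hypothetical name and this is not the libprocess implementation.

```cpp
#include <chrono>
#include <future>

// Hypothetical helper, NOT the libprocess implementation: waits for `future`
// to become ready, but for no longer than `timeout`.
// Returns true if the future is ready, false if the timeout expired first.
template <typename T>
bool awaitReadyFor(
    const std::future<T>& future,
    std::chrono::milliseconds timeout)
{
  return future.wait_for(timeout) == std::future_status::ready;
}
```

A test built on a helper like this can report a failure after the timeout, which is why a wait of this kind should take at most the configured duration rather than hang forever.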
[jira] [Commented] (MESOS-10226) test suite hangs on ARM64
[ https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390055#comment-17390055 ] Charles Natali commented on MESOS-10226: Thanks, I'll have a look - I hope there won't be too many hanging tests... > test suite hangs on ARM64 > - > > Key: MESOS-10226 > URL: https://issues.apache.org/jira/browse/MESOS-10226 > Project: Mesos > Issue Type: Bug >Reporter: Charles Natali >Assignee: Charles Natali >Priority: Major > Attachments: gdb-thread-apply-bt-all-29.07.2021.txt > > > Reported by [~mgrigorov]. > > {noformat} > [ RUN ] > NestedMesosContainerizerTest.ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace > sh: 1: hadoop: not found > Marked '/' as rslave > I0726 11:59:17.812630 32 exec.cpp:164] Version: 1.12.0 > I0726 11:59:17.827512 31 exec.cpp:237] Executor registered on agent > 9076f44b-846d-4f00-a2dc-11f694cc1900-S0 > I0726 11:59:17.830999 36 executor.cpp:190] Received SUBSCRIBED event > I0726 11:59:17.832351 36 executor.cpp:194] Subscribed executor on > martin-arm64 > I0726 11:59:17.832775 36 executor.cpp:190] Received LAUNCH event > I0726 11:59:17.834415 36 executor.cpp:722] Starting task > d1bbb266-bee7-4c9d-929f-16aa41f4e9cf > I0726 11:59:17.839910 36 executor.cpp:740] Forked command at 38 > Preparing rootfs at > '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791' > Changing root to > /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791 > Failed to execute 'sh': Exec format error > I0726 11:59:18.113488 33 executor.cpp:1041] Command exited with status 1 > (pid: 38) > ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:: > Failure 
> Mock function called more times than expected - returning directly. > Function call: statusUpdate(0xc28527f0, @0xa2cf3a60 136-byte > object <08-05 6C-B6 FF-FF 00-00 00-00 00-00 00-00 00-00 BE-A8 00-00 00-00 > 00-00 A8-F6 C0-B6 FF-FF 00-00 D0-04 05-94 FF-FF 00-00 A0-E6 04-94 FF-FF 00-00 > A0-F1 05-94 FF-FF 00-00 60-78 04-94 FF-FF 00-00 ... 00-00 00-00 00-00 00-00 > 20-BD 01-78 FF-FF 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 > 00-00 00-00 00-00 20-5D 87-61 A5-3F D8-41 00-00 00-00 02-00 00-00 00-00 00-00 > 03-00 00-00>) > Expected: to be called twice > Actual: called 3 times - over-saturated and active > I0726 11:59:19.117401 37 process.cpp:935] Stopped the socket accept > loop{noformat} > > I asked him to provide a gdb traceback and we can see the following: > > {noformat} > Thread 1 (Thread 0xa3bc2c60 (LWP 173475)): > #0 0xa518db20 in __libc_open64 (file=0xaaab00f342e0 > "/tmp/7VXP3w/pipe", oflag=) at > ../sysdeps/unix/sysv/linux/open64.c:48 > #1 0xa513adb0 in __GI__IO_file_open (fp=fp@entry=0xaaab00e439a0, > filename=, posix_mode=, prot=prot@entry=438, > read_write=8, is32not64=) at fileops.c:189 > #2 0xa513b0b0 in _IO_new_file_fopen (fp=fp@entry=0xaaab00e439a0, > filename=filename@entry=0xaaab00f342e0 "/tmp/7VXP3w/pipe", mode= out>, mode@entry=0xd762f3c8 "r", is32not64=is32not64@e > ntry=1) at fileops.c:281 > #3 0xa512e0dc in __fopen_internal (filename=0xaaab00f342e0 > "/tmp/7VXP3w/pipe", mode=0xd762f3c8 "r", is32=1) at iofopen.c:75 > #4 0xd54f5350 in os::read (path="/tmp/7VXP3w/pipe") at > ../../3rdparty/stout/include/stout/os/read.hpp:136 > #5 0xd74f1c1c in > mesos::internal::tests::NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_Test::TestBody > (this=0xaaab00f88f50) at ../../src/tests/containeri > zer/nested_mesos_containerizer_tests.cpp:1126 > {noformat} > > > Basically the test uses a named pipe to synchronize with the task being > started, and if the task fails to start - in this case because 
we're trying > to launch an x86 container on an arm64 host - the test will just hang reading > from the pipe. > I sent Martin a tentative fix for him to test, and I'll open an MR if > successful.
[jira] [Created] (MESOS-10226) test suite hangs on ARM64
Charles Natali created MESOS-10226: -- Summary: test suite hangs on ARM64 Key: MESOS-10226 URL: https://issues.apache.org/jira/browse/MESOS-10226 Project: Mesos Issue Type: Bug Reporter: Charles Natali Assignee: Charles Natali Reported by [~mgrigorov]. {noformat} [ RUN ] NestedMesosContainerizerTest.ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace sh: 1: hadoop: not found Marked '/' as rslave I0726 11:59:17.812630 32 exec.cpp:164] Version: 1.12.0 I0726 11:59:17.827512 31 exec.cpp:237] Executor registered on agent 9076f44b-846d-4f00-a2dc-11f694cc1900-S0 I0726 11:59:17.830999 36 executor.cpp:190] Received SUBSCRIBED event I0726 11:59:17.832351 36 executor.cpp:194] Subscribed executor on martin-arm64 I0726 11:59:17.832775 36 executor.cpp:190] Received LAUNCH event I0726 11:59:17.834415 36 executor.cpp:722] Starting task d1bbb266-bee7-4c9d-929f-16aa41f4e9cf I0726 11:59:17.839910 36 executor.cpp:740] Forked command at 38 Preparing rootfs at '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791' Changing root to /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791 Failed to execute 'sh': Exec format error I0726 11:59:18.113488 33 executor.cpp:1041] Command exited with status 1 (pid: 38) ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:: Failure Mock function called more times than expected - returning directly. Function call: statusUpdate(0xc28527f0, @0xa2cf3a60 136-byte object <08-05 6C-B6 FF-FF 00-00 00-00 00-00 00-00 00-00 BE-A8 00-00 00-00 00-00 A8-F6 C0-B6 FF-FF 00-00 D0-04 05-94 FF-FF 00-00 A0-E6 04-94 FF-FF 00-00 A0-F1 05-94 FF-FF 00-00 60-78 04-94 FF-FF 00-00 ... 
00-00 00-00 00-00 00-00 20-BD 01-78 FF-FF 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 20-5D 87-61 A5-3F D8-41 00-00 00-00 02-00 00-00 00-00 00-00 03-00 00-00>) Expected: to be called twice Actual: called 3 times - over-saturated and active I0726 11:59:19.117401 37 process.cpp:935] Stopped the socket accept loop{noformat} I asked him to provide a gdb backtrace and we can see the following: {noformat} Thread 1 (Thread 0xa3bc2c60 (LWP 173475)): #0 0xa518db20 in __libc_open64 (file=0xaaab00f342e0 "/tmp/7VXP3w/pipe", oflag=) at ../sysdeps/unix/sysv/linux/open64.c:48 #1 0xa513adb0 in __GI__IO_file_open (fp=fp@entry=0xaaab00e439a0, filename=, posix_mode=, prot=prot@entry=438, read_write=8, is32not64=) at fileops.c:189 #2 0xa513b0b0 in _IO_new_file_fopen (fp=fp@entry=0xaaab00e439a0, filename=filename@entry=0xaaab00f342e0 "/tmp/7VXP3w/pipe", mode=, mode@entry=0xd762f3c8 "r", is32not64=is32not64@entry=1) at fileops.c:281 #3 0xa512e0dc in __fopen_internal (filename=0xaaab00f342e0 "/tmp/7VXP3w/pipe", mode=0xd762f3c8 "r", is32=1) at iofopen.c:75 #4 0xd54f5350 in os::read (path="/tmp/7VXP3w/pipe") at ../../3rdparty/stout/include/stout/os/read.hpp:136 #5 0xd74f1c1c in mesos::internal::tests::NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_Test::TestBody (this=0xaaab00f88f50) at ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:1126 {noformat} Basically the test uses a named pipe to synchronize with the task being started, and if the task fails to start - in this case because we're trying to launch an x86 container on an arm64 host - the test will just hang reading from the pipe. I sent Martin a tentative fix for him to test, and I'll open an MR if successful.
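One way to keep a test like this from hanging is to bound the wait on the named pipe. The sketch below is only an illustration and is not the tentative fix mentioned above; readFifoWithTimeout is a hypothetical helper, not part of stout. It assumes a POSIX system, opens the FIFO with O_NONBLOCK so that open() returns even when no writer exists, and then uses poll() to wait for data for a limited time.

```cpp
#include <fcntl.h>
#include <poll.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

#include <string>

// Hypothetical helper (not part of stout): reads once from the FIFO at
// `path`, giving up after `timeoutMs` milliseconds if no data arrives.
// Returns true and fills `*out` on success, false on timeout or error.
bool readFifoWithTimeout(
    const std::string& path,
    int timeoutMs,
    std::string* out)
{
  // With O_NONBLOCK, opening the read end returns immediately even if no
  // writer has opened the FIFO yet; a blocking open() would wait for one.
  int fd = ::open(path.c_str(), O_RDONLY | O_NONBLOCK);
  if (fd < 0) {
    return false;
  }

  bool success = false;

  // Wait until data is available or the timeout expires.
  struct pollfd pfd = {fd, POLLIN, 0};
  int ready = ::poll(&pfd, 1, timeoutMs);

  if (ready > 0 && (pfd.revents & POLLIN)) {
    char buffer[256];
    ssize_t n = ::read(fd, buffer, sizeof(buffer));
    if (n > 0) {
      out->assign(buffer, static_cast<size_t>(n));
      success = true;
    }
  }

  ::close(fd);
  return success;
}
```

With a helper of this kind, a task that never starts would make the test fail after the timeout instead of blocking in open(), which is where the backtrace above shows the thread waiting.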
[jira] [Commented] (MESOS-9352) Data in persistent volume deleted accidentally when using Docker container and Persistent volume
[ https://issues.apache.org/jira/browse/MESOS-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384479#comment-17384479 ] Charles Natali commented on MESOS-9352: --- If it's fixed feel free to close! > Data in persistent volume deleted accidentally when using Docker container > and Persistent volume > > > Key: MESOS-9352 > URL: https://issues.apache.org/jira/browse/MESOS-9352 > Project: Mesos > Issue Type: Bug > Components: agent, containerization, docker >Affects Versions: 1.5.1, 1.5.2 > Environment: DCOS 1.11.6 > Mesos 1.5.2 >Reporter: David Ko >Assignee: Joseph Wu >Priority: Critical > Labels: dcos, dcos-1.11.6, mesosphere, persistent-volumes > Attachments: image-2018-10-24-22-20-51-059.png, > image-2018-10-24-22-21-13-399.png > > > Using docker image w/ persistent volume to start a service, it will cause > data in persistent volume deleted accidentally when task killed and > restarted, also old mount points not unmounted, even the service already > deleted. 
> *The expected result should be data in persistent volume kept until task > deleted completely, also dangling mount points should be unmounted correctly.* > > *Step 1:* Use below JSON config to create a Mysql server using Docker image > and Persistent Volume > {code:javascript} > { > "env": { > "MYSQL_USER": "wordpress", > "MYSQL_PASSWORD": "secret", > "MYSQL_ROOT_PASSWORD": "supersecret", > "MYSQL_DATABASE": "wordpress" > }, > "id": "/mysqlgc", > "backoffFactor": 1.15, > "backoffSeconds": 1, > "constraints": [ > [ > "hostname", > "IS", > "172.27.12.216" > ] > ], > "container": { > "portMappings": [ > { > "containerPort": 3306, > "hostPort": 0, > "protocol": "tcp", > "servicePort": 1 > } > ], > "type": "DOCKER", > "volumes": [ > { > "persistent": { > "type": "root", > "size": 1000, > "constraints": [] > }, > "mode": "RW", > "containerPath": "mysqldata" > }, > { > "containerPath": "/var/lib/mysql", > "hostPath": "mysqldata", > "mode": "RW" > } > ], > "docker": { > "image": "mysql", > "forcePullImage": false, > "privileged": false, > "parameters": [] > } > }, > "cpus": 1, > "disk": 0, > "instances": 1, > "maxLaunchDelaySeconds": 3600, > "mem": 512, > "gpus": 0, > "networks": [ > { > "mode": "container/bridge" > } > ], > "residency": { > "relaunchEscalationTimeoutSeconds": 3600, > "taskLostBehavior": "WAIT_FOREVER" > }, > "requirePorts": false, > "upgradeStrategy": { > "maximumOverCapacity": 0, > "minimumHealthCapacity": 0 > }, > "killSelection": "YOUNGEST_FIRST", > "unreachableStrategy": "disabled", > "healthChecks": [], > "fetch": [] > } > {code} > *Step 2:* Kill mysqld process to force rescheduling new Mysql task, but found > 2 mount points to the same persistent volume, it means old mount point did > not be unmounted immediately. > !image-2018-10-24-22-20-51-059.png! > *Step 3:* After GC, data in persistent volume was deleted accidentally, but > mysqld (Mesos task) still running > !image-2018-10-24-22-21-13-399.png! 
> *Step 4:* Delete the Mysql service from Marathon. The mount points still cannot > be unmounted, even though the service has been deleted. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10223) Crashes on ARM64 due to bad interaction of libunwind with libgcc.
[ https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372475#comment-17372475 ] Charles Natali commented on MESOS-10223: It must be a different issue then. Could you run {noformat} # ./bin/mesos-tests.sh --verbose > mesos-tests.log 2>&1{noformat} And post the result? > Crashes on ARM64 due to bad interaction of libunwind with libgcc. > -- > > Key: MESOS-10223 > URL: https://issues.apache.org/jira/browse/MESOS-10223 > Project: Mesos > Issue Type: Bug >Reporter: Martin Tzvetanov Grigorov >Assignee: Charles Natali >Priority: Major > Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, > mesos-on-arm64.tgz, sudo_make_check_output.txt > > > Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors: > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.NumberFormat > [ OK ] JsonTest.NumberFormat (0 ms) > [ RUN ] JsonTest.Find > terminate called after throwing an instance of > 'boost::exception_detail::clone_impl > >' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from > PID 2317; stack trace: *** > @ 0xa80e77fc ([vdso]+0x7fb) > @ 0xa7b71188 gsignal > @ 0xa7b5ddac abort > @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d715b0 __cxa_rethrow > @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d71544 __cxa_throw > @ 0xab4ee114 boost::throw_exception<>() > @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>() > @ 0xab5c2228 boost::lexical_cast<>() > @ 0xab5bf89c numify<>() > @ 0xab5e00e8 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5cdd2c JsonTest_Find_Test::TestBody() > @ 0xab886fec > 
testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87f1d4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab85a9d0 testing::Test::Run() > @ 0xab85b258 testing::TestInfo::Run() > @ 0xab85b8d0 testing::TestCase::Run() > @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests() > @ 0xab888440 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87ffd4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab86100c testing::UnitTest::Run() > @ 0xab630950 RUN_ALL_TESTS() > @ 0xab630418 main > @ 0xa7b5e110 __libc_start_main > @ 0xab4b41d4 (unknown) > [FAIL]: 8 shard(s) have failed tests > make[6]: *** [Makefile:2092: check-local] Error 8 > make[6]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[5]: *** [Makefile:1840: check-am] Error 2 > make[5]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[4]: *** [Makefile:1685: check-recursive] Error 1 > make[4]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[3]: *** [Makefile:1842: check] Error 2 > make[3]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[2]: *** [Makefile:1153: check-recursive] Error 1 > make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make[1]: *** [Makefile:1306: check] Error 2 > make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make: *** [Makefile:785: check-recursive] Error 1 > {code} > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.InvalidUTF8 > [ OK ] JsonTest.InvalidUTF8 (0 ms) > [ RUN ] JsonTest.ParseError > terminate called after throwing an instance of 'std::overflow_error' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from > PID 2316; stack trace: *** > @ 0x918dd7fc 
([vdso]+0x7fb) > @ 0x91367188 gsignal > @ 0x91353dac abort > @ 0x91569848 __gnu_cxx::__verbose_terminate_handler() > @ 0x915671ec (unknown) > @ 0x91567250 std::terminate() > @ 0x915675b0 __cxa_rethrow > @
[jira] [Commented] (MESOS-10223) Crashes on ARM64 due to bad interaction of libunwind with libgcc.
[ https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371617#comment-17371617 ] Charles Natali commented on MESOS-10223: [~mgrigorov] The hang should be fixed in master - it'd be great if you could give it a try. > Crashes on ARM64 due to bad interaction of libunwind with libgcc. > -- > > Key: MESOS-10223 > URL: https://issues.apache.org/jira/browse/MESOS-10223 > Project: Mesos > Issue Type: Bug >Reporter: Martin Tzvetanov Grigorov >Assignee: Charles Natali >Priority: Major > Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, > mesos-on-arm64.tgz, sudo_make_check_output.txt > > > Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors: > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.NumberFormat > [ OK ] JsonTest.NumberFormat (0 ms) > [ RUN ] JsonTest.Find > terminate called after throwing an instance of > 'boost::exception_detail::clone_impl > >' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from > PID 2317; stack trace: *** > @ 0xa80e77fc ([vdso]+0x7fb) > @ 0xa7b71188 gsignal > @ 0xa7b5ddac abort > @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d715b0 __cxa_rethrow > @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d71544 __cxa_throw > @ 0xab4ee114 boost::throw_exception<>() > @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>() > @ 0xab5c2228 boost::lexical_cast<>() > @ 0xab5bf89c numify<>() > @ 0xab5e00e8 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5cdd2c JsonTest_Find_Test::TestBody() > @ 0xab886fec > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 
0xab87f1d4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab85a9d0 testing::Test::Run() > @ 0xab85b258 testing::TestInfo::Run() > @ 0xab85b8d0 testing::TestCase::Run() > @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests() > @ 0xab888440 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87ffd4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab86100c testing::UnitTest::Run() > @ 0xab630950 RUN_ALL_TESTS() > @ 0xab630418 main > @ 0xa7b5e110 __libc_start_main > @ 0xab4b41d4 (unknown) > [FAIL]: 8 shard(s) have failed tests > make[6]: *** [Makefile:2092: check-local] Error 8 > make[6]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[5]: *** [Makefile:1840: check-am] Error 2 > make[5]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[4]: *** [Makefile:1685: check-recursive] Error 1 > make[4]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[3]: *** [Makefile:1842: check] Error 2 > make[3]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[2]: *** [Makefile:1153: check-recursive] Error 1 > make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make[1]: *** [Makefile:1306: check] Error 2 > make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make: *** [Makefile:785: check-recursive] Error 1 > {code} > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.InvalidUTF8 > [ OK ] JsonTest.InvalidUTF8 (0 ms) > [ RUN ] JsonTest.ParseError > terminate called after throwing an instance of 'std::overflow_error' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from > PID 2316; stack trace: *** > @ 0x918dd7fc ([vdso]+0x7fb) > @ 0x91367188 gsignal > @ 0x91353dac abort > @ 0x91569848 
__gnu_cxx::__verbose_terminate_handler() > @ 0x915671ec (unknown) > @ 0x91567250 std::terminate() > @ 0x915675b0 __cxa_rethrow > @ 0x915697e4 __gnu_cxx::__verbose_terminate_handler() > @
[jira] [Commented] (MESOS-10225) mention that systemd agent unit should have Delegate=yes
[ https://issues.apache.org/jira/browse/MESOS-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370808#comment-17370808 ] Charles Natali commented on MESOS-10225: Good question - I think having a dedicated section might be better, maybe "Interaction with systemd" or something like that? > mention that systemd agent unit should have Delegate=yes > > > Key: MESOS-10225 > URL: https://issues.apache.org/jira/browse/MESOS-10225 > Project: Mesos > Issue Type: Documentation >Reporter: Charles Natali >Assignee: Andreas Peters >Priority: Major > > If managed by systemd, the agent unit should have > [Delegate=yes|https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#Delegate=] > to prevent systemd from manipulating cgroups created by the agent, which can > break things quite badly. > See for example https://issues.apache.org/jira/browse/MESOS-3488 and > https://issues.apache.org/jira/browse/MESOS-3009 for the kind of problems it > causes. > I think it's quite important and should feature prominently in the > documentation, maybe in the agent configuration page > [http://mesos.apache.org/documentation/latest/configuration/agent/] ? > > [~surahman] or [~apeters] if either of you wants to have a look at it, I > think it's important that at least someone is familiar with the documentation > part. >
[jira] [Commented] (MESOS-10225) mention that systemd agent unit should have Delegate=yes
[ https://issues.apache.org/jira/browse/MESOS-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370208#comment-17370208 ] Charles Natali commented on MESOS-10225: Thanks Andreas, that'd be great - hopefully it will spare users some surprises. > mention that systemd agent unit should have Delegate=yes > > > Key: MESOS-10225 > URL: https://issues.apache.org/jira/browse/MESOS-10225 > Project: Mesos > Issue Type: Documentation >Reporter: Charles Natali >Assignee: Andreas Peters >Priority: Major > > If managed by systemd, the agent unit should have > [Delegate=yes|https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#Delegate=] > to prevent systemd from manipulating cgroups created by the agent, which can > break things quite badly. > See for example https://issues.apache.org/jira/browse/MESOS-3488 and > https://issues.apache.org/jira/browse/MESOS-3009 for the kind of problems it > causes. > I think it's quite important and should feature prominently in the > documentation, maybe in the agent configuration page > [http://mesos.apache.org/documentation/latest/configuration/agent/] ? > > [~surahman] or [~apeters] if either of you wants to have a look at it, I > think it's important that at least someone is familiar with the documentation > part. >
[jira] [Created] (MESOS-10225) mention that systemd agent unit should have Delegate=yes
Charles Natali created MESOS-10225: -- Summary: mention that systemd agent unit should have Delegate=yes Key: MESOS-10225 URL: https://issues.apache.org/jira/browse/MESOS-10225 Project: Mesos Issue Type: Documentation Reporter: Charles Natali If managed by systemd, the agent unit should have [Delegate=yes|https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#Delegate=] to prevent systemd from manipulating cgroups created by the agent, which can break things quite badly. See for example https://issues.apache.org/jira/browse/MESOS-3488 and https://issues.apache.org/jira/browse/MESOS-3009 for the kind of problems it causes. I think it's quite important and should feature prominently in the documentation, maybe in the agent configuration page [http://mesos.apache.org/documentation/latest/configuration/agent/] ? [~surahman] or [~apeters] if either of you wants to have a look at it, I think it's important that at least someone is familiar with the documentation part.
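A minimal sketch of how the Delegate=yes setting described above could be applied as a systemd drop-in, without editing the packaged unit file. The function names and the unit name mesos-agent.service are assumptions made for illustration (the actual unit name depends on how Mesos was packaged); only the Delegate= directive and the drop-in directory layout come from systemd itself.

```shell
#!/bin/sh
# Emit the drop-in contents: a [Service] section that turns on cgroup delegation.
render_delegate_dropin() {
    printf '[Service]\nDelegate=yes\n'
}

# Write the drop-in for the given unit.
#   $1: unit name. ASSUMPTION: "mesos-agent.service" in the usage note is a
#       hypothetical name, not taken from the Mesos documentation.
#   $2: systemd configuration root (defaults to /etc/systemd/system).
install_delegate_dropin() {
    unit="$1"
    root="${2:-/etc/systemd/system}"
    dir="$root/$unit.d"
    mkdir -p "$dir" && render_delegate_dropin > "$dir/delegate.conf"
}
```

Possible usage, as root: run install_delegate_dropin mesos-agent.service, then systemctl daemon-reload, and restart the agent unit. Running systemctl show -p Delegate on the unit afterwards should show whether the setting took effect.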
[jira] [Commented] (MESOS-10129) Build fails on Maven javadoc generation when using JDK11
[ https://issues.apache.org/jira/browse/MESOS-10129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369024#comment-17369024 ] Charles Natali commented on MESOS-10129: Hey [~csaltos] , sorry for the long delay. [~surahman] could you maybe have a look at this? I usually build with {{--disable-java}} but since [~csaltos] provided an MR and the fix is one-line, it'd be good to merge. Maybe just try and reproduce the problem on your machine? > Build fails on Maven javadoc generation when using JDK11 > > > Key: MESOS-10129 > URL: https://issues.apache.org/jira/browse/MESOS-10129 > Project: Mesos > Issue Type: Bug > Components: build >Affects Versions: master, 1.10.0 > Environment: Debian 10 Buster (2020-04-29) with OpenJdk 11.0.7 > (2020-04-14) >Reporter: Carlos Saltos >Priority: Major > Labels: Java11, beginner, build, java11, jdk11 > Attachments: mesos.10.0.maven.javadoc.fix.patch > > > h3. CURRENT BEHAVIOR: > When using Java 11 (or newer versions) the Javadoc generation step fails with > the error: > {{[ERROR] Failed to execute goal > org.apache.maven.plugins:maven-javadoc-plugin:2.8.1:jar > (build-and-attach-javadocs) on project mesos: MavenReportException: Error > while creating archive:}} > {{[ERROR] Exit code: 1 - javadoc: error - The code being documented uses > modules but the packages defined in > http://download.oracle.com/javase/6/docs/api/ are in the unnamed module.}} > {{[ERROR]}} > {{[ERROR] Command line was: /usr/lib/jvm/java-11-openjdk-amd64/bin/javadoc > @options}} > {{[ERROR]}} > {{[ERROR] Refer to the generated Javadoc files in > '/home/admin/mesos-deb-packaging/mesos-repo/build/src/java/target/apidocs' > dir.}} > {{[ERROR] -> [Help 1]}} > {{[ERROR]}} > {{[ERROR] To see the full stack trace of the errors, re-run Maven with the -e > switch.}} > {{[ERROR] Re-run Maven using the -X switch to enable full debug logging.}} > {{[ERROR]}} > {{[ERROR] For more information about the errors and possible solutions, > please read the 
following articles:}} > {{[ERROR] [Help 1] > http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException}} > {{make[1]: *** [Makefile:17533: java/target/mesos-1.11.0.jar] Error 1}} > {{make[1]: Leaving directory > '/home/admin/mesos-deb-packaging/mesos-repo/build/src'}} > {{make: *** [Makefile:785: all-recursive] Error 1}} > *NOTE:* The error occurs in the Maven javadoc plugin call when it tries to > include references to the nonexistent old Java 6 documentation. > h3. POSSIBLE SOLUTION: > Just remove the old reference by adding > false to the javadoc maven plugin > configuration section
[jira] [Commented] (MESOS-9950) memory cgroup gone before isolator cleaning up
[ https://issues.apache.org/jira/browse/MESOS-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369023#comment-17369023 ] Charles Natali commented on MESOS-9950: --- [~subhajitpalit] So did you check the systemd configuration? > memory cgroup gone before isolator cleaning up > -- > > Key: MESOS-9950 > URL: https://issues.apache.org/jira/browse/MESOS-9950 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: longfei >Priority: Major > > The memcg created by mesos may have been deleted before cgroup/memory > isolator cleaning up. > This would let the termination fail and lose information in the old > termination(before fail). > {code:java} > I0821 15:16:03.025796 3354800 paths.cpp:745] Creating sandbox > '/opt/tiger/mesos_deploy_videoarch/mesos_zeus/slave/slaves/fb5c1a5b-e106-47c1-9fe3-6ebd311b30ee-S628/frameworks/8e4967e5-736e-4a22-90c3-7b32d526914d-/executors/mt:z03584687:1/runs/a0706ca0-fe2c-4477-8161-329b26ea5d89' > for user 'tiger' > I0821 15:16:03.026199 3354800 paths.cpp:748] Creating sandbox > '/opt/tiger/mesos_deploy_videoarch/mesos_zeus/slave/meta/slaves/fb5c1a5b-e106-47c1-9fe3-6ebd311b30ee-S628/frameworks/8e4967e5-736e-4a22-90c3-7b32d526914d-/executors/mt:z03584687:1/runs/a0706ca0-fe2c-4477-8161-329b26ea5d89' > I0821 15:16:03.026304 3354800 slave.cpp:9064] Launching executor > 'mt:z03584687:1' of framework > 8e4967e5-736e-4a22-90c3-7b32d526914d- with resources > [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}] > in work directory > '/opt/tiger/mesos_deploy_videoarch/mesos_zeus/slave/slaves/fb5c1a5b-e106-47c1-9fe3-6ebd311b30ee-S628/frameworks/8e4967e5-736e-4a22-90c3-7b32d526914d-/executors/mt:z03584687:1/runs/a0706ca0-fe2c-4477-8161-329b26ea5d89' > I0821 15:16:03.051795 3354800 slave.cpp:3520] Launching container > a0706ca0-fe2c-4477-8161-329b26ea5d89 for executor > 'mt:z03584687:1' of 
framework > 8e4967e5-736e-4a22-90c3-7b32d526914d- > I0821 15:16:03.076608 3354807 containerizer.cpp:1325] Starting container > a0706ca0-fe2c-4477-8161-329b26ea5d89 > I0821 15:16:03.076911 3354807 containerizer.cpp:3185] Transitioning the state > of container a0706ca0-fe2c-4477-8161-329b26ea5d89 from PROVISIONING to > PREPARING > I0821 15:16:03.077906 3354802 memory.cpp:478] Started listening for OOM > events for container a0706ca0-fe2c-4477-8161-329b26ea5d89 > I0821 15:16:03.079540 3354804 memory.cpp:198] Updated > 'memory.soft_limit_in_bytes' to 4032MB for container > a0706ca0-fe2c-4477-8161-329b26ea5d89 > I0821 15:16:03.079587 3354820 cpu.cpp:92] Updated 'cpu.shares' to 1126 (cpus > 1.1) for container a0706ca0-fe2c-4477-8161-329b26ea5d89 > I0821 15:16:03.079589 3354804 memory.cpp:227] Updated 'memory.limit_in_bytes' > to 4032MB for container a0706ca0-fe2c-4477-8161-329b26ea5d89 > I0821 15:16:03.080901 3354802 switchboard.cpp:316] Container logger module > finished preparing container a0706ca0-fe2c-4477-8161-329b26ea5d89; > IOSwitchboard server is not required > I0821 15:16:03.081593 3354801 linux_launcher.cpp:492] Launching container > a0706ca0-fe2c-4477-8161-329b26ea5d89 and cloning with namespaces > I0821 15:16:03.083823 3354808 containerizer.cpp:2107] Checkpointing > container's forked pid 1857418 to > '/opt/tiger/mesos_deploy_videoarch/mesos_zeus/slave/meta/slaves/fb5c1a5b-e106-47c1-9fe3-6ebd311b30ee-S628/frameworks/8e4967e5-736e-4a22-90c3-7b32d526914d-/executors/mt:z03584687:1/runs/a0706ca0-fe2c-4477-8161-329b26ea5d89/pids/forked.pid' > I0821 15:16:03.084156 3354808 containerizer.cpp:3185] Transitioning the state > of container a0706ca0-fe2c-4477-8161-329b26ea5d89 from PREPARING to ISOLATING > I0821 15:16:03.091468 3354808 containerizer.cpp:3185] Transitioning the state > of container a0706ca0-fe2c-4477-8161-329b26ea5d89 from ISOLATING to FETCHING > I0821 15:16:03.094933 3354808 containerizer.cpp:3185] Transitioning the state > of container 
a0706ca0-fe2c-4477-8161-329b26ea5d89 from FETCHING to RUNNING > I0821 15:16:03.197753 3354808 memory.cpp:198] Updated > 'memory.soft_limit_in_bytes' to 4032MB for container > a0706ca0-fe2c-4477-8161-329b26ea5d89 > I0821 15:16:03.197757 3354801 cpu.cpp:92] Updated 'cpu.shares' to 1126 (cpus > 1.1) for container a0706ca0-fe2c-4477-8161-329b26ea5d89 > I0821 15:21:39.692978 3354814 memory.cpp:515] OOM detected for container > a0706ca0-fe2c-4477-8161-329b26ea5d89 > I0821 15:21:39.693182 3354805 containerizer.cpp:3044] Container > a0706ca0-fe2c-4477-8161-329b26ea5d89 has reached its limit for resource [] > and will be terminated > I0821 15:21:39.693192 3354805 containerizer.cpp:2518] Destroying
[jira] [Assigned] (MESOS-10223) Crashes on ARM64 due to bad interaction of libunwind with libgcc.
[ https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Natali reassigned MESOS-10223: -- Assignee: Charles Natali Summary: Crashes on ARM64 due to bad interaction of libunwind with libgcc. (was: Test failures on Linux ARM64) Thanks [~mgrigorov] , unfortunately those logs aren't really helpful because they just show that the test hangs, but don't show which test. The actual log for the tests can be obtained by running e.g.: {noformat} # ./bin/mesos-tests.sh --verbose{noformat} Note that I can actually reproduce this hang with master on my machine, so it is very likely unrelated to this problem and not ARM64-specific. I'll try to address in a separate issue. I've created a PR for the original crash: https://github.com/apache/mesos/pull/395 > Crashes on ARM64 due to bad interaction of libunwind with libgcc. > -- > > Key: MESOS-10223 > URL: https://issues.apache.org/jira/browse/MESOS-10223 > Project: Mesos > Issue Type: Bug >Reporter: Martin Tzvetanov Grigorov >Assignee: Charles Natali >Priority: Major > Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, > mesos-on-arm64.tgz, sudo_make_check_output.txt > > > Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors: > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.NumberFormat > [ OK ] JsonTest.NumberFormat (0 ms) > [ RUN ] JsonTest.Find > terminate called after throwing an instance of > 'boost::exception_detail::clone_impl > >' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from > PID 2317; stack trace: *** > @ 0xa80e77fc ([vdso]+0x7fb) > @ 0xa7b71188 gsignal > @ 0xa7b5ddac abort > @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d715b0 __cxa_rethrow > @ 0xa7d737e4 
__gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d71544 __cxa_throw > @ 0xab4ee114 boost::throw_exception<>() > @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>() > @ 0xab5c2228 boost::lexical_cast<>() > @ 0xab5bf89c numify<>() > @ 0xab5e00e8 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5cdd2c JsonTest_Find_Test::TestBody() > @ 0xab886fec > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87f1d4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab85a9d0 testing::Test::Run() > @ 0xab85b258 testing::TestInfo::Run() > @ 0xab85b8d0 testing::TestCase::Run() > @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests() > @ 0xab888440 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87ffd4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab86100c testing::UnitTest::Run() > @ 0xab630950 RUN_ALL_TESTS() > @ 0xab630418 main > @ 0xa7b5e110 __libc_start_main > @ 0xab4b41d4 (unknown) > [FAIL]: 8 shard(s) have failed tests > make[6]: *** [Makefile:2092: check-local] Error 8 > make[6]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[5]: *** [Makefile:1840: check-am] Error 2 > make[5]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[4]: *** [Makefile:1685: check-recursive] Error 1 > make[4]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[3]: *** [Makefile:1842: check] Error 2 > make[3]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[2]: *** [Makefile:1153: check-recursive] Error 1 > make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make[1]: *** [Makefile:1306: check] Error 2 > make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make: *** [Makefile:785: check-recursive] Error 1 > {code} > > {code:java} 
> [--] 3 tests from JsonTest > [ RUN ] JsonTest.InvalidUTF8 > [ OK ] JsonTest.InvalidUTF8 (0 ms) > [ RUN ] JsonTest.ParseError > terminate called after throwing an instance of 'std::overflow_error' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d
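As the comment above notes, a hang can only be diagnosed once it is known which test was running when the output stopped. A minimal sketch for extracting that from a saved verbose log, assuming plain Google Test output with one "[ RUN ]", "[ OK ]" or "[ FAILED ]" marker per line and no shard prefix; the function name unfinished_tests is made up for this illustration.

```shell
#!/bin/sh
# List tests that printed "[ RUN ]" but never "[ OK ]" or "[ FAILED ]".
# ASSUMPTION: plain Google Test output, one marker per line, no shard prefix.
#   $1: path to the saved test log.
unfinished_tests() {
    awk '
        # The test name is the 4th field: "[", "RUN", "]", "Suite.Name".
        # Failure lines for parameterized tests add a trailing comma.
        { name = $4; sub(/,$/, "", name) }
        /^\[ RUN +\]/          { started[name] = 1 }
        /^\[ +(OK|FAILED) +\]/ { delete started[name] }
        END { for (name in started) print name }
    ' "$1"
}
```

If the log was captured up to the point of the hang, the names printed are the tests that started and never finished, which are the candidates for the hang.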
[jira] [Comment Edited] (MESOS-10223) Test failures on Linux ARM64
[ https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368370#comment-17368370 ] Charles Natali edited comment on MESOS-10223 at 6/23/21, 6:00 PM: -- bq. After running for 3 hours make check failed on two shards with: Yeah this error is fine and unrelated. Did the root one finish? was (Author: cf.natali): Yeah this error is fine and unrelated. Did the root one finish? On Wed, 23 Jun 2021, 14:23 Martin Tzvetanov Grigorov (Jira), < > Test failures on Linux ARM64 > > > Key: MESOS-10223 > URL: https://issues.apache.org/jira/browse/MESOS-10223 > Project: Mesos > Issue Type: Bug >Reporter: Martin Tzvetanov Grigorov >Priority: Major > Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, > mesos-on-arm64.tgz > > > Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors: > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.NumberFormat > [ OK ] JsonTest.NumberFormat (0 ms) > [ RUN ] JsonTest.Find > terminate called after throwing an instance of > 'boost::exception_detail::clone_impl > >' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from > PID 2317; stack trace: *** > @ 0xa80e77fc ([vdso]+0x7fb) > @ 0xa7b71188 gsignal > @ 0xa7b5ddac abort > @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d715b0 __cxa_rethrow > @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d71544 __cxa_throw > @ 0xab4ee114 boost::throw_exception<>() > @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>() > @ 0xab5c2228 boost::lexical_cast<>() > @ 0xab5bf89c numify<>() > @ 0xab5e00e8 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 
0xab5cdd2c JsonTest_Find_Test::TestBody() > @ 0xab886fec > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87f1d4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab85a9d0 testing::Test::Run() > @ 0xab85b258 testing::TestInfo::Run() > @ 0xab85b8d0 testing::TestCase::Run() > @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests() > @ 0xab888440 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87ffd4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab86100c testing::UnitTest::Run() > @ 0xab630950 RUN_ALL_TESTS() > @ 0xab630418 main > @ 0xa7b5e110 __libc_start_main > @ 0xab4b41d4 (unknown) > [FAIL]: 8 shard(s) have failed tests > make[6]: *** [Makefile:2092: check-local] Error 8 > make[6]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[5]: *** [Makefile:1840: check-am] Error 2 > make[5]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[4]: *** [Makefile:1685: check-recursive] Error 1 > make[4]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[3]: *** [Makefile:1842: check] Error 2 > make[3]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[2]: *** [Makefile:1153: check-recursive] Error 1 > make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make[1]: *** [Makefile:1306: check] Error 2 > make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make: *** [Makefile:785: check-recursive] Error 1 > {code} > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.InvalidUTF8 > [ OK ] JsonTest.InvalidUTF8 (0 ms) > [ RUN ] JsonTest.ParseError > terminate called after throwing an instance of 'std::overflow_error' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 
0x918cf010) from > PID 2316; stack trace: *** > @ 0x918dd7fc ([vdso]+0x7fb) > @ 0x91367188 gsignal > @ 0x91353dac abort > @ 0x91569848 __gnu_cxx::__verbose_terminate_handler() > @ 0x915671ec (unknown) > @ 0x91567250 std::terminate() > @
[jira] [Comment Edited] (MESOS-10223) Test failures on Linux ARM64
[ https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368371#comment-17368371 ] Charles Natali edited comment on MESOS-10223 at 6/23/21, 6:00 PM: -- {quote}I may found another issue today. I tried to build Mesos with *make -j $(nproc)*, i.e. 8, and it failed with: {quote} You're probably running out of memory when the build parallelism is too high, the compilation is quite memory intensive. was (Author: cf.natali): You're probably running out of memory when the build parallelism is too high, the compilation is quite memory intensive. On Wed, 23 Jun 2021, 11:32 Martin Tzvetanov Grigorov (Jira), < > Test failures on Linux ARM64 > > > Key: MESOS-10223 > URL: https://issues.apache.org/jira/browse/MESOS-10223 > Project: Mesos > Issue Type: Bug >Reporter: Martin Tzvetanov Grigorov >Priority: Major > Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, > mesos-on-arm64.tgz > > > Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors: > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.NumberFormat > [ OK ] JsonTest.NumberFormat (0 ms) > [ RUN ] JsonTest.Find > terminate called after throwing an instance of > 'boost::exception_detail::clone_impl > >' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from > PID 2317; stack trace: *** > @ 0xa80e77fc ([vdso]+0x7fb) > @ 0xa7b71188 gsignal > @ 0xa7b5ddac abort > @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d715b0 __cxa_rethrow > @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d71544 __cxa_throw > @ 0xab4ee114 boost::throw_exception<>() > @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>() > @ 0xab5c2228 
boost::lexical_cast<>() > @ 0xab5bf89c numify<>() > @ 0xab5e00e8 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5cdd2c JsonTest_Find_Test::TestBody() > @ 0xab886fec > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87f1d4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab85a9d0 testing::Test::Run() > @ 0xab85b258 testing::TestInfo::Run() > @ 0xab85b8d0 testing::TestCase::Run() > @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests() > @ 0xab888440 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87ffd4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab86100c testing::UnitTest::Run() > @ 0xab630950 RUN_ALL_TESTS() > @ 0xab630418 main > @ 0xa7b5e110 __libc_start_main > @ 0xab4b41d4 (unknown) > [FAIL]: 8 shard(s) have failed tests > make[6]: *** [Makefile:2092: check-local] Error 8 > make[6]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[5]: *** [Makefile:1840: check-am] Error 2 > make[5]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[4]: *** [Makefile:1685: check-recursive] Error 1 > make[4]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[3]: *** [Makefile:1842: check] Error 2 > make[3]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[2]: *** [Makefile:1153: check-recursive] Error 1 > make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make[1]: *** [Makefile:1306: check] Error 2 > make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make: *** [Makefile:785: check-recursive] Error 1 > {code} > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.InvalidUTF8 > [ OK ] JsonTest.InvalidUTF8 (0 ms) > [ RUN ] JsonTest.ParseError > terminate called after throwing an instance of 'std::overflow_error' > terminate called recursively > *** Aborted 
at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from > PID 2316; stack trace: *** > @ 0x918dd7fc ([vdso]+0x7fb) > @ 0x91367188 gsignal > @
[jira] [Commented] (MESOS-10224) [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.
[ https://issues.apache.org/jira/browse/MESOS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368386#comment-17368386 ] Charles Natali commented on MESOS-10224: I'd go for the last option, i.e. return error only if the data pointer is past the end of the buffer. > What are your thoughts? All of the above are quick adjustments but they > weaken the original checks. Yes but it's fine, since the original check doesn't work anymore :). > [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails. > - > > Key: MESOS-10224 > URL: https://issues.apache.org/jira/browse/MESOS-10224 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.11.0 >Reporter: Saad Ur Rahman >Priority: Major > Attachments: ld.so.cache > > > *OS:* Ubuntu 21.04 > *Command:* > {code:java} > make -j 6 V=0 check{code} > Fails during the build and test suite run on two different machines with the > same OS. > {code:java} > 3: [ OK ] CSIVersion/StorageLocalResourceProviderTest.Update/v0 (479 ms) > 3: [--] 14 tests from CSIVersion/StorageLocalResourceProviderTest > (27011 ms total) > 3: > 3: [--] Global test environment tear-down > 3: [==] 575 tests from 178 test cases ran. (202572 ms total) > 3: [ PASSED ] 573 tests. > 3: [ FAILED ] 2 tests, listed below: > 3: [ FAILED ] LdcacheTest.Parse > 3: [ FAILED ] > CSIVersion/StorageLocalResourceProviderTest.OperationUpdate/v0, where > GetParam() = "v0" > 3: > 3: 2 FAILED TESTS > 3: YOU HAVE 34 DISABLED TESTS > 3: > 3: > 3: > 3: [FAIL]: 4 shard(s) have failed tests > 3/3 Test #3: MesosTests ...***Failed 1173.43 sec > {code} > Are there any pre-requisites required to get the build/tests to pass? I am > trying to get all the tests to pass to make sure my build environment is > setup correctly for development. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10223) Test failures on Linux ARM64
[ https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368370#comment-17368370 ] Charles Natali commented on MESOS-10223: Yeah, this error is fine and unrelated. Did the run as root finish? > Test failures on Linux ARM64 > > > Key: MESOS-10223 > URL: https://issues.apache.org/jira/browse/MESOS-10223 > Project: Mesos > Issue Type: Bug >Reporter: Martin Tzvetanov Grigorov >Priority: Major > Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, > mesos-on-arm64.tgz > > > Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors: > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.NumberFormat > [ OK ] JsonTest.NumberFormat (0 ms) > [ RUN ] JsonTest.Find > terminate called after throwing an instance of > 'boost::exception_detail::clone_impl > >' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from > PID 2317; stack trace: *** > @ 0xa80e77fc ([vdso]+0x7fb) > @ 0xa7b71188 gsignal > @ 0xa7b5ddac abort > @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d715b0 __cxa_rethrow > @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d71544 __cxa_throw > @ 0xab4ee114 boost::throw_exception<>() > @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>() > @ 0xab5c2228 boost::lexical_cast<>() > @ 0xab5bf89c numify<>() > @ 0xab5e00e8 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5cdd2c JsonTest_Find_Test::TestBody() > @ 0xab886fec > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87f1d4 >
testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab85a9d0 testing::Test::Run() > @ 0xab85b258 testing::TestInfo::Run() > @ 0xab85b8d0 testing::TestCase::Run() > @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests() > @ 0xab888440 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87ffd4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab86100c testing::UnitTest::Run() > @ 0xab630950 RUN_ALL_TESTS() > @ 0xab630418 main > @ 0xa7b5e110 __libc_start_main > @ 0xab4b41d4 (unknown) > [FAIL]: 8 shard(s) have failed tests > make[6]: *** [Makefile:2092: check-local] Error 8 > make[6]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[5]: *** [Makefile:1840: check-am] Error 2 > make[5]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[4]: *** [Makefile:1685: check-recursive] Error 1 > make[4]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[3]: *** [Makefile:1842: check] Error 2 > make[3]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[2]: *** [Makefile:1153: check-recursive] Error 1 > make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make[1]: *** [Makefile:1306: check] Error 2 > make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make: *** [Makefile:785: check-recursive] Error 1 > {code} > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.InvalidUTF8 > [ OK ] JsonTest.InvalidUTF8 (0 ms) > [ RUN ] JsonTest.ParseError > terminate called after throwing an instance of 'std::overflow_error' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from > PID 2316; stack trace: *** > @ 0x918dd7fc ([vdso]+0x7fb) > @ 0x91367188 gsignal > @ 0x91353dac abort > @ 0x91569848 
__gnu_cxx::__verbose_terminate_handler() > @ 0x915671ec (unknown) > @ 0x91567250 std::terminate() > @ 0x915675b0 __cxa_rethrow > @ 0x915697e4 __gnu_cxx::__verbose_terminate_handler() > @ 0x915671ec (unknown) > @ 0x91567250 std::terminate() > @ 0x91567544
[jira] [Commented] (MESOS-10223) Test failures on Linux ARM64
[ https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368371#comment-17368371 ] Charles Natali commented on MESOS-10223: You're probably running out of memory because the build parallelism is too high; the compilation is quite memory-intensive. > Test failures on Linux ARM64 > > > Key: MESOS-10223 > URL: https://issues.apache.org/jira/browse/MESOS-10223 > Project: Mesos > Issue Type: Bug >Reporter: Martin Tzvetanov Grigorov >Priority: Major > Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, > mesos-on-arm64.tgz > > > Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors: > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.NumberFormat > [ OK ] JsonTest.NumberFormat (0 ms) > [ RUN ] JsonTest.Find > terminate called after throwing an instance of > 'boost::exception_detail::clone_impl > >' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from > PID 2317; stack trace: *** > @ 0xa80e77fc ([vdso]+0x7fb) > @ 0xa7b71188 gsignal > @ 0xa7b5ddac abort > @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d715b0 __cxa_rethrow > @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d71544 __cxa_throw > @ 0xab4ee114 boost::throw_exception<>() > @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>() > @ 0xab5c2228 boost::lexical_cast<>() > @ 0xab5bf89c numify<>() > @ 0xab5e00e8 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5cdd2c JsonTest_Find_Test::TestBody() > @ 0xab886fec > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87f1d4
> testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab85a9d0 testing::Test::Run() > @ 0xab85b258 testing::TestInfo::Run() > @ 0xab85b8d0 testing::TestCase::Run() > @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests() > @ 0xab888440 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87ffd4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab86100c testing::UnitTest::Run() > @ 0xab630950 RUN_ALL_TESTS() > @ 0xab630418 main > @ 0xa7b5e110 __libc_start_main > @ 0xab4b41d4 (unknown) > [FAIL]: 8 shard(s) have failed tests > make[6]: *** [Makefile:2092: check-local] Error 8 > make[6]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[5]: *** [Makefile:1840: check-am] Error 2 > make[5]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[4]: *** [Makefile:1685: check-recursive] Error 1 > make[4]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[3]: *** [Makefile:1842: check] Error 2 > make[3]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[2]: *** [Makefile:1153: check-recursive] Error 1 > make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make[1]: *** [Makefile:1306: check] Error 2 > make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make: *** [Makefile:785: check-recursive] Error 1 > {code} > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.InvalidUTF8 > [ OK ] JsonTest.InvalidUTF8 (0 ms) > [ RUN ] JsonTest.ParseError > terminate called after throwing an instance of 'std::overflow_error' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from > PID 2316; stack trace: *** > @ 0x918dd7fc ([vdso]+0x7fb) > @ 0x91367188 gsignal > @ 0x91353dac abort > @ 0x91569848 
__gnu_cxx::__verbose_terminate_handler() > @ 0x915671ec (unknown) > @ 0x91567250 std::terminate() > @ 0x915675b0 __cxa_rethrow > @ 0x915697e4 __gnu_cxx::__verbose_terminate_handler() > @ 0x915671ec (unknown) > @
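The memory point made in the MESOS-10223 comment above can be turned into a rough rule of thumb for choosing the `make -j` value. The following is only a sketch: the function names are made up for illustration, and the figure of 2 GiB of memory per compile job is an assumption, not a number taken from the Mesos documentation or from this thread.

```python
def read_mem_available_kib(path="/proc/meminfo"):
    """Return the MemAvailable value (in KiB) from a Linux meminfo file."""
    with open(path) as f:
        for line in f:
            if line.startswith("MemAvailable:"):
                # The line looks like: "MemAvailable:   12345678 kB"
                return int(line.split()[1])
    raise RuntimeError("MemAvailable not found in " + path)


def suggested_jobs(mem_available_kib, cpus, kib_per_job=2 * 1024 * 1024):
    """Cap the job count by both CPU count and available memory.

    kib_per_job is an assumed budget (2 GiB here), to be tuned locally.
    """
    by_memory = mem_available_kib // kib_per_job
    return max(1, min(cpus, by_memory))
```

On a Linux host, `suggested_jobs(read_mem_available_kib(), os.cpu_count() or 1)` (after `import os`) gives a value to pass to `make -j`; if the build still gets killed, lowering the value further is the safer choice.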
[jira] [Commented] (MESOS-10224) [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.
[ https://issues.apache.org/jira/browse/MESOS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367602#comment-17367602 ] Charles Natali commented on MESOS-10224: Actually [~surahman] you should go ahead, it's a nice and easy fix! The problematic code is here: https://github.com/apache/mesos/blob/master/src/linux/ldcache.cpp#L227 The code expects that the file ends after the last entry, whereas in your case it's not true since there's this description string at the end of the file. > [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails. > - > > Key: MESOS-10224 > URL: https://issues.apache.org/jira/browse/MESOS-10224 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.11.0 >Reporter: Saad Ur Rahman >Priority: Major > Attachments: ld.so.cache > > > *OS:* Ubuntu 21.04 > *Command:* > {code:java} > make -j 6 V=0 check{code} > Fails during the build and test suite run on two different machines with the > same OS. > {code:java} > 3: [ OK ] CSIVersion/StorageLocalResourceProviderTest.Update/v0 (479 ms) > 3: [--] 14 tests from CSIVersion/StorageLocalResourceProviderTest > (27011 ms total) > 3: > 3: [--] Global test environment tear-down > 3: [==] 575 tests from 178 test cases ran. (202572 ms total) > 3: [ PASSED ] 573 tests. > 3: [ FAILED ] 2 tests, listed below: > 3: [ FAILED ] LdcacheTest.Parse > 3: [ FAILED ] > CSIVersion/StorageLocalResourceProviderTest.OperationUpdate/v0, where > GetParam() = "v0" > 3: > 3: 2 FAILED TESTS > 3: YOU HAVE 34 DISABLED TESTS > 3: > 3: > 3: > 3: [FAIL]: 4 shard(s) have failed tests > 3/3 Test #3: MesosTests ...***Failed 1173.43 sec > {code} > Are there any pre-requisites required to get the build/tests to pass? I am > trying to get all the tests to pass to make sure my build environment is > setup correctly for development. -- This message was sent by Atlassian Jira (v8.3.4#803005)
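The mismatch described in the MESOS-10224 comment above can be illustrated with a small model. This is not the Mesos parser (that one is C++, in src/linux/ldcache.cpp) and it does not use the real ld.so.cache layout: the toy format below (one count byte followed by NUL-terminated strings) and the names `parse_strict` and `parse_relaxed` are hypothetical. The sketch only contrasts a check that requires the data to end right after the last entry with one that fails only when an entry would run past the end of the buffer.

```python
def _read_entries(buf):
    """Read buf[0] NUL-terminated strings starting at offset 1.

    Toy layout for illustration only, not the real ld.so.cache format.
    Returns (entries, offset just past the last entry).
    """
    if not buf:
        raise ValueError("empty buffer")
    count, pos, entries = buf[0], 1, []
    for _ in range(count):
        end = buf.find(b"\0", pos)
        if end == -1:
            # No terminator left: the entry would run past the buffer end.
            raise ValueError("entry runs past the end of the buffer")
        entries.append(buf[pos:end].decode("ascii"))
        pos = end + 1
    return entries, pos


def parse_strict(buf):
    """Reject anything that follows the last entry."""
    entries, pos = _read_entries(buf)
    if pos != len(buf):
        raise ValueError("unexpected data after the last entry")
    return entries


def parse_relaxed(buf):
    """Accept trailing data; only out-of-bounds entries are errors."""
    entries, _ = _read_entries(buf)
    return entries
```

With a trailing generator string appended to otherwise valid data, `parse_strict` raises while `parse_relaxed` returns the same entries as before; truncated data is still rejected by both.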
[jira] [Commented] (MESOS-10224) [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.
[ https://issues.apache.org/jira/browse/MESOS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367584#comment-17367584 ] Charles Natali commented on MESOS-10224:

Ah, here's the problem: it looks like Ubuntu adds extra data at the end of the cache. Let's look at the end of the file.

Mine (Debian) ends with an entry and then the NUL byte:

{noformat}
cf@thinkpad:~/src/mesos$ hexdump -C /etc/ld.so.cache.default | tail
00014d20  36 00 2f 75 73 72 2f 6c  69 62 2f 6c 69 62 42 4c  |6./usr/lib/libBL|
00014d30  54 6c 69 74 65 2e 32 2e  35 2e 73 6f 2e 38 2e 36  |Tlite.2.5.so.8.6|
00014d40  00 6c 69 62 42 4c 54 2e  32 2e 35 2e 73 6f 2e 38  |.libBLT.2.5.so.8|
00014d50  2e 36 00 2f 75 73 72 2f  6c 69 62 2f 6c 69 62 42  |.6./usr/lib/libB|
00014d60  4c 54 2e 32 2e 35 2e 73  6f 2e 38 2e 36 00 6c 64  |LT.2.5.so.8.6.ld|
00014d70  2d 6c 69 6e 75 78 2d 78  38 36 2d 36 34 2e 73 6f  |-linux-x86-64.so|
00014d80  2e 32 00 2f 6c 69 62 2f  78 38 36 5f 36 34 2d 6c  |.2./lib/x86_64-l|
00014d90  69 6e 75 78 2d 67 6e 75  2f 6c 64 2d 6c 69 6e 75  |inux-gnu/ld-linu|
00014da0  78 2d 78 38 36 2d 36 34  2e 73 6f 2e 32 00        |x-x86-64.so.2.|
00014dae
{noformat}

Yours ends with some extra strings after the last entry:

{noformat}
cf@thinkpad:~/src/mesos$ hexdump -C /etc/ld.so.cache | tail
000130c0  6f 2e 30 2e 30 00 2f 6c  69 62 2f 78 38 36 5f 36  |o.0.0./lib/x86_6|
000130d0  34 2d 6c 69 6e 75 78 2d  67 6e 75 2f 6c 69 62 67  |4-linux-gnu/libg|
000130e0  63 69 2d 31 2e 73 6f 2e  30 2e 30 2e 30 00 00 00  |ci-1.so.0.0.0...|
000130f0  74 21 a4 ea 01 00 00 00  00 00 00 00 00 00 00 00  |t!..............|
00013100  08 31 01 00 42 00 00 00  6c 64 63 6f 6e 66 69 67  |.1..B...ldconfig|
00013110  20 28 55 62 75 6e 74 75  20 47 4c 49 42 43 20 32  | (Ubuntu GLIBC 2|
00013120  2e 33 33 2d 30 75 62 75  6e 74 75 35 29 20 72 65  |.33-0ubuntu5) re|
00013130  6c 65 61 73 65 20 72 65  6c 65 61 73 65 20 76 65  |lease release ve|
00013140  72 73 69 6f 6e 20 32 2e  33 33                    |rsion 2.33|
0001314a
{noformat}

Trivial to fix, give me a minute...
> [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails. > - > > Key: MESOS-10224 > URL: https://issues.apache.org/jira/browse/MESOS-10224 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.11.0 >Reporter: Saad Ur Rahman >Priority: Major > Attachments: ld.so.cache > > > *OS:* Ubuntu 21.04 > *Command:* > {code:java} > make -j 6 V=0 check{code} > Fails during the build and test suite run on two different machines with the > same OS. > {code:java} > 3: [ OK ] CSIVersion/StorageLocalResourceProviderTest.Update/v0 (479 ms) > 3: [--] 14 tests from CSIVersion/StorageLocalResourceProviderTest > (27011 ms total) > 3: > 3: [--] Global test environment tear-down > 3: [==] 575 tests from 178 test cases ran. (202572 ms total) > 3: [ PASSED ] 573 tests. > 3: [ FAILED ] 2 tests, listed below: > 3: [ FAILED ] LdcacheTest.Parse > 3: [ FAILED ] > CSIVersion/StorageLocalResourceProviderTest.OperationUpdate/v0, where > GetParam() = "v0" > 3: > 3: 2 FAILED TESTS > 3: YOU HAVE 34 DISABLED TESTS > 3: > 3: > 3: > 3: [FAIL]: 4 shard(s) have failed tests > 3/3 Test #3: MesosTests ...***Failed 1173.43 sec > {code} > Are there any pre-requisites required to get the build/tests to pass? I am > trying to get all the tests to pass to make sure my build environment is > setup correctly for development. -- This message was sent by Atlassian Jira (v8.3.4#803005)
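For readers who want to repeat the inspection from the comment above without `hexdump`, here is a minimal sketch that prints the tail of a file in a similar offset / hex / ASCII layout. The function names and the default of 160 bytes are arbitrary choices for illustration, and the output is only meant to resemble `hexdump -C`, not to match it byte for byte in every corner case.

```python
def hexdump_lines(data, start=0):
    """Yield lines in a layout similar to `hexdump -C`."""
    for off in range(0, len(data), 16):
        chunk = data[off:off + 16]
        left = " ".join(f"{b:02x}" for b in chunk[:8])
        right = " ".join(f"{b:02x}" for b in chunk[8:])
        # Non-printable bytes are shown as '.', as hexdump does.
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        yield f"{start + off:08x}  {left:<23}  {right:<23}  |{text}|"


def tail_dump(path, nbytes=160):
    """Return hexdump-style lines for roughly the last nbytes of a file."""
    with open(path, "rb") as f:
        f.seek(0, 2)               # seek to the end to learn the size
        size = f.tell()
        start = max(0, size - nbytes)
        start -= start % 16        # align to a 16-byte row boundary
        f.seek(start)
        data = f.read()
    return list(hexdump_lines(data, start))
```

Calling `print("\n".join(tail_dump("/etc/ld.so.cache")))` on an affected machine should make any text that follows the last library path easy to spot.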
[jira] [Commented] (MESOS-10224) [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.
[ https://issues.apache.org/jira/browse/MESOS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367579#comment-17367579 ] Charles Natali commented on MESOS-10224: No, it really should work; it's a bit strange. Possibly there's something special about your cache, but it looks valid since I can parse it using {{ldconfig -p}}. Hopefully it shouldn't be too difficult to fix. > [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails. > - > > Key: MESOS-10224 > URL: https://issues.apache.org/jira/browse/MESOS-10224 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.11.0 >Reporter: Saad Ur Rahman >Priority: Major > Attachments: ld.so.cache > > > *OS:* Ubuntu 21.04 > *Command:* > {code:java} > make -j 6 V=0 check{code} > Fails during the build and test suite run on two different machines with the > same OS. > {code:java} > 3: [ OK ] CSIVersion/StorageLocalResourceProviderTest.Update/v0 (479 ms) > 3: [--] 14 tests from CSIVersion/StorageLocalResourceProviderTest > (27011 ms total) > 3: > 3: [--] Global test environment tear-down > 3: [==] 575 tests from 178 test cases ran. (202572 ms total) > 3: [ PASSED ] 573 tests. > 3: [ FAILED ] 2 tests, listed below: > 3: [ FAILED ] LdcacheTest.Parse > 3: [ FAILED ] > CSIVersion/StorageLocalResourceProviderTest.OperationUpdate/v0, where > GetParam() = "v0" > 3: > 3: 2 FAILED TESTS > 3: YOU HAVE 34 DISABLED TESTS > 3: > 3: > 3: > 3: [FAIL]: 4 shard(s) have failed tests > 3/3 Test #3: MesosTests ...***Failed 1173.43 sec > {code} > Are there any pre-requisites required to get the build/tests to pass? I am > trying to get all the tests to pass to make sure my build environment is > setup correctly for development. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10224) [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.
[ https://issues.apache.org/jira/browse/MESOS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367572#comment-17367572 ] Charles Natali commented on MESOS-10224: Interesting, I can reproduce it - I'll have a look. > [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails. > - > > Key: MESOS-10224 > URL: https://issues.apache.org/jira/browse/MESOS-10224 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.11.0 >Reporter: Saad Ur Rahman >Priority: Major > Attachments: ld.so.cache > > > *OS:* Ubuntu 21.04 > *Command:* > {code:java} > make -j 6 V=0 check{code} > Fails during the build and test suite run on two different machines with the > same OS. > {code:java} > 3: [ OK ] CSIVersion/StorageLocalResourceProviderTest.Update/v0 (479 ms) > 3: [--] 14 tests from CSIVersion/StorageLocalResourceProviderTest > (27011 ms total) > 3: > 3: [--] Global test environment tear-down > 3: [==] 575 tests from 178 test cases ran. (202572 ms total) > 3: [ PASSED ] 573 tests. > 3: [ FAILED ] 2 tests, listed below: > 3: [ FAILED ] LdcacheTest.Parse > 3: [ FAILED ] > CSIVersion/StorageLocalResourceProviderTest.OperationUpdate/v0, where > GetParam() = "v0" > 3: > 3: 2 FAILED TESTS > 3: YOU HAVE 34 DISABLED TESTS > 3: > 3: > 3: > 3: [FAIL]: 4 shard(s) have failed tests > 3/3 Test #3: MesosTests ...***Failed 1173.43 sec > {code} > Are there any pre-requisites required to get the build/tests to pass? I am > trying to get all the tests to pass to make sure my build environment is > setup correctly for development. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10223) Test failures on Linux ARM64
[ https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367559#comment-17367559 ] Charles Natali commented on MESOS-10223: By the way, the reason for running it as root is that many tests are only run as root (e.g. tests which need cgroups etc), so it'd be nice to make sure they pass. > Test failures on Linux ARM64 > > > Key: MESOS-10223 > URL: https://issues.apache.org/jira/browse/MESOS-10223 > Project: Mesos > Issue Type: Bug >Reporter: Martin Tzvetanov Grigorov >Priority: Major > Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, > mesos-on-arm64.tgz > > > Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors: > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.NumberFormat > [ OK ] JsonTest.NumberFormat (0 ms) > [ RUN ] JsonTest.Find > terminate called after throwing an instance of > 'boost::exception_detail::clone_impl > >' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from > PID 2317; stack trace: *** > @ 0xa80e77fc ([vdso]+0x7fb) > @ 0xa7b71188 gsignal > @ 0xa7b5ddac abort > @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d715b0 __cxa_rethrow > @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d71544 __cxa_throw > @ 0xab4ee114 boost::throw_exception<>() > @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>() > @ 0xab5c2228 boost::lexical_cast<>() > @ 0xab5bf89c numify<>() > @ 0xab5e00e8 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5cdd2c JsonTest_Find_Test::TestBody() > @ 0xab886fec > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87f1d4 > 
testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab85a9d0 testing::Test::Run() > @ 0xab85b258 testing::TestInfo::Run() > @ 0xab85b8d0 testing::TestCase::Run() > @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests() > @ 0xab888440 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87ffd4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab86100c testing::UnitTest::Run() > @ 0xab630950 RUN_ALL_TESTS() > @ 0xab630418 main > @ 0xa7b5e110 __libc_start_main > @ 0xab4b41d4 (unknown) > [FAIL]: 8 shard(s) have failed tests > make[6]: *** [Makefile:2092: check-local] Error 8 > make[6]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[5]: *** [Makefile:1840: check-am] Error 2 > make[5]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[4]: *** [Makefile:1685: check-recursive] Error 1 > make[4]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[3]: *** [Makefile:1842: check] Error 2 > make[3]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[2]: *** [Makefile:1153: check-recursive] Error 1 > make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make[1]: *** [Makefile:1306: check] Error 2 > make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make: *** [Makefile:785: check-recursive] Error 1 > {code} > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.InvalidUTF8 > [ OK ] JsonTest.InvalidUTF8 (0 ms) > [ RUN ] JsonTest.ParseError > terminate called after throwing an instance of 'std::overflow_error' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from > PID 2316; stack trace: *** > @ 0x918dd7fc ([vdso]+0x7fb) > @ 0x91367188 gsignal > @ 0x91353dac abort > @ 0x91569848 
__gnu_cxx::__verbose_terminate_handler() > @ 0x915671ec (unknown) > @ 0x91567250 std::terminate() > @ 0x915675b0 __cxa_rethrow > @ 0x915697e4 __gnu_cxx::__verbose_terminate_handler() > @ 0x915671ec (unknown) > @ 0x91567250 std::terminate() >
[jira] [Commented] (MESOS-10223) Test failures on Linux ARM64
[ https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367554#comment-17367554 ] Charles Natali commented on MESOS-10223: Hey [~mgrigorov], the attached patch should fix the issue. I ran the whole test suite and it pretty much passed; however, it would be great if you could run it as root with the attached patch, just to make sure: [^0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch] There might be some unrelated/transient errors, but I'm just interested in confirming that this problem is fixed. Thanks! > Test failures on Linux ARM64 > > > Key: MESOS-10223 > URL: https://issues.apache.org/jira/browse/MESOS-10223 > Project: Mesos > Issue Type: Bug >Reporter: Martin Tzvetanov Grigorov >Priority: Major > Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, > mesos-on-arm64.tgz > > > Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors: > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.NumberFormat > [ OK ] JsonTest.NumberFormat (0 ms) > [ RUN ] JsonTest.Find > terminate called after throwing an instance of > 'boost::exception_detail::clone_impl > >' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from > PID 2317; stack trace: *** > @ 0xa80e77fc ([vdso]+0x7fb) > @ 0xa7b71188 gsignal > @ 0xa7b5ddac abort > @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d715b0 __cxa_rethrow > @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d71544 __cxa_throw > @ 0xab4ee114 boost::throw_exception<>() > @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>() > @ 0xab5c2228 boost::lexical_cast<>() > @ 0xab5bf89c numify<>() > @ 0xab5e00e8 JSON::Object::find<>() > @ 0xab5e0584
JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5cdd2c JsonTest_Find_Test::TestBody() > @ 0xab886fec > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87f1d4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab85a9d0 testing::Test::Run() > @ 0xab85b258 testing::TestInfo::Run() > @ 0xab85b8d0 testing::TestCase::Run() > @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests() > @ 0xab888440 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87ffd4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab86100c testing::UnitTest::Run() > @ 0xab630950 RUN_ALL_TESTS() > @ 0xab630418 main > @ 0xa7b5e110 __libc_start_main > @ 0xab4b41d4 (unknown) > [FAIL]: 8 shard(s) have failed tests > make[6]: *** [Makefile:2092: check-local] Error 8 > make[6]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[5]: *** [Makefile:1840: check-am] Error 2 > make[5]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[4]: *** [Makefile:1685: check-recursive] Error 1 > make[4]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[3]: *** [Makefile:1842: check] Error 2 > make[3]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[2]: *** [Makefile:1153: check-recursive] Error 1 > make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make[1]: *** [Makefile:1306: check] Error 2 > make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make: *** [Makefile:785: check-recursive] Error 1 > {code} > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.InvalidUTF8 > [ OK ] JsonTest.InvalidUTF8 (0 ms) > [ RUN ] JsonTest.ParseError > terminate called after throwing an instance of 'std::overflow_error' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 
(unknown) > *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from > PID 2316; stack trace: *** > @ 0x918dd7fc ([vdso]+0x7fb) > @ 0x91367188 gsignal > @ 0x91353dac abort > @ 0x91569848 __gnu_cxx::__verbose_terminate_handler() > @ 0x915671ec (unknown) > @
[jira] [Comment Edited] (MESOS-10223) Test failures on Linux ARM64
[ https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367092#comment-17367092 ] Charles Natali edited comment on MESOS-10223 at 6/22/21, 7:51 AM: --

{quote}I experienced the errors both on real ARM64 host and with Docker. The problem with strace is not related to QEMU. It is a Docker thingy. You need to add a capability for it: {{docker run --cap-add=SYS_PTRACE}} ... {quote}

Hm, I don't think so. It's not a capability/seccomp issue: if you look at the error, it's {{ENOSYS}}, not {{EPERM}}. Just to show you:

{noformat}
root@thinkpad:/home/cf/mesos-on-arm64# docker run --cap-add=SYS_PTRACE --rm -it -v $PWD/mesos:/mesos build-mesos-on-arm64 bash
WARNING: The requested image's platform (linux/arm64) does not match the detected host platform (linux/amd64) and no specific platform was requested
root@4d45b9e91754:/mesos# apt install strace
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following NEW packages will be installed:
  strace
0 upgraded, 1 newly installed, 0 to remove and 6 not upgraded.
Need to get 297 kB of archives.
After this operation, 1336 kB of additional disk space will be used.
Get:1 http://ports.ubuntu.com/ubuntu-ports focal-updates/main arm64 strace arm64 5.5-3ubuntu1 [297 kB]
Fetched 297 kB in 1s (327 kB/s)
Selecting previously unselected package strace.
(Reading database ... 18530 files and directories currently installed.)
Preparing to unpack .../strace_5.5-3ubuntu1_arm64.deb ...
Unpacking strace (5.5-3ubuntu1) ...
Setting up strace (5.5-3ubuntu1) ...
root@4d45b9e91754:/mesos# strace ls
/usr/bin/strace: test_ptrace_get_syscall_info: PTRACE_TRACEME: Function not implemented
/usr/bin/strace: ptrace(PTRACE_TRACEME, ...): Function not implemented
/usr/bin/strace: PTRACE_SETOPTIONS: Function not implemented
/usr/bin/strace: detach: waitpid(115): No child processes
/usr/bin/strace: Process 115 detached
{noformat}

Did you test this docker+qemu image from a non-AMD64 host?

> I will send you privately credentials to my ARM64 VM where you can debug it without Docker!

Thanks, that'd be much easier.

> Test failures on Linux ARM64 > > > Key: MESOS-10223 > URL: https://issues.apache.org/jira/browse/MESOS-10223 > Project: Mesos > Issue Type: Bug >Reporter: Martin Tzvetanov Grigorov >Priority: Major > Attachments: mesos-on-arm64.tgz > > > Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors: > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.NumberFormat > [ OK ] JsonTest.NumberFormat (0 ms) > [ RUN ] JsonTest.Find > terminate called after throwing an instance of > 'boost::exception_detail::clone_impl > >' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d
[jira] [Commented] (MESOS-10223) Test failures on Linux ARM64
[ https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367092#comment-17367092 ] Charles Natali commented on MESOS-10223: {quote}I experienced the errors both on real ARM64 host and with Docker. The problem with strace is not related to QEMU. It is a Docker thingy. You need to add a capability for it: {{docker run --cap-add=SYS_PTRACE}} ... {quote} Hm, I don't think so, it's not a capability/seccomp issue: if you look at the error it's {{ENOSYS}} not {{EPERM}}. Just to show you: {noformat} root@thinkpad:/home/cf/mesos-on-arm64# docker run --cap-add=SYS_PTRACE --rm -it -v $PWD/mesos:/mesos build-mesos-on-arm64 bash WARNING: The requested image's platform (linux/arm64) does not match the detected host platform (linux/amd64) and no specific platform was requested root@4d45b9e91754:/mesos# apt install strace Reading package lists... Done Building dependency tree Reading state information... Done The following NEW packages will be installed: strace 0 upgraded, 1 newly installed, 0 to remove and 6 not upgraded. Need to get 297 kB of archives. After this operation, 1336 kB of additional disk space will be used. Get:1 http://ports.ubuntu.com/ubuntu-ports focal-updates/main arm64 strace arm64 5.5-3ubuntu1 [297 kB] Fetched 297 kB in 1s (327 kB/s) Selecting previously unselected package strace. (Reading database ... 18530 files and directories currently installed.) Preparing to unpack .../strace_5.5-3ubuntu1_arm64.deb ... Unpacking strace (5.5-3ubuntu1) ... Setting up strace (5.5-3ubuntu1) ... root@4d45b9e91754:/mesos# strace ls /usr/bin/strace: test_ptrace_get_syscall_info: PTRACE_TRACEME: Function not implemented /usr/bin/strace: ptrace(PTRACE_TRACEME, ...): Function not implemented /usr/bin/strace: PTRACE_SETOPTIONS: Function not implemented /usr/bin/strace: detach: waitpid(115): No child processes /usr/bin/strace: Process 115 detached {noformat} Did you test this docker+qemu image from a non-AMD64 host? 
> I will send you privately credentials to my ARM64 VM where you can debug it > without Docker! > Test failures on Linux ARM64 > > > Key: MESOS-10223 > URL: https://issues.apache.org/jira/browse/MESOS-10223 > Project: Mesos > Issue Type: Bug >Reporter: Martin Tzvetanov Grigorov >Priority: Major > Attachments: mesos-on-arm64.tgz > > > Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors: > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.NumberFormat > [ OK ] JsonTest.NumberFormat (0 ms) > [ RUN ] JsonTest.Find > terminate called after throwing an instance of > 'boost::exception_detail::clone_impl > >' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from > PID 2317; stack trace: *** > @ 0xa80e77fc ([vdso]+0x7fb) > @ 0xa7b71188 gsignal > @ 0xa7b5ddac abort > @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d715b0 __cxa_rethrow > @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d71544 __cxa_throw > @ 0xab4ee114 boost::throw_exception<>() > @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>() > @ 0xab5c2228 boost::lexical_cast<>() > @ 0xab5bf89c numify<>() > @ 0xab5e00e8 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5cdd2c JsonTest_Find_Test::TestBody() > @ 0xab886fec > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87f1d4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab85a9d0 testing::Test::Run() > @ 0xab85b258 testing::TestInfo::Run() > @ 0xab85b8d0 testing::TestCase::Run() > @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests() > @ 0xab888440 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87ffd4 
> testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab86100c testing::UnitTest::Run() > @ 0xab630950 RUN_ALL_TESTS() > @ 0xab630418 main > @ 0xa7b5e110 __libc_start_main > @ 0xab4b41d4 (unknown) > [FAIL]: 8 shard(s) have failed tests > make[6]: *** [Makefile:2092: check-local] Error 8 > make[6]: Leaving directory >
[jira] [Commented] (MESOS-10223) Test failures on Linux ARM64
[ https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17366756#comment-17366756 ] Charles Natali commented on MESOS-10223: [~mgrigorov] Finally had time to have a look. Using your docker image I can't reproduce it: {noformat} [ RUN ] JsonTest.Find [ OK ] JsonTest.Find (21 ms) {noformat} I am however seeing other failures like: {noformat} [ RUN ] ProcessTest.Processes ../../../3rdparty/stout/tests/os/process_tests.cpp:139: Failure Expected: getppid() Which is: 1 To be equal to: process.parent Which is: 0 ../../../3rdparty/stout/tests/os/process_tests.cpp:144: Failure Expected: getsid(0) Which is: 1 To be equal to: process.session.get() Which is: 0 ../../../3rdparty/stout/tests/os/process_tests.cpp:148: Failure Expected: (process.rss.get()) > (0), actual: 0B vs 0 [ FAILED ] ProcessTest.Processes (9 ms) {noformat} However, they could be due to the fact that it's running inside docker/QEMU. QEMU has to do syscall ABI translation, and for example it doesn't support ptrace: {noformat} root@4d45b9e91754:/mesos# strace /bin/true /usr/bin/strace: test_ptrace_get_syscall_info: PTRACE_TRACEME: Function not implemented /usr/bin/strace: ptrace(PTRACE_TRACEME, ...): Function not implemented /usr/bin/strace: PTRACE_SETOPTIONS: Function not implemented /usr/bin/strace: detach: waitpid(124): No child processes /usr/bin/strace: Process 124 detached root@4d45b9e91754:/mesos# {noformat} So it can very well cause the other failures above (which I can't debug without strace or gdb...). Regarding your original problem, is it on a real ARM64 host or within docker/qemu etc? 
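The "Function not implemented" text in the strace output above is the message for {{ENOSYS}}, which is why it points at the emulation layer rather than at a missing capability. A minimal C++ sketch of that distinction follows. The helper name diagnosePtraceErrno is hypothetical and is not part of Mesos or strace, and the mapping is only a heuristic, since a seccomp profile can be configured to return other errno values.

```cpp
#include <cerrno>
#include <string>

// Hypothetical helper for illustration only; not part of Mesos or strace.
// Maps the errno left by a failed ptrace() call to a likely cause.
std::string diagnosePtraceErrno(int error)
{
  switch (error) {
    case 0:
      return "ok";
    case ENOSYS:
      // The syscall itself is missing, e.g. in a user-mode emulation layer.
      return "not implemented: kernel or emulation layer lacks the syscall";
    case EPERM:
      // The syscall exists but the caller is not allowed to use it.
      return "permission denied: check capabilities, seccomp or LSM policy";
    default:
      return "other error";
  }
}
```

With this reading, adding {{--cap-add=SYS_PTRACE}} would be expected to help only in the {{EPERM}} case.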
> Test failures on Linux ARM64 > > > Key: MESOS-10223 > URL: https://issues.apache.org/jira/browse/MESOS-10223 > Project: Mesos > Issue Type: Bug >Reporter: Martin Tzvetanov Grigorov >Priority: Major > Attachments: mesos-on-arm64.tgz > > > Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors: > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.NumberFormat > [ OK ] JsonTest.NumberFormat (0 ms) > [ RUN ] JsonTest.Find > terminate called after throwing an instance of > 'boost::exception_detail::clone_impl > >' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from > PID 2317; stack trace: *** > @ 0xa80e77fc ([vdso]+0x7fb) > @ 0xa7b71188 gsignal > @ 0xa7b5ddac abort > @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d715b0 __cxa_rethrow > @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d71544 __cxa_throw > @ 0xab4ee114 boost::throw_exception<>() > @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>() > @ 0xab5c2228 boost::lexical_cast<>() > @ 0xab5bf89c numify<>() > @ 0xab5e00e8 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5cdd2c JsonTest_Find_Test::TestBody() > @ 0xab886fec > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87f1d4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab85a9d0 testing::Test::Run() > @ 0xab85b258 testing::TestInfo::Run() > @ 0xab85b8d0 testing::TestCase::Run() > @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests() > @ 0xab888440 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87ffd4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab86100c 
testing::UnitTest::Run() > @ 0xab630950 RUN_ALL_TESTS() > @ 0xab630418 main > @ 0xa7b5e110 __libc_start_main > @ 0xab4b41d4 (unknown) > [FAIL]: 8 shard(s) have failed tests > make[6]: *** [Makefile:2092: check-local] Error 8 > make[6]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[5]: *** [Makefile:1840: check-am] Error 2 > make[5]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[4]: *** [Makefile:1685: check-recursive] Error 1 > make[4]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[3]: *** [Makefile:1842: check] Error 2 > make[3]: Leaving directory >
[jira] [Commented] (MESOS-10159) Running unit test command hangs
[ https://issues.apache.org/jira/browse/MESOS-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365141#comment-17365141 ] Charles Natali commented on MESOS-10159: [~jineshpatel] I know it's been a while but if you're still experiencing the issue please let us know, otherwise I'll close this ticket. Cheers, > Running unit test command hangs > --- > > Key: MESOS-10159 > URL: https://issues.apache.org/jira/browse/MESOS-10159 > Project: Mesos > Issue Type: Bug > Components: test > Environment: OS: Ubuntu 20.04 > Arch: Intel >Reporter: Jinesh Patel >Priority: Minor > Labels: test > > Running the `make check` command to execute mesos test cases hangs after > printing failed test results. The process doesn't hang if all test cases pass. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-9713) Support specifying output file name for URI fetcher
[ https://issues.apache.org/jira/browse/MESOS-9713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365096#comment-17365096 ] Charles Natali commented on MESOS-9713: --- Great! Can't find any way to send a DM on here so here's my email: cf.natali _at_ gmail.com Shoot me an email and we can chat a bit about what you could do! > Support specifying output file name for URI fetcher > --- > > Key: MESOS-9713 > URL: https://issues.apache.org/jira/browse/MESOS-9713 > Project: Mesos > Issue Type: Improvement > Components: fetcher >Reporter: Qian Zhang >Priority: Major > Labels: newbie > > Currently URI fetcher's `fetch` method is defined like below: > {code:java} > process::Future fetch( > const URI& uri, > const std::string& directory, > const Option& data = None()) const; > {code} > So caller can only specify the directory that the URI will be downloaded to > but not the name of the output file which has to be same with base name of > the URI path. We'd better to introduce an output file name parameter so that > caller can customize the output file name. -- This message was sent by Atlassian Jira (v8.3.4#803005)
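One way to picture the request in this ticket is the path computation below. It is only a sketch under stated assumptions: the function outputPath and its outputFileName parameter are hypothetical, it uses std::optional and std::filesystem (C++17) instead of the stout Option type, and it does not reflect how the Mesos URI fetcher is actually implemented.

```cpp
#include <filesystem>
#include <optional>
#include <string>

// Hypothetical helper, not the Mesos fetcher API: computes where a fetched
// URI would be written. When no output file name is given, it falls back to
// the base name of the URI path, which is the behaviour the ticket describes.
std::filesystem::path outputPath(
    const std::string& uriPath,
    const std::filesystem::path& directory,
    const std::optional<std::string>& outputFileName = std::nullopt)
{
  const std::string name = outputFileName.has_value()
    ? *outputFileName
    : std::filesystem::path(uriPath).filename().string();

  return directory / name;
}
```

Keeping the new parameter optional and last would leave existing call sites unchanged, which is presumably why the ticket is labelled as a newbie task.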
[jira] [Commented] (MESOS-10211) mesos agent crashes every time when launched tensorboard in a horovod image with mesos container
[ https://issues.apache.org/jira/browse/MESOS-10211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365087#comment-17365087 ] Charles Natali commented on MESOS-10211: [~ggmmggmm2] so could you give more details? > mesos agent crashes every time when launched tensorboard in a horovod image > with mesos container > > > Key: MESOS-10211 > URL: https://issues.apache.org/jira/browse/MESOS-10211 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.11.0 > Environment: agent:ubuntu18.04 >Reporter: YZ sun >Priority: Critical > > When launch a task using image > "horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1", > if tensorboard in this image is started, > the agent node will immediately crash every time. > if tensorboard is not started by command, mesos will just work as expected. > agent log looks like below: > {code:java} > //agent crash > I0127 16:07:21.860065 30960 slave.cpp:3181] Launching task > 'baseEnvSingle_gpunode1' for framework baseDevEnv_root_1611734806 > F0127 16:07:21.860143 30960 slave.cpp:3194] Check failed: executor == nullptr > *** Check failure stack trace: *** > @ 0x7f2bcc4221fc google::LogMessage::Fail() > @ 0x7f2bcc422145 google::LogMessage::SendToLog() > @ 0x7f2bcc421ad1 google::LogMessage::Flush() > @ 0x7f2bcc4251e8 google::LogMessageFatal::~LogMessageFatal() > @ 0x7f2bca4cb10b mesos::internal::slave::Slave::__run() > @ 0x7f2bca570ac6 > _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSB_INS1_13TaskGroupInfoEERKSt6vectorINS2_19ResourceVersionUUIDESaISL_EERKSB_IbEbS7_SA_SF_SJ_SP_SS_bEEvRKNS_3PIDIT_EEMSU_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_ENKUlOS5_OS8_OSD_OSH_OSN_OSQ_ObPNS_11ProcessBaseEE_clES1L_S1M_S1N_S1O_S1P_S1Q_S1R_S1T_ > @ 0x7f2bca663b01 > 
_ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS3_13FrameworkInfoERKNS3_12ExecutorInfoERK6OptionINS3_8TaskInfoEERKSD_INS3_13TaskGroupInfoEERKSt6vectorINS4_19ResourceVersionUUIDESaISN_EERKSD_IbEbS9_SC_SH_SL_SR_SU_bEEvRKNS1_3PIDIT_EEMSW_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS7_OSA_OSF_OSJ_OSP_OSS_ObPNS1_11ProcessBaseEE_JS7_SA_SF_SJ_SP_SS_bS1V_EEEDTclcl7forwardISW_Efp_Espcl7forwardIT0_Efp0_EEEOSW_DpOS1X_ > @ 0x7f2bca6555dc > _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS4_13FrameworkInfoERKNS4_12ExecutorInfoERK6OptionINS4_8TaskInfoEERKSE_INS4_13TaskGroupInfoEERKSt6vectorINS5_19ResourceVersionUUIDESaISO_EERKSE_IbEbSA_SD_SI_SM_SS_SV_bEEvRKNS2_3PIDIT_EEMSX_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS8_OSB_OSG_OSK_OSQ_OST_ObPNS2_11ProcessBaseEE_JS8_SB_SG_SK_SQ_ST_bSt12_PlaceholderILi113invoke_expandIS1X_St5tupleIJS8_SB_SG_SK_SQ_ST_bS1Z_EES22_IJOS1W_EEJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7DTcl6invokecl7forwardISX_Efp_Espcl6expandcl3getIXT2_EEcl7forwardIS11_Efp0_EEcl7forwardIS12_Efp2_OSX_OS11_N5cpp1416integer_sequenceImJXspT2_OS12_ > @ 0x7f2bca64da94 > _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS4_13FrameworkInfoERKNS4_12ExecutorInfoERK6OptionINS4_8TaskInfoEERKSE_INS4_13TaskGroupInfoEERKSt6vectorINS5_19ResourceVersionUUIDESaISO_EERKSE_IbEbSA_SD_SI_SM_SS_SV_bEEvRKNS2_3PIDIT_EEMSX_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS8_OSB_OSG_OSK_OSQ_OST_ObPNS2_11ProcessBaseEE_JS8_SB_SG_SK_SQ_ST_bSt12_PlaceholderILi1clIJS1W_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOS25_ > @ 0x7f2bca647e56 > 
_ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS6_13FrameworkInfoERKNS6_12ExecutorInfoERK6OptionINS6_8TaskInfoEERKSG_INS6_13TaskGroupInfoEERKSt6vectorINS7_19ResourceVersionUUIDESaISQ_EERKSG_IbEbSC_SF_SK_SO_SU_SX_bEEvRKNS4_3PIDIT_EEMSZ_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSA_OSD_OSI_OSM_OSS_OSV_ObPNS4_11ProcessBaseEE_JSA_SD_SI_SM_SS_SV_bSt12_PlaceholderILi1EJS1Y_EEEDTclcl7forwardISZ_Efp_Espcl7forwardIT0_Efp0_EEEOSZ_DpOS23_ > @ 0x7f2bca645145 > _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS7_13FrameworkInfoERKNS7_12ExecutorInfoERK6OptionINS7_8TaskInfoEERKSH_INS7_13TaskGroupInfoEERKSt6vectorINS8_19ResourceVersionUUIDESaISR_EERKSH_IbEbSD_SG_SL_SP_SV_SY_bEEvRKNS5_3PIDIT_EEMS10_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSB_OSE_OSJ_OSN_OST_OSW_ObPNS5_11ProcessBaseEE_JSB_SE_SJ_SN_ST_SW_bSt12_PlaceholderILi1EJS1Z_EEEvOS10_DpOT0_ > @
[jira] [Commented] (MESOS-10216) Replicated log key encoding overflows into negative values
[ https://issues.apache.org/jira/browse/MESOS-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365085#comment-17365085 ] Charles Natali commented on MESOS-10216: [~asekretenko] so what do you think, can we close? > Replicated log key encoding overflows into negative values > -- > > Key: MESOS-10216 > URL: https://issues.apache.org/jira/browse/MESOS-10216 > Project: Mesos > Issue Type: Bug > Components: replicated log >Affects Versions: 1.7.3, 1.8.1, 1.9.1, 1.11.0, 1.10.1, 1.12.0 >Reporter: Ilya >Assignee: Charles Natali >Priority: Major > Fix For: 1.12.0 > > > LevelDB keys used by {{LevelDBStorage}} are {{uint64_t}} log positions > encoded as strings and padded with zeroes up to a certain fixed size. The > {{encode()}} function is incorrect because it uses the {{%d}} formatter that > expects an {{int}}. It also limits the key size to 10 digits which is OK for > {{UINT32_MAX}} but isn't enough for {{UINT64_MAX}}. > Because of this the available key range is reduced, and key overflow can > result in replica's {{METADATA}} record (position 0) being overwritten, which > in turn may cause data loss. -- This message was sent by Atlassian Jira (v8.3.4#803005)
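The encoding problem described in this ticket can be illustrated with a short C++ sketch. This is not the actual Mesos fix: the function name encodeKey is an assumption made for the example. It formats a {{uint64_t}} position with the matching {{PRIu64}} conversion and pads to 20 digits, the width of {{UINT64_MAX}} (18446744073709551615), so that string order matches numeric order over the whole range.

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper for illustration; not the actual LevelDBStorage code.
// Encodes a log position as a fixed-width, zero-padded decimal string.
std::string encodeKey(uint64_t position)
{
  // 20 digits for UINT64_MAX plus the terminating NUL byte.
  char buffer[21];

  // PRIu64 expands to the correct conversion for uint64_t, unlike "%d",
  // which expects an int and misreads large values.
  std::snprintf(buffer, sizeof(buffer), "%020" PRIu64, position);

  return std::string(buffer);
}
```

Note that widening the key changes the on-disk format, so a database already written with 10-digit keys would presumably need some form of migration; that is outside the scope of this sketch.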
[jira] [Commented] (MESOS-10219) 1.11.0 does not build on Windows
[ https://issues.apache.org/jira/browse/MESOS-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365076#comment-17365076 ] Charles Natali commented on MESOS-10219: [~acecile555] As [~apeters] mentioned, the project is currently very short-staffed. It was actually quite close to shutting down just a few weeks ago, see [https://lists.apache.org/thread.html/rab2a820507f7c846e54a847398ab20f47698ec5bce0c8e182bfe51ba%40%3Cdev.mesos.apache.org%3E] Right now it's just Andreas and me actively contributing, and a couple of other committers who do reviews. And unfortunately, I don't have any experience with Windows. And to be honest, I managed to avoid working with Windows in my 15 years of experience, so I'm not really motivated to spend hours debugging some Windows-specific issues, sorry. So I think your options are: # Continue the work you've been doing. You've been making progress, hopefully you'll get there eventually, and learn in the process. As a bonus, it lets you familiarize yourself with the code base, in case you'd be interested to contribute to the project on an ongoing basis. # Ask for help on the users and developers mailing lists ([http://mesos.apache.org/community/#mailing-lists]) - maybe someone who knows Windows will be willing to help. # Give up. From my point of view, while I'm not willing to spend days learning about Windows and its various cryptic APIs, I'll be willing to review your changes and help get them merged. Good luck! > 1.11.0 does not build on Windows > > > Key: MESOS-10219 > URL: https://issues.apache.org/jira/browse/MESOS-10219 > Project: Mesos > Issue Type: Bug > Components: agent, build, cmake >Affects Versions: 1.11.0 >Reporter: acecile555 >Priority: Major > Attachments: mesos_slave_windows_longpath.png, > patch_1.10.0_windows_build.diff > > > Hello, > > I just tried building Mesos 1.11.0 on Windows and this is not working. 
> > The first issue is libarchive compilation that can be easily workarounded by > adding the following hunk to 3rdparty/libarchive-3.3.2.patch: > {noformat} > --- a/CMakeLists.txt > +++ b/CMakeLists.txt > @@ -137,7 +137,7 @@ ># This is added into CMAKE_C_FLAGS when CMAKE_BUILD_TYPE is "Debug" ># Enable level 4 C4061: The enumerate has no associated handler in a switch ># statement. > - SET(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /we4061") > + #SET(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /we4061") ># Enable level 4 C4254: A larger bit field was assigned to a smaller bit ># field. >SET(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /we4254") > {noformat} > Sadly it is failing later with issue I cannot solve myself: > {noformat} > C:\Users\earthlab\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot > open include file: 'csi/state.pb.h': No such file or directory (compiling > source file C:\Users\earthlab\mesos\src\slave\csi_server.cpp) > [C:\Users\earthlab\mesos\build\src\mesos.vcxproj] > qos_controller.cpp > resource_estimator.cpp > slave.cpp > state.cpp > task_status_update_manager.cpp > sandbox.cpp > C:\Users\earthlab\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot > open include file: 'csi/state.pb.h': No such file or directory (compiling > source file C:\Users\earthlab\mesos\src\slave\slave.cpp) > [C:\Users\earthlab\mesos\build\src\mesos.vcxproj] > composing.cpp > isolator.cpp > C:\Users\earthlab\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot > open include file: 'csi/state.pb.h': No such file or directory (compiling > source file C:\Users\earthlab\mesos\src\slave\task_status_update_manager.cpp) > [C:\Users\earthlab\mesos\build\src\mesos.vcxproj] > isolator_tracker.cpp > launch.cpp > C:\Users\earthlab\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot > open include file: 'csi/state.pb.h': No such file or directory (compiling > source file C:\Users\earthlab\mesos\src\slave\containerizer\composing.cpp) > 
[C:\Users\earthlab\mesos\build\src\mesos.vcxproj] > launcher.cpp > C:\Users\earthlab\mesos\src\slave\containerizer\mesos\launch.cpp(524,34): > error C2668: 'os::spawn': ambiguous call to overloaded function > [C:\Users\earthlab\mesos\build\src\mesos.vcxproj] > C:\Users\earthlab\mesos\3rdparty\stout\include\stout/os/exec.hpp(52,20): > message : could be 'Option os::spawn(const std::string &,const > std::vector> &)' > [C:\Users\earthlab\mesos\build\src\mesos.vcxproj] > with > [ > T=int > ] (compiling source file > C:\Users\earthlab\mesos\src\slave\containerizer\mesos\launch.cpp) > C:\Users\earthlab\mesos\3rdparty\stout\include\stout/os/windows/exec.hpp(412,20): > message : or 'Option os::spawn(const
[jira] [Commented] (MESOS-10219) 1.11.0 does not build on Windows
[ https://issues.apache.org/jira/browse/MESOS-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364769#comment-17364769 ] Charles Natali commented on MESOS-10219: Hey, I'm fairly proficient with POSIX and C++, however I don't know anything about Windows. I can try to have a look, however it's not clear to me whether this builds or not on Windows: when you say {quote}From src/slave/containerizer/mesos/isolators/filesystem/posix.cpp It's raising the exception saying the file does not exist. {quote} I assume you mean that the agent logs an error at runtime? If yes, then I don't think it's supposed to work at all on Windows: I don't think you should be using the posix isolator on Windows - the doc mentions ([https://mesos.apache.org/documentation/latest/configuration/agent/]): {noformat} default: windows/cpu,windows/mem on Windows; posix/cpu,posix/mem on other platforms) {noformat} > 1.11.0 does not build on Windows > > > Key: MESOS-10219 > URL: https://issues.apache.org/jira/browse/MESOS-10219 > Project: Mesos > Issue Type: Bug > Components: agent, build, cmake >Affects Versions: 1.11.0 >Reporter: acecile555 >Priority: Major > Attachments: patch_1.10.0_windows_build.diff > > > Hello, > > I just tried building Mesos 1.11.0 on Windows and this is not working. > > The first issue is libarchive compilation that can be easily workarounded by > adding the following hunk to 3rdparty/libarchive-3.3.2.patch: > {noformat} > --- a/CMakeLists.txt > +++ b/CMakeLists.txt > @@ -137,7 +137,7 @@ ># This is added into CMAKE_C_FLAGS when CMAKE_BUILD_TYPE is "Debug" ># Enable level 4 C4061: The enumerate has no associated handler in a switch ># statement. > - SET(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /we4061") > + #SET(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /we4061") ># Enable level 4 C4254: A larger bit field was assigned to a smaller bit ># field. 
>SET(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /we4254") > {noformat} > Sadly it is failing later with issue I cannot solve myself: > {noformat} > C:\Users\earthlab\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot > open include file: 'csi/state.pb.h': No such file or directory (compiling > source file C:\Users\earthlab\mesos\src\slave\csi_server.cpp) > [C:\Users\earthlab\mesos\build\src\mesos.vcxproj] > qos_controller.cpp > resource_estimator.cpp > slave.cpp > state.cpp > task_status_update_manager.cpp > sandbox.cpp > C:\Users\earthlab\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot > open include file: 'csi/state.pb.h': No such file or directory (compiling > source file C:\Users\earthlab\mesos\src\slave\slave.cpp) > [C:\Users\earthlab\mesos\build\src\mesos.vcxproj] > composing.cpp > isolator.cpp > C:\Users\earthlab\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot > open include file: 'csi/state.pb.h': No such file or directory (compiling > source file C:\Users\earthlab\mesos\src\slave\task_status_update_manager.cpp) > [C:\Users\earthlab\mesos\build\src\mesos.vcxproj] > isolator_tracker.cpp > launch.cpp > C:\Users\earthlab\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot > open include file: 'csi/state.pb.h': No such file or directory (compiling > source file C:\Users\earthlab\mesos\src\slave\containerizer\composing.cpp) > [C:\Users\earthlab\mesos\build\src\mesos.vcxproj] > launcher.cpp > C:\Users\earthlab\mesos\src\slave\containerizer\mesos\launch.cpp(524,34): > error C2668: 'os::spawn': ambiguous call to overloaded function > [C:\Users\earthlab\mesos\build\src\mesos.vcxproj] > C:\Users\earthlab\mesos\3rdparty\stout\include\stout/os/exec.hpp(52,20): > message : could be 'Option os::spawn(const std::string &,const > std::vector> &)' > [C:\Users\earthlab\mesos\build\src\mesos.vcxproj] > with > [ > T=int > ] (compiling source file > C:\Users\earthlab\mesos\src\slave\containerizer\mesos\launch.cpp) > 
C:\Users\earthlab\mesos\3rdparty\stout\include\stout/os/windows/exec.hpp(412,20): > message : or 'Option os::spawn(const std::string &,const > std::vector> &,const > Option,std::allocator std::string,std::string &)' > [C:\Users\earthlab\mesos\build\src\mesos.vcxproj] > with > [ > T=int > ] (compiling source file > C:\Users\earthlab\mesos\src\slave\containerizer\mesos\launch.cpp) > C:\Users\earthlab\mesos\src\slave\containerizer\mesos\launch.cpp(525,75): > message : while trying to match the argument list '(const char [3], > initializer list)' [C:\Users\earthlab\mesos\build\src\mesos.vcxproj] > C:\Users\earthlab\mesos\src\slave\containerizer\mesos\launch.cpp(893,47): > error C2668:
[jira] [Commented] (MESOS-9950) memory cgroup gone before isolator cleaning up
[ https://issues.apache.org/jira/browse/MESOS-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363126#comment-17363126 ] Charles Natali commented on MESOS-9950: --- [~subhajitpalit] Is the agent started via systemd? If yes, could you post the output of: {noformat} # systemctl show | grep Delegate {noformat} > memory cgroup gone before isolator cleaning up > -- > > Key: MESOS-9950 > URL: https://issues.apache.org/jira/browse/MESOS-9950 > Project: Mesos > Issue Type: Bug > Components: containerization >Reporter: longfei >Priority: Major > > The memcg created by mesos may have been deleted before cgroup/memory > isolator cleaning up. > This would let the termination fail and lose information in the old > termination(before fail). > {code:java} > I0821 15:16:03.025796 3354800 paths.cpp:745] Creating sandbox > '/opt/tiger/mesos_deploy_videoarch/mesos_zeus/slave/slaves/fb5c1a5b-e106-47c1-9fe3-6ebd311b30ee-S628/frameworks/8e4967e5-736e-4a22-90c3-7b32d526914d-/executors/mt:z03584687:1/runs/a0706ca0-fe2c-4477-8161-329b26ea5d89' > for user 'tiger' > I0821 15:16:03.026199 3354800 paths.cpp:748] Creating sandbox > '/opt/tiger/mesos_deploy_videoarch/mesos_zeus/slave/meta/slaves/fb5c1a5b-e106-47c1-9fe3-6ebd311b30ee-S628/frameworks/8e4967e5-736e-4a22-90c3-7b32d526914d-/executors/mt:z03584687:1/runs/a0706ca0-fe2c-4477-8161-329b26ea5d89' > I0821 15:16:03.026304 3354800 slave.cpp:9064] Launching executor > 'mt:z03584687:1' of framework > 8e4967e5-736e-4a22-90c3-7b32d526914d- with resources > [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}] > in work directory > '/opt/tiger/mesos_deploy_videoarch/mesos_zeus/slave/slaves/fb5c1a5b-e106-47c1-9fe3-6ebd311b30ee-S628/frameworks/8e4967e5-736e-4a22-90c3-7b32d526914d-/executors/mt:z03584687:1/runs/a0706ca0-fe2c-4477-8161-329b26ea5d89' > I0821 15:16:03.051795 3354800 slave.cpp:3520] Launching 
container > a0706ca0-fe2c-4477-8161-329b26ea5d89 for executor > 'mt:z03584687:1' of framework > 8e4967e5-736e-4a22-90c3-7b32d526914d- > I0821 15:16:03.076608 3354807 containerizer.cpp:1325] Starting container > a0706ca0-fe2c-4477-8161-329b26ea5d89 > I0821 15:16:03.076911 3354807 containerizer.cpp:3185] Transitioning the state > of container a0706ca0-fe2c-4477-8161-329b26ea5d89 from PROVISIONING to > PREPARING > I0821 15:16:03.077906 3354802 memory.cpp:478] Started listening for OOM > events for container a0706ca0-fe2c-4477-8161-329b26ea5d89 > I0821 15:16:03.079540 3354804 memory.cpp:198] Updated > 'memory.soft_limit_in_bytes' to 4032MB for container > a0706ca0-fe2c-4477-8161-329b26ea5d89 > I0821 15:16:03.079587 3354820 cpu.cpp:92] Updated 'cpu.shares' to 1126 (cpus > 1.1) for container a0706ca0-fe2c-4477-8161-329b26ea5d89 > I0821 15:16:03.079589 3354804 memory.cpp:227] Updated 'memory.limit_in_bytes' > to 4032MB for container a0706ca0-fe2c-4477-8161-329b26ea5d89 > I0821 15:16:03.080901 3354802 switchboard.cpp:316] Container logger module > finished preparing container a0706ca0-fe2c-4477-8161-329b26ea5d89; > IOSwitchboard server is not required > I0821 15:16:03.081593 3354801 linux_launcher.cpp:492] Launching container > a0706ca0-fe2c-4477-8161-329b26ea5d89 and cloning with namespaces > I0821 15:16:03.083823 3354808 containerizer.cpp:2107] Checkpointing > container's forked pid 1857418 to > '/opt/tiger/mesos_deploy_videoarch/mesos_zeus/slave/meta/slaves/fb5c1a5b-e106-47c1-9fe3-6ebd311b30ee-S628/frameworks/8e4967e5-736e-4a22-90c3-7b32d526914d-/executors/mt:z03584687:1/runs/a0706ca0-fe2c-4477-8161-329b26ea5d89/pids/forked.pid' > I0821 15:16:03.084156 3354808 containerizer.cpp:3185] Transitioning the state > of container a0706ca0-fe2c-4477-8161-329b26ea5d89 from PREPARING to ISOLATING > I0821 15:16:03.091468 3354808 containerizer.cpp:3185] Transitioning the state > of container a0706ca0-fe2c-4477-8161-329b26ea5d89 from ISOLATING to FETCHING > I0821 15:16:03.094933 
3354808 containerizer.cpp:3185] Transitioning the state > of container a0706ca0-fe2c-4477-8161-329b26ea5d89 from FETCHING to RUNNING > I0821 15:16:03.197753 3354808 memory.cpp:198] Updated > 'memory.soft_limit_in_bytes' to 4032MB for container > a0706ca0-fe2c-4477-8161-329b26ea5d89 > I0821 15:16:03.197757 3354801 cpu.cpp:92] Updated 'cpu.shares' to 1126 (cpus > 1.1) for container a0706ca0-fe2c-4477-8161-329b26ea5d89 > I0821 15:21:39.692978 3354814 memory.cpp:515] OOM detected for container > a0706ca0-fe2c-4477-8161-329b26ea5d89 > I0821 15:21:39.693182 3354805 containerizer.cpp:3044] Container > a0706ca0-fe2c-4477-8161-329b26ea5d89 has reached its limit for resource [] >
[jira] [Comment Edited] (MESOS-10223) Test failures on Linux ARM64
[ https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358829#comment-17358829 ] Charles Natali edited comment on MESOS-10223 at 6/7/21, 7:47 PM: - Ah, I misinterpreted: {quote}If you don't have access to Linux ARM64 to reproduce it and to test potential fixes I've attached [^mesos-on-arm64.tgz] {quote} OK, I'll try to have a look at it sometime this week. From a quick look, this part of the traceback for one of the failing tests is interesting: {noformat} @ 0xb0040544 __cxa_throw @ 0xaddee114 boost::throw_exception<>() @ 0xadec512c boost::conversion::detail::throw_bad_cast<>() @ 0xadec2228 boost::lexical_cast<>() @ 0xadebf89c numify<>() @ 0xadf71e3c proc::pids() @ 0xadf73594 os::pids() @ 0xadf73fb4 os::processes() {noformat} It looks like it's failing to parse PIDs under {{/proc}}. However, looking at the code - https://github.com/apache/mesos/blob/7841fcc848ebaac5de43cd4cccf1c243a3cdff56/3rdparty/stout/include/stout/numify.hpp#L49 - {{boost::bad_lexical_cast}} should be handled, so it's a bit strange. It looks like a problem during building/linking; it might be missing some unwinder symbols. Did you cross-compile this? 
> Test failures on Linux ARM64 > > > Key: MESOS-10223 > URL: https://issues.apache.org/jira/browse/MESOS-10223 > Project: Mesos > Issue Type: Bug >Reporter: Martin Tzvetanov Grigorov >Priority: Major > Attachments: mesos-on-arm64.tgz > > > Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors: > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.NumberFormat > [ OK ] JsonTest.NumberFormat (0 ms) > [ RUN ] JsonTest.Find > terminate called after throwing an instance of > 'boost::exception_detail::clone_impl > >' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from > PID 2317; stack trace: *** > @ 0xa80e77fc ([vdso]+0x7fb) > @ 0xa7b71188 gsignal > @ 0xa7b5ddac abort > @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d715b0 __cxa_rethrow > @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d71544 __cxa_throw > @ 0xab4ee114 boost::throw_exception<>() > @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>() > @ 0xab5c2228 boost::lexical_cast<>() > @ 0xab5bf89c numify<>() > @ 0xab5e00e8 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5cdd2c JsonTest_Find_Test::TestBody() > @ 0xab886fec > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87f1d4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab85a9d0 
testing::Test::Run() > @ 0xab85b258 testing::TestInfo::Run() > @ 0xab85b8d0 testing::TestCase::Run() > @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests() > @ 0xab888440 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87ffd4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab86100c testing::UnitTest::Run() > @ 0xab630950 RUN_ALL_TESTS() > @ 0xab630418
[jira] [Commented] (MESOS-10223) Test failures on Linux ARM64
[ https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358829#comment-17358829 ] Charles Natali commented on MESOS-10223: Ah, I misinterpreted: {quote}If you don't have access to Linux ARM64 to reproduce it and to test potential fixes I've attached [^mesos-on-arm64.tgz] {quote} OK, I'll try to have a look at it sometime this week. From a quick look, this part of the traceback for one of the failing tests is interesting: {noformat} @ 0xb0040544 __cxa_throw @ 0xaddee114 boost::throw_exception<>() @ 0xadec512c boost::conversion::detail::throw_bad_cast<>() @ 0xadec2228 boost::lexical_cast<>() @ 0xadebf89c numify<>() @ 0xadf71e3c proc::pids() @ 0xadf73594 os::pids() @ 0xadf73fb4 os::processes() {noformat} It looks like it's failing to parse PIDs under {{/proc}}. However, looking at the code [https://github.com/apache/mesos/blob/7841fcc848ebaac5de43cd4cccf1c243a3cdff56/3rdparty/stout/include/stout/numify.hpp#L49], {{boost::bad_lexical_cast}} should be handled, so it's a bit strange. It looks like a problem during building/linking; it might be missing some unwinder symbols. Did you cross-compile this? 
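To make the expected behaviour concrete, here is a minimal Python sketch of the pattern described in the comment above: each entry under {{/proc}} is converted to a number, and a failed conversion is caught and skipped rather than propagated. This is only an illustration, not the stout code; the names `numify` and `pids` are borrowed from the traceback for readability, and the Python `ValueError` stands in for {{boost::bad_lexical_cast}}.

```python
import os
from typing import List, Optional


def numify(text: str) -> Optional[int]:
    """Return text as a non-negative integer, or None if it is not numeric."""
    try:
        value = int(text)
    except ValueError:
        # The conversion failure is handled here instead of escaping
        # to the caller, which is what the comment expects of numify.
        return None
    return value if value >= 0 else None


def pids(entries: List[str]) -> List[int]:
    """Keep only the entries that parse as PIDs, in ascending order."""
    parsed = (numify(entry) for entry in entries)
    return sorted(pid for pid in parsed if pid is not None)


if __name__ == "__main__":
    # On Linux, /proc mixes numeric PID directories with names like "cpuinfo".
    if os.path.isdir("/proc"):
        print(pids(os.listdir("/proc"))[:5])
```

If the exception escapes despite a handler like this being present in the source, the handler itself is unlikely to be at fault, which is consistent with the suspicion of a build or link problem.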
> Test failures on Linux ARM64 > > > Key: MESOS-10223 > URL: https://issues.apache.org/jira/browse/MESOS-10223 > Project: Mesos > Issue Type: Bug >Reporter: Martin Tzvetanov Grigorov >Priority: Major > Attachments: mesos-on-arm64.tgz > > > Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors: > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.NumberFormat > [ OK ] JsonTest.NumberFormat (0 ms) > [ RUN ] JsonTest.Find > terminate called after throwing an instance of > 'boost::exception_detail::clone_impl > >' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from > PID 2317; stack trace: *** > @ 0xa80e77fc ([vdso]+0x7fb) > @ 0xa7b71188 gsignal > @ 0xa7b5ddac abort > @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d715b0 __cxa_rethrow > @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d71544 __cxa_throw > @ 0xab4ee114 boost::throw_exception<>() > @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>() > @ 0xab5c2228 boost::lexical_cast<>() > @ 0xab5bf89c numify<>() > @ 0xab5e00e8 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5cdd2c JsonTest_Find_Test::TestBody() > @ 0xab886fec > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87f1d4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab85a9d0 testing::Test::Run() > @ 0xab85b258 testing::TestInfo::Run() > @ 0xab85b8d0 testing::TestCase::Run() > @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests() > @ 0xab888440 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87ffd4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab86100c 
testing::UnitTest::Run() > @ 0xab630950 RUN_ALL_TESTS() > @ 0xab630418 main > @ 0xa7b5e110 __libc_start_main > @ 0xab4b41d4 (unknown) > [FAIL]: 8 shard(s) have failed tests > make[6]: *** [Makefile:2092: check-local] Error 8 > make[6]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[5]: *** [Makefile:1840: check-am] Error 2 > make[5]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[4]: *** [Makefile:1685: check-recursive] Error 1 > make[4]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[3]: *** [Makefile:1842: check] Error 2 > make[3]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[2]: *** [Makefile:1153: check-recursive] Error 1 > make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make[1]: *** [Makefile:1306: check] Error 2 > make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make: *** [Makefile:785: check-recursive] Error 1 > {code} > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.InvalidUTF8 > [ OK ] JsonTest.InvalidUTF8 (0 ms) > [
[jira] [Commented] (MESOS-10223) Test failures on Linux ARM64
[ https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358806#comment-17358806 ] Charles Natali commented on MESOS-10223: Hi [~mgrigorov], thanks! Would it be possible for you to open a PR (https://github.com/apache/mesos/pulls) for the changes? That would be easier to review than a {{tar}} archive. Cheers, > Test failures on Linux ARM64 > > > Key: MESOS-10223 > URL: https://issues.apache.org/jira/browse/MESOS-10223 > Project: Mesos > Issue Type: Bug >Reporter: Martin Tzvetanov Grigorov >Priority: Major > Attachments: mesos-on-arm64.tgz > > > Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors: > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.NumberFormat > [ OK ] JsonTest.NumberFormat (0 ms) > [ RUN ] JsonTest.Find > terminate called after throwing an instance of > 'boost::exception_detail::clone_impl > >' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from > PID 2317; stack trace: *** > @ 0xa80e77fc ([vdso]+0x7fb) > @ 0xa7b71188 gsignal > @ 0xa7b5ddac abort > @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d715b0 __cxa_rethrow > @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler() > @ 0xa7d711ec (unknown) > @ 0xa7d71250 std::terminate() > @ 0xa7d71544 __cxa_throw > @ 0xab4ee114 boost::throw_exception<>() > @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>() > @ 0xab5c2228 boost::lexical_cast<>() > @ 0xab5bf89c numify<>() > @ 0xab5e00e8 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5e0584 JSON::Object::find<>() > @ 0xab5cdd2c JsonTest_Find_Test::TestBody() > @ 0xab886fec > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87f1d4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 
0xab85a9d0 testing::Test::Run() > @ 0xab85b258 testing::TestInfo::Run() > @ 0xab85b8d0 testing::TestCase::Run() > @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests() > @ 0xab888440 > testing::internal::HandleSehExceptionsInMethodIfSupported<>() > @ 0xab87ffd4 > testing::internal::HandleExceptionsInMethodIfSupported<>() > @ 0xab86100c testing::UnitTest::Run() > @ 0xab630950 RUN_ALL_TESTS() > @ 0xab630418 main > @ 0xa7b5e110 __libc_start_main > @ 0xab4b41d4 (unknown) > [FAIL]: 8 shard(s) have failed tests > make[6]: *** [Makefile:2092: check-local] Error 8 > make[6]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[5]: *** [Makefile:1840: check-am] Error 2 > make[5]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[4]: *** [Makefile:1685: check-recursive] Error 1 > make[4]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[3]: *** [Makefile:1842: check] Error 2 > make[3]: Leaving directory > '/home/ubuntu/git/apache/mesos/build/3rdparty/stout' > make[2]: *** [Makefile:1153: check-recursive] Error 1 > make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make[1]: *** [Makefile:1306: check] Error 2 > make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty' > make: *** [Makefile:785: check-recursive] Error 1 > {code} > > {code:java} > [--] 3 tests from JsonTest > [ RUN ] JsonTest.InvalidUTF8 > [ OK ] JsonTest.InvalidUTF8 (0 ms) > [ RUN ] JsonTest.ParseError > terminate called after throwing an instance of 'std::overflow_error' > terminate called recursively > *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are > using GNU date *** > PC: @0x0 (unknown) > *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from > PID 2316; stack trace: *** > @ 0x918dd7fc ([vdso]+0x7fb) > @ 0x91367188 gsignal > @ 0x91353dac abort > @ 0x91569848 __gnu_cxx::__verbose_terminate_handler() > @ 0x915671ec (unknown) > @ 
0x91567250 std::terminate() > @ 0x915675b0 __cxa_rethrow > @ 0x915697e4 __gnu_cxx::__verbose_terminate_handler() > @ 0x915671ec (unknown) > @ 0x91567250 std::terminate() > @ 0x91567544 __cxa_throw >
[jira] [Commented] (MESOS-10222) Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses
[ https://issues.apache.org/jira/browse/MESOS-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358052#comment-17358052 ] Charles Natali commented on MESOS-10222: [https://github.com/apache/mesos/pull/393] Together with the previous PR, this allows building with {{-Werror}} using: {noformat} gcc (Debian 10.2.1-6) 10.2.1 20210110 {noformat} > Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses > --- > > Key: MESOS-10222 > URL: https://issues.apache.org/jira/browse/MESOS-10222 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: Martin Tzvetanov Grigorov >Priority: Minor > Attachments: config.log > > > I am trying to build Mesos master but it fails with: > > {code:java} > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14, > from ../3rdparty/boost-1.65.0/boost/uuid/seed_rng.hpp:38, > from > ../3rdparty/boost-1.65.0/boost/uuid/random_generator.hpp:12, > from ../../3rdparty/stout/include/stout/uuid.hpp:21, > from ../../include/mesos/type_utils.hpp:36, > from ../../src/master/flags.cpp:18: > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary > parentheses in declaration of ‘assert_arg’ [-Werror=parentheses] > 188 | failed (Pred:: > | ^ > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary > parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses] > 193 | failed (boost::mpl::not_:: > | ^ > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17, > from > 
../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14, > from > ../3rdparty/boost-1.65.0/boost/range/iterator_range_core.hpp:27, > from ../3rdparty/boost-1.65.0/boost/lexical_cast.hpp:30, > from ../../3rdparty/stout/include/stout/numify.hpp:19, > from ../../3rdparty/stout/include/stout/duration.hpp:29, > from ../../3rdparty/libprocess/include/process/time.hpp:18, > from ../../3rdparty/libprocess/include/process/clock.hpp:18, > from ../../3rdparty/libprocess/include/process/future.hpp:29, > from > ../../include/mesos/authentication/secret_generator.hpp:22, > from ../../src/local/local.cpp:24: > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary > parentheses in declaration of ‘assert_arg’ [-Werror=parentheses] > 188 | failed (Pred:: > | ^ > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary > parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses] > 193 | failed (boost::mpl::not_:: > | ^ > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_adaptor.hpp:14, > from > ../3rdparty/boost-1.65.0/boost/iterator/indirect_iterator.hpp:11, > from ../../include/mesos/resources.hpp:27, > from ../../src/master/master.hpp:31, > from ../../src/master/framework.cpp:17: > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary > parentheses in declaration of ‘assert_arg’ [-Werror=parentheses] > 188 | failed (Pred:: > | ^ > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary > parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses] > 193 | failed (boost::mpl::not_:: > | ^ > In file included from >
[jira] [Commented] (MESOS-10222) Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses
[ https://issues.apache.org/jira/browse/MESOS-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358045#comment-17358045 ] Charles Natali commented on MESOS-10222: Thanks. I created a PR to fix some compilation warnings in picojson: [https://github.com/apache/mesos/pull/392] I'll have a look at boost next. > Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses > --- > > Key: MESOS-10222 > URL: https://issues.apache.org/jira/browse/MESOS-10222 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: Martin Tzvetanov Grigorov >Priority: Minor > Attachments: config.log > > > I am trying to build Mesos master but it fails with: > > {code:java} > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14, > from ../3rdparty/boost-1.65.0/boost/uuid/seed_rng.hpp:38, > from > ../3rdparty/boost-1.65.0/boost/uuid/random_generator.hpp:12, > from ../../3rdparty/stout/include/stout/uuid.hpp:21, > from ../../include/mesos/type_utils.hpp:36, > from ../../src/master/flags.cpp:18: > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary > parentheses in declaration of ‘assert_arg’ [-Werror=parentheses] > 188 | failed (Pred:: > | ^ > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary > parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses] > 193 | failed (boost::mpl::not_:: > | ^ > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17, > from > 
../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14, > from > ../3rdparty/boost-1.65.0/boost/range/iterator_range_core.hpp:27, > from ../3rdparty/boost-1.65.0/boost/lexical_cast.hpp:30, > from ../../3rdparty/stout/include/stout/numify.hpp:19, > from ../../3rdparty/stout/include/stout/duration.hpp:29, > from ../../3rdparty/libprocess/include/process/time.hpp:18, > from ../../3rdparty/libprocess/include/process/clock.hpp:18, > from ../../3rdparty/libprocess/include/process/future.hpp:29, > from > ../../include/mesos/authentication/secret_generator.hpp:22, > from ../../src/local/local.cpp:24: > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary > parentheses in declaration of ‘assert_arg’ [-Werror=parentheses] > 188 | failed (Pred:: > | ^ > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary > parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses] > 193 | failed (boost::mpl::not_:: > | ^ > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_adaptor.hpp:14, > from > ../3rdparty/boost-1.65.0/boost/iterator/indirect_iterator.hpp:11, > from ../../include/mesos/resources.hpp:27, > from ../../src/master/master.hpp:31, > from ../../src/master/framework.cpp:17: > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary > parentheses in declaration of ‘assert_arg’ [-Werror=parentheses] > 188 | failed (Pred:: > | ^ > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary > parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses] > 193 | failed (boost::mpl::not_:: > | ^ > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, >
[jira] [Commented] (MESOS-10222) Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses
[ https://issues.apache.org/jira/browse/MESOS-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17356602#comment-17356602 ] Charles Natali commented on MESOS-10222: Yes I'm seeing a similar error on my Debian bullseye: {noformat} cf@thinkpad:~$ gcc --version gcc (Debian 10.2.1-6) 10.2.1 20210110 {noformat} I'm also seeing warnings in one of our JSON libraries. [~asekretenko] It's next on my list to look at this, however assuming that the warnings have been fixed by upstream, will we want to: * update to the version fixing them * or cherry-pick individual fixes More generally what's the policy for updating third-party dependencies? > Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses > --- > > Key: MESOS-10222 > URL: https://issues.apache.org/jira/browse/MESOS-10222 > Project: Mesos > Issue Type: Bug > Components: build >Reporter: Martin Tzvetanov Grigorov >Priority: Minor > Attachments: config.log > > > I am trying to build Mesos master but it fails with: > > {code:java} > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14, > from ../3rdparty/boost-1.65.0/boost/uuid/seed_rng.hpp:38, > from > ../3rdparty/boost-1.65.0/boost/uuid/random_generator.hpp:12, > from ../../3rdparty/stout/include/stout/uuid.hpp:21, > from ../../include/mesos/type_utils.hpp:36, > from ../../src/master/flags.cpp:18: > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary > parentheses in declaration of ‘assert_arg’ [-Werror=parentheses] > 188 | failed (Pred:: > | ^ > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary > parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses] > 193 | failed (boost::mpl::not_:: > 
| ^ > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14, > from > ../3rdparty/boost-1.65.0/boost/range/iterator_range_core.hpp:27, > from ../3rdparty/boost-1.65.0/boost/lexical_cast.hpp:30, > from ../../3rdparty/stout/include/stout/numify.hpp:19, > from ../../3rdparty/stout/include/stout/duration.hpp:29, > from ../../3rdparty/libprocess/include/process/time.hpp:18, > from ../../3rdparty/libprocess/include/process/clock.hpp:18, > from ../../3rdparty/libprocess/include/process/future.hpp:29, > from > ../../include/mesos/authentication/secret_generator.hpp:22, > from ../../src/local/local.cpp:24: > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary > parentheses in declaration of ‘assert_arg’ [-Werror=parentheses] > 188 | failed (Pred:: > | ^ > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary > parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses] > 193 | failed (boost::mpl::not_:: > | ^ > In file included from > ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23, > from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25, > from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17, > from > ../3rdparty/boost-1.65.0/boost/iterator/iterator_adaptor.hpp:14, > from > ../3rdparty/boost-1.65.0/boost/iterator/indirect_iterator.hpp:11, > from ../../include/mesos/resources.hpp:27, > from ../../src/master/master.hpp:31, > from ../../src/master/framework.cpp:17: > ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary > parentheses in declaration of ‘assert_arg’ [-Werror=parentheses] > 188 | failed (Pred:: > |
[jira] [Commented] (MESOS-10216) Replicated log key encoding overflows into negative values
[ https://issues.apache.org/jira/browse/MESOS-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355896#comment-17355896 ] Charles Natali commented on MESOS-10216: Tough one, but I would err on the side of safety and not backport, since the bug has been present basically forever and realistically affects only a few users. > Replicated log key encoding overflows into negative values > -- > > Key: MESOS-10216 > URL: https://issues.apache.org/jira/browse/MESOS-10216 > Project: Mesos > Issue Type: Bug > Components: replicated log >Affects Versions: 1.7.3, 1.8.1, 1.9.1, 1.11.0, 1.10.1, 1.12.0 >Reporter: Ilya >Assignee: Charles Natali >Priority: Major > Fix For: 1.12.0 > > > LevelDB keys used by {{LevelDBStorage}} are {{uint64_t}} log positions > encoded as strings and padded with zeroes up to a certain fixed size. The > {{encode()}} function is incorrect because it uses the {{%d}} formatter that > expects an {{int}}. It also limits the key size to 10 digits which is OK for > {{UINT32_MAX}} but isn't enough for {{UINT64_MAX}}. > Because of this the available key range is reduced, and key overflow can > result in a replica's {{METADATA}} record (position 0) being overwritten, which > in turn may cause data loss. -- This message was sent by Atlassian Jira (v8.3.4#803005)
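The overflow described in the issue can be reproduced with a small Python sketch. It is not the Mesos code and not the actual fix: `encode_narrow` and `encode_wide` are hypothetical names, and the sketch assumes that passing a {{uint64_t}} to {{%d}} ends up formatting the low 32 bits as a signed {{int}}, which is what commonly happens in practice but is formally undefined behaviour in C and C++.

```python
import ctypes

INT32_MAX = 2**31 - 1
UINT64_MAX = 2**64 - 1


def encode_narrow(position: int) -> str:
    """Format a position as if through a signed 32-bit int, padded to 10 characters."""
    # Assumption: only the low 32 bits survive and are read as a signed int.
    as_int = ctypes.c_int32(position & 0xFFFFFFFF).value
    return "%010d" % as_int


def encode_wide(position: int) -> str:
    """Pad to 20 digits, which is enough for any unsigned 64-bit value."""
    if not 0 <= position <= UINT64_MAX:
        raise ValueError("position out of uint64 range")
    return "%020d" % position


if __name__ == "__main__":
    # Just past INT32_MAX the narrow key turns negative.
    print(encode_narrow(INT32_MAX + 1))
    # At 2**32 the narrow key wraps to the same key as position 0.
    print(encode_narrow(0), encode_narrow(2**32))
    # The wide encoding keeps every uint64 position distinct.
    print(encode_wide(UINT64_MAX))
```

Under these assumptions, position 2**32 maps to the same key as position 0, which matches the report that the {{METADATA}} record can be overwritten. A fixed 20-digit width also keeps lexicographic key order equal to numeric order across the whole {{uint64_t}} range.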
[jira] [Commented] (MESOS-10221) A large number of TASK_LOST causes the task to be unable to run
[ https://issues.apache.org/jira/browse/MESOS-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354096#comment-17354096 ] Charles Natali commented on MESOS-10221:
> In addition, according to the framework running log, the accept information is sent immediately after the offer is received, but the accept information in the master log is far behind the send offer, so is it that the accept has not been processed immediately, or is it that I have a wrong understanding of the time of the send offer.
Yeah, that looks suspicious. It would be good to have the full logs of the master and the framework so we can compare the timestamps of:
* the offer being sent by the master
* the offer being received by the framework
* the accept being sent by the framework
* the accept being received by the master
> A large number of TASK_LOST causes the task to be unable to run > --- > > Key: MESOS-10221 > URL: https://issues.apache.org/jira/browse/MESOS-10221 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.9.0, 1.11.0 > Environment: Ubuntu 16.04 >Reporter: clancyhuang >Priority: Major > > Recently, we found that the mesos master frequently generates Task lost > exceptions after task submission, and retrying in a short period of time is > not feasible, and it is becoming more and more frequent. 
> We selected two abnormal logs > {code:java} > I0528 15:09:55.367336 964 master.cpp:9579] Sending offers [ > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13236, > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] to framework > 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) > I0528 15:10:25.369561 969 master.cpp:11878] Removing offer > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 > I0528 15:10:43.383028 959 http.cpp:1436] HTTP POST for > /master/api/v1/scheduler from 10.118.28.66:50484 with > User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)' > I0528 15:10:43.383656 959 master.cpp:5434] Processing DECLINE call for > offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] for framework > 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) with 5 > seconds filter > I0528 15:10:03.385080 971 master.cpp:9579] Sending offers [ > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ] to framework > 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) > I0528 15:10:33.386322 972 master.cpp:11878] Removing offer > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 > I0528 15:10:57.181581 967 http.cpp:1436] HTTP POST for > /master/api/v1/scheduler from 10.118.28.66:50484 with > User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)' > W0528 15:10:57.183194 967 master.cpp:3959] Ignoring accept of offer > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 since it is no longer valid > W0528 15:10:57.183265 967 master.cpp:3964] ACCEPT call used invalid offers > '[ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ]': Offer > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid > I0528 15:10:57.184392 967 master.cpp:8212] Sending status update TASK_LOST > for task data_rename-ebad5d27-df72-4106-96ab-ba6432befba9 of framework > 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 'Task launched with invalid offers: > Offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid' > {code} > The following is a log of normal execution > {code:java} > I0528 15:17:03.690855 
959 master.cpp:9579] Sending offers [ > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529, > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13530 ] to framework > 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) > I0528 15:17:03.742848 970 http.cpp:1436] HTTP POST for > /master/api/v1/scheduler from 10.118.28.66:50484 with > User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)' > I0528 15:17:03.745221 970 master.cpp:4356] Processing ACCEPT call for > offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529 ] on agent > cbe540a8-c894-4655-a899-cec7463d00c9-S2 at slave(1)@ip:5053 (ip) for > framework 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) > I0528 15:17:03.745889 970 master.cpp:11878] Removing offer > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529 > {code} > We found that the offer was cancelled before the accept when the exception > occurred, and the interval is exactly the configured offer-timeout. Our > framework communicates with Mesos over HTTP, and I am sure that it sends the > accept message immediately after receiving the offer and that the request is > successful. > The question is why the master sometimes processes the accept message after > the offer has timed out. In addition, we tried increasing the offer-timeout, but > the problem was not resolved.
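The timestamp comparison asked for in the comment can be done mechanically. Below is a minimal Python sketch that parses the glog-style prefix of the master log lines quoted in this issue and computes the gaps between them. The helper names `glog_time` and `gap_seconds` are hypothetical, the log messages are shortened to their prefix, and the year is an assumption, since glog lines do not carry one.

```python
import re
from datetime import datetime

# Severity letter, MMDD, then HH:MM:SS.microseconds, as in "I0528 15:10:03.385080".
GLOG_PREFIX = re.compile(r"^[IWEF](\d{4}) (\d{2}:\d{2}:\d{2}\.\d{6})")


def glog_time(line: str, year: int = 2021) -> datetime:
    """Parse the timestamp at the start of a glog line (the year is assumed)."""
    match = GLOG_PREFIX.match(line)
    if match is None:
        raise ValueError("not a glog line: %r" % line)
    month_day, clock = match.groups()
    stamp = "%d%s %s" % (year, month_day, clock)
    return datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f")


def gap_seconds(earlier: str, later: str) -> float:
    """Seconds elapsed between two glog lines."""
    return (glog_time(later) - glog_time(earlier)).total_seconds()


# Prefixes of the three master log lines for offer ...-O13238 quoted above.
sent = "I0528 15:10:03.385080 971 master.cpp:9579] Sending offers"
removed = "I0528 15:10:33.386322 972 master.cpp:11878] Removing offer"
accept = "I0528 15:10:57.181581 967 http.cpp:1436] HTTP POST for"

if __name__ == "__main__":
    print("offer sent -> offer removed: %.3f s" % gap_seconds(sent, removed))
    print("offer removed -> accept received: %.3f s" % gap_seconds(removed, accept))
```

For these lines, the offer is removed roughly 30 seconds after it is sent, and the ACCEPT reaches the master roughly 24 seconds after the removal. Running the same comparison against the framework's own log would show on which side the delay occurs.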
[jira] [Commented] (MESOS-10221) A large number of TASK_LOST causes the task to be unable to run
[ https://issues.apache.org/jira/browse/MESOS-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17353855#comment-17353855 ] Charles Natali commented on MESOS-10221: Hey [~934341445], Receiving {{TASK_LOST}} upon a stale offer is perfectly fine and can occur in a normal and healthy cluster, and so should therefore be handled by the framework. Here's an annotated log: {noformat} # the master sends an offer to the framework I0528 15:10:03.385080 971 master.cpp:9579] Sending offers [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ] to framework 24b62b35-26d6-4a13-ba75- d84ce5fed64e-0005 (Test HTTP Framework) # the master removes the offer: from that point, it is not valid, and any task submitted against it will be rejected with TASK_LOST I0528 15:10:33.386322 972 master.cpp:11878] Removing offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 # here the master receives an ACCEPT from the framework using this offer, which isn't valid anymore I0528 15:10:57.181581 967 http.cpp:1436] HTTP POST for /master/api/v1/scheduler from 10.118.28.66:50484 with User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)' # and therefore rejects it W0528 15:10:57.183265 967 master.cpp:3964] ACCEPT call used invalid offers '[ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ]': Offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid I0528 15:10:57.184392 967 master.cpp:8212] Sending status update TASK_LOST for task data_rename-ebad5d27-df72-4106-96ab-ba6432befba9 of framework 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 'Task launched with invalid offers: Offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid' {noformat} However one thing which I notice from the above log is that there is a 24s gap between the master removing the offer (at 15:10:33.386322) and the framework trying to accept it (at 15:10:57.181581): normally, the master should have sent a {{RESCIND}} to the framework when the offer was removed (see 
[http://mesos.apache.org/documentation/latest/scheduler-http-api/#rescind]). Does your framework handle {{RESCIND}}? If not, such {{TASK_LOST}} rejections will be much more frequent than if it did. Also, do you know what triggered the removal of the offer? One common cause is an agent being disconnected: does that happen a lot in your cluster? As for what happened in this specific example, I'm surprised not to see more context in the log: did you filter out some lines? > A large number of TASK_LOST causes the task to be unable to run > --- > > Key: MESOS-10221 > URL: https://issues.apache.org/jira/browse/MESOS-10221 > Project: Mesos > Issue Type: Bug > Components: master >Affects Versions: 1.9.0, 1.11.0 > Environment: Ubuntu 16.04 >Reporter: clancyhuang >Priority: Major > > Recently, we found that the mesos master frequently generates Task lost > exceptions after task submission, and retrying in a short period of time is > not feasible, and it is becoming more and more frequent. 
> We selected two abnormal logs > {code:java} > I0528 15:09:55.367336 964 master.cpp:9579] Sending offers [ > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13236, > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] to framework > 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) > I0528 15:10:25.369561 969 master.cpp:11878] Removing offer > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 > I0528 15:10:43.383028 959 http.cpp:1436] HTTP POST for > /master/api/v1/scheduler from 10.118.28.66:50484 with > User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)' > I0528 15:10:43.383656 959 master.cpp:5434] Processing DECLINE call for > offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] for framework > 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) with 5 > seconds filter > I0528 15:10:03.385080 971 master.cpp:9579] Sending offers [ > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ] to framework > 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) > I0528 15:10:33.386322 972 master.cpp:11878] Removing offer > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 > I0528 15:10:57.181581 967 http.cpp:1436] HTTP POST for > /master/api/v1/scheduler from 10.118.28.66:50484 with > User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)' > W0528 15:10:57.183194 967 master.cpp:3959] Ignoring accept of offer > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 since it is no longer valid > W0528 15:10:57.183265 967 master.cpp:3964] ACCEPT call used invalid offers > '[ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ]': Offer > 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid > I0528 15:10:57.184392 967 master.cpp:8212] Sending status update TASK_LOST > for task data_rename-ebad5d27-df72-4106-96ab-ba6432befba9
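A framework can avoid most of these stale accepts by tracking which offers are still outstanding. The sketch below is illustrative only: {{OfferTracker}} and its methods are hypothetical names, and the event dictionaries assume the JSON shape of the v1 scheduler API {{OFFERS}} and {{RESCIND}} events. It is not taken from the reporter's framework.

```python
class OfferTracker:
    """Hypothetical helper: remembers which offers are still valid."""

    def __init__(self):
        self._offers = {}  # offer ID -> offer payload

    def handle_event(self, event):
        """Update the offer set from one decoded scheduler event (a dict)."""
        kind = event.get("type")
        if kind == "OFFERS":
            for offer in event["offers"]["offers"]:
                self._offers[offer["id"]["value"]] = offer
        elif kind == "RESCIND":
            # The master withdrew the offer: forget it so it is never accepted.
            self._offers.pop(event["rescind"]["offer_id"]["value"], None)

    def usable(self, offer_id):
        """True if the offer has been received and not rescinded or used."""
        return offer_id in self._offers

    def take(self, offer_id):
        """Remove and return the offer before ACCEPT/DECLINE, or None if gone."""
        return self._offers.pop(offer_id, None)
```

Even with this in place a race remains possible, since a {{RESCIND}} can still be in flight when the {{ACCEPT}} is sent, so {{TASK_LOST}} still has to be handled by the framework.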
[jira] [Commented] (MESOS-10196) The task program runs successfully but the task status is failed
[ https://issues.apache.org/jira/browse/MESOS-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17353705#comment-17353705 ] Charles Natali commented on MESOS-10196: Thanks [~934341445] for confirming - I'll close this ticket then. > The task program runs successfully but the task status is failed > - > > Key: MESOS-10196 > URL: https://issues.apache.org/jira/browse/MESOS-10196 > Project: Mesos > Issue Type: Bug > Components: executor >Affects Versions: 1.9.0, 1.10.0 > Environment: Ubuntu 16.04 > mesos master 1.10.0 > mesos slave 1.9.0 > python 3.7.3 >Reporter: clancyhuang >Priority: Major > > When testing mesos to execute the task by default executor, I found that the > task status is failed but in fact the task was executed successfully.I tested > two shell scripts, one is very simple > {code:sh} > python -V > /root/test.txt > {code} > ,The other is a script about image processing. > I am sure they are all working properly, but I get an > error:REASON_EXECUTOR_TERMINATED. 
> The stderr of the task has no output, and the stdout is correct,the mesos > agent has such log output > {code:bash} > I1104 11:34:35.337236 35682 slave.cpp:3657] Launching container > 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef for executor 'default' of framework > d915071b-c275-4321-afd5-134b86ebadf3-0002 > I1104 11:34:35.337371 35685 containerizer.cpp:1396] Starting container > 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef > I1104 11:34:35.337563 35685 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from STARTING to > PROVISIONING after 76800ns > I1104 11:34:35.338893 35685 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from PROVISIONING to > PREPARING after 1.321216ms > I1104 11:34:35.340224 35703 switchboard.cpp:316] Container logger module > finished preparing container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef; > IOSwitchboard server is not required > I1104 11:34:35.341944 35707 linux_launcher.cpp:492] Launching container > 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef and cloning with namespaces > I1104 11:34:35.346983 35704 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from PREPARING to ISOLATING > after 8.082944ms > I1104 11:34:35.347719 35704 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from ISOLATING to FETCHING > after 730880ns > I1104 11:34:35.348254 35737 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from FETCHING to RUNNING > after 539136ns > I1104 11:34:58.060906 35680 slave.cpp:7406] Current disk usage 73.86%. Max > allowed age: 1.130070981247558days > I1104 11:35:58.062266 35708 slave.cpp:7406] Current disk usage 73.86%. Max > allowed age: 1.129549109651991days > I1104 11:36:58.062948 35741 slave.cpp:7406] Current disk usage 73.87%. 
Max > allowed age: 1.129005310066273days > I1104 11:37:58.063513 35703 slave.cpp:7406] Current disk usage 73.88%. Max > allowed age: 1.128437717518472days > I1104 11:38:30.242969 35740 containerizer.cpp:3161] Container > 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef has exited > I1104 11:38:30.243052 35740 containerizer.cpp:2620] Destroying container > 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef in RUNNING state > I1104 11:38:30.243072 35740 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from RUNNING to DESTROYING > after 3.9149140821mins > I1104 11:38:30.243252 35672 linux_launcher.cpp:576] Asked to destroy > container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef > I1104 11:38:30.243350 35672 linux_launcher.cpp:618] Destroying cgroup > '/sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef' > I1104 11:38:30.243768 35679 cgroups.cpp:2854] Freezing cgroup > /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef > I1104 11:38:30.243961 35671 cgroups.cpp:1242] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef after > 110848ns > I1104 11:38:30.244160 35683 cgroups.cpp:2872] Thawing cgroup > /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef > I1104 11:38:30.244272 35683 cgroups.cpp:1271] Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef after > 67840ns > I1104 11:38:30.244668 35690 linux_launcher.cpp:650] Destroying cgroup > '/sys/fs/cgroup/systemd/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef' > I1104 11:38:30.245975 35726 slave.cpp:6856] Executor 'default' of framework > d915071b-c275-4321-afd5-134b86ebadf3-0002 exited with status 0 > I1104 11:38:30.246995 35726 slave.cpp:5737] Handling status update > TASK_FAILED (Status
[jira] [Commented] (MESOS-10216) Replicated log key encoding overflows into negative values
[ https://issues.apache.org/jira/browse/MESOS-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344916#comment-17344916 ] Charles Natali commented on MESOS-10216: OK I think the code in question is https://github.com/apache/mesos/blob/b8bfef6db158646df9fea6968bc75e88c32c3e21/src/log/leveldb.cpp#L101 The code indeed looks like it could suffer from overflow, however I'm not familiar with this part of the code base so I'll spend some time to understand exactly if it can be a problem in practice. > Replicated log key encoding overflows into negative values > -- > > Key: MESOS-10216 > URL: https://issues.apache.org/jira/browse/MESOS-10216 > Project: Mesos > Issue Type: Bug > Components: replicated log >Affects Versions: 1.11.0 >Reporter: Ilya >Priority: Major > > LevelDB keys used by {{LevelDBStorage}} are {{uint64_t}} log positions > encoded as strings and padded with zeroes up to a certain fixed size. The > {{encode()}} function is incorrect because it uses the {{%d}} formatter that > expects an {{int}}. It also limits the key size to 10 digits which is OK for > {{UINT32_MAX}} but isn't enough for {{UINT64_MAX}}. > Because of this the available key range is reduced, and key overflow can > result in replica's {{METADATA}} record (position 0) being overwritten, which > in turn may cause data loss. -- This message was sent by Atlassian Jira (v8.3.4#803005)
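To illustrate the failure mode described in the report, here is a small Python emulation. It assumes the {{uint64_t}} position ends up truncated to a 32-bit signed {{int}} by a 10-digit zero-padded {{%d}} (what typically happens on Linux/amd64, though formally undefined behaviour in C). {{encode_as_int}} and {{encode_fixed}} are hypothetical names, not the Mesos functions.

```python
import ctypes

UINT32_MAX = 2**32 - 1
UINT64_MAX = 2**64 - 1


def encode_as_int(position: int) -> str:
    """Emulate pushing a uint64_t through a 10-digit "%d" conversion.

    Assumption: only the low 32 bits survive and are read as a signed int.
    """
    as_int = ctypes.c_int32(position & 0xFFFFFFFF).value
    return "%010d" % as_int


def encode_fixed(position: int) -> str:
    """Hypothetical fix: pad to 20 digits, enough for UINT64_MAX."""
    return "%020d" % position


print(encode_as_int(1))        # small positions are fine
print(encode_as_int(2**31))    # wraps to a negative key
print(encode_as_int(2**32))    # collides with the key for position 0
print(encode_fixed(UINT64_MAX))
```

Under these assumptions a position of 2^31 encodes to a negative key, and 2^32 encodes to the same key as position 0, which matches the reported risk of overwriting the {{METADATA}} record.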
[jira] [Commented] (MESOS-10218) Mesos slave fails to connect after enabling ssl
[ https://issues.apache.org/jira/browse/MESOS-10218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344912#comment-17344912 ] Charles Natali commented on MESOS-10218: OK, then maybe fine to close [~apeters]? > Mesos slave fails to connect after enabling ssl > --- > > Key: MESOS-10218 > URL: https://issues.apache.org/jira/browse/MESOS-10218 > Project: Mesos > Issue Type: Bug > Components: agent >Affects Versions: 1.9.0 >Reporter: prasadkulkarni0711 >Priority: Major > > Mesos agent fails to connect to the master after setting the following > variables: > LIBPROCESS_SSL_ENABLED=1 > LIBPROCESS_SSL_KEY_FILE=/etc/mesos/conf/ssl/server.key > LIBPROCESS_SSL_CERT_FILE=/etc/mesos/conf/ssl/server.pem > LIBPROCESS_SSL_REQUIRE_CERT=false > LIBPROCESS_SSL_VERIFY_SERVER_CERT=false > LIBPROCESS_SSL_REQUIRE_CLIENT_CERT=false > LIBPROCESS_SSL_HOSTNAME_VALIDATION_SCHEME=openssl > LIBPROCESS_SSL_VERIFY_CERT=false > LIBPROCESS_SSL_CA_DIR=/etc/mesos/conf/ssl > LIBPROCESS_SSL_CA_FILE=/etc/mesos/conf/ssl/ca.pem > LIBPROCESS_SSL_SUPPORT_DOWNGRADE=false > LIBPROCESS_SSL_VERIFY_IPADD=false > #LIBPROCESS_SSL_ENABLE_TLS_V1_2=true > Error in logs: > Failed to accept socket: Failed accept: connection error: error:1407609C:SSL > routines:SSL23_GET_CLIENT_HELLO:http request > Connectivity works after setting: > LIBPROCESS_SSL_SUPPORT_DOWNGRADE=true > But then the sandbox fails to open in the web UI: > Potential reasons: > * The agent is not accessible > * The agent timed out or went offline > With the following error in the logs: > Failed to recv on socket 38 to peer 'unknown': Failed recv, connection error: > Connection reset by peer -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-9776) Mention removal of *.json endpoints in 1.8.0 CHANGELOG
[ https://issues.apache.org/jira/browse/MESOS-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344173#comment-17344173 ] Charles Natali commented on MESOS-9776: --- Since 1.8.0 was released a while ago this can probably be closed now. > Mention removal of *.json endpoints in 1.8.0 CHANGELOG > -- > > Key: MESOS-9776 > URL: https://issues.apache.org/jira/browse/MESOS-9776 > Project: Mesos > Issue Type: Improvement >Reporter: Benno Evers >Priority: Major > > We should mention in the CHANGELOG and update notes that the *.json that were > deprecated in Mesos 0.25 were actually removed in Mesos 1.8.0. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10216) Replicated log key encoding overflows into negative values
[ https://issues.apache.org/jira/browse/MESOS-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344104#comment-17344104 ] Charles Natali commented on MESOS-10216: [~ipronin] Any chance you could point at the offending code? > Replicated log key encoding overflows into negative values > -- > > Key: MESOS-10216 > URL: https://issues.apache.org/jira/browse/MESOS-10216 > Project: Mesos > Issue Type: Bug > Components: replicated log >Affects Versions: 1.11.0 >Reporter: Ilya >Priority: Major > > LevelDB keys used by {{LevelDBStorage}} are {{uint64_t}} log positions > encoded as strings and padded with zeroes up to a certain fixed size. The > {{encode()}} function is incorrect because it uses the {{%d}} formatter that > expects an {{int}}. It also limits the key size to 10 digits which is OK for > {{UINT32_MAX}} but isn't enough for {{UINT64_MAX}}. > Because of this the available key range is reduced, and key overflow can > result in replica's {{METADATA}} record (position 0) being overwritten, which > in turn may cause data loss. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10220) ldcache::parse failed to parse newer ld.so.cahce
[ https://issues.apache.org/jira/browse/MESOS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343587#comment-17343587 ] Charles Natali commented on MESOS-10220: {{ldcache::parse}} is used in the rootfs code (https://github.com/apache/mesos/blob/96339efb53f7cdf1126ead7755d2b83b435e3263/src/tests/containerizer/rootfs.cpp#L123) and the GPU isolator (https://github.com/apache/mesos/blob/96339efb53f7cdf1126ead7755d2b83b435e3263/src/slave/containerizer/mesos/isolators/gpu/volume.cpp#L368), so it would affect starting tasks but, AFAICT, not starting the master or agent. In any case it should be easy to fix; I'll look at it. > ldcache::parse failed to parse newer ld.so.cahce > > > Key: MESOS-10220 > URL: https://issues.apache.org/jira/browse/MESOS-10220 > Project: Mesos > Issue Type: Bug >Reporter: Minh H.G. >Assignee: Charles Natali >Priority: Minor > > In glibc 2.31, the ld.so.cache file no longer support old format (the one > start with "ld.so-1.7.0") > That cause ldcache::parse to fail and mesos cannot start. -- This message was sent by Atlassian Jira (v8.3.4#803005)
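For anyone wanting to check which format a given cache file uses, a quick sketch follows. The old-format magic {{ld.so-1.7.0}} comes from the report; the new-format magic {{glibc-ld.so.cache1.1}} is what I recall from the glibc headers and should be treated as an assumption, as should the helper names {{cache_format}} and {{describe}}.

```python
OLD_MAGIC = b"ld.so-1.7.0"           # quoted in the report
NEW_MAGIC = b"glibc-ld.so.cache1.1"  # assumed: glibc magic plus version "1.1"


def cache_format(data: bytes) -> str:
    """Classify the leading bytes of an ld.so.cache file."""
    if data.startswith(NEW_MAGIC):
        return "new"
    if data.startswith(OLD_MAGIC):
        return "old"
    return "unknown"


def describe(path: str = "/etc/ld.so.cache") -> str:
    """Read just enough of the file at `path` to classify it."""
    with open(path, "rb") as f:
        return cache_format(f.read(len(NEW_MAGIC)))
```

Calling {{describe()}} on a host that ran {{ldconfig -c new}} would be expected to return {{"new"}}, assuming the magic above is right.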
[jira] [Assigned] (MESOS-10220) ldcache::parse failed to parse newer ld.so.cahce
[ https://issues.apache.org/jira/browse/MESOS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Charles Natali reassigned MESOS-10220: -- Assignee: Charles Natali > ldcache::parse failed to parse newer ld.so.cahce > > > Key: MESOS-10220 > URL: https://issues.apache.org/jira/browse/MESOS-10220 > Project: Mesos > Issue Type: Bug >Reporter: Minh H.G. >Assignee: Charles Natali >Priority: Minor > > In glibc 2.31, the ld.so.cache file no longer support old format (the one > start with "ld.so-1.7.0") > That cause ldcache::parse to fail and mesos cannot start. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-10220) ldcache::parse failed to parse newer ld.so.cahce
[ https://issues.apache.org/jira/browse/MESOS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343570#comment-17343570 ] Charles Natali edited comment on MESOS-10220 at 5/12/21, 8:56 PM: -- And also it'd be great if you could attach your {{ld.so.cache}}, it'll be easier to test. Nevermind, the cache in the new format can be easily reproduced with {{ldconfig -c new}}. However both the master and agent seem to start fine with it, so it'd be really helpful to have a log if they fail to start. was (Author: cf.natali): And also it'd be great if you could attach your {{ld.so.cache}}, it'll be easier to test. > ldcache::parse failed to parse newer ld.so.cahce > > > Key: MESOS-10220 > URL: https://issues.apache.org/jira/browse/MESOS-10220 > Project: Mesos > Issue Type: Bug >Reporter: Minh H.G. >Priority: Minor > > In glibc 2.31, the ld.so.cache file no longer support old format (the one > start with "ld.so-1.7.0") > That cause ldcache::parse to fail and mesos cannot start. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10220) ldcache::parse failed to parse newer ld.so.cahce
[ https://issues.apache.org/jira/browse/MESOS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343570#comment-17343570 ] Charles Natali commented on MESOS-10220: And also it'd be great if you could attach your {{ld.so.cache}}, it'll be easier to test. > ldcache::parse failed to parse newer ld.so.cahce > > > Key: MESOS-10220 > URL: https://issues.apache.org/jira/browse/MESOS-10220 > Project: Mesos > Issue Type: Bug >Reporter: Minh H.G. >Priority: Minor > > In glibc 2.31, the ld.so.cache file no longer support old format (the one > start with "ld.so-1.7.0") > That cause ldcache::parse to fail and mesos cannot start. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10220) ldcache::parse failed to parse newer ld.so.cahce
[ https://issues.apache.org/jira/browse/MESOS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343555#comment-17343555 ] Charles Natali commented on MESOS-10220: Hey [~hgminh], Thanks for the report - would it be possible to attach a log of the agent or master when they fail to start? > ldcache::parse failed to parse newer ld.so.cahce > > > Key: MESOS-10220 > URL: https://issues.apache.org/jira/browse/MESOS-10220 > Project: Mesos > Issue Type: Bug >Reporter: Minh H.G. >Priority: Minor > > In glibc 2.31, the ld.so.cache file no longer support old format (the one > start with "ld.so-1.7.0") > That cause ldcache::parse to fail and mesos cannot start. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10105) Make tests of builds with -fsanitize=address/memory/undefined/thread pass.
[ https://issues.apache.org/jira/browse/MESOS-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322254#comment-17322254 ]

Charles Natali commented on MESOS-10105:

Regarding {{-fsanitize=address}}, leak detection can be disabled with {{ASAN_OPTIONS=detect_leaks=0}}.

However a couple of tests still fail:
{{
[ FAILED ] HTTPCommandExecutorTest.TerminateWithACK
[ FAILED ] PosixRLimitsIsolatorTest.UnsetLimits
[ FAILED ] MesosContainerizer/DefaultExecutorTest.KillTask/0, where GetParam() = "mesos"
[ FAILED ] MesosContainerizer/DefaultExecutorTest.CommitSuicideOnTaskFailure/0, where GetParam() = "mesos"
[ FAILED ] MesosContainerizer/DefaultExecutorTest.CommitSuicideOnKillTask/0, where GetParam() = "mesos"
[ FAILED ] MesosContainerizer/DefaultExecutorTest.MaxCompletionTime/0, where GetParam() = "mesos"
}}
All of them except {{PosixRLimitsIsolatorTest.UnsetLimits}} fail because they don't propagate the {{ASAN_OPTIONS}} environment variable.

> Make tests of builds with -fsanitize=address/memory/undefined/thread pass. > --- > > Key: MESOS-10105 > URL: https://issues.apache.org/jira/browse/MESOS-10105 > Project: Mesos > Issue Type: Wish >Reporter: Andrei Sekretenko >Priority: Critical > > As exemplified by various C++ projects and also by targeting specific issues > in Mesos (for example, MESOS-10102), running code built with clang sanitizers > helps with uncovering undefined behavior and data races. > Sanitizer adoption usually happens as a sequence of steps which unblock each > other: > 1) making local tests pass under sanitizer at least once > 2) making CI regularly run sanitizer builds (so that new sanitizable bugs are > not introduced and more bugs not triggered deterministically are uncovered) > 3) running high-level integration tests, betas, etc. 
with sanitizer builds > -- > (3) is definitely out of scope of this wish, and it is not clear if (2) will > fit into ASF CI, but (1) is definitely doable, and on its own can lead to > figuring out causes of mysterious rare bugs (which might turn out to be not > so rare under certain conditions). > -- > State of Mesos w.r.t sanitizers: > - as of Mar 2020, Mesos tests built with -fsanitize=address crash due to > several locations that leak one object per thread lifetime > - as of Nov 2019, libprocess tests were crashing thread sanitizer; IIRC, the > issues in libprocess on Linux/amd64 are also "technical", but probably could > result in a very real problems on a different platform -- This message was sent by Atlassian Jira (v8.3.4#803005)
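The propagation problem mentioned in the comment can be pictured with a small sketch. It assumes a child process is launched with an explicitly constructed environment, which is one way such a variable gets dropped; {{child_environment}} and the {{FORWARDED}} tuple are hypothetical, not Mesos code.

```python
import os

FORWARDED = ("ASAN_OPTIONS",)  # hypothetical list of variables to pass through


def child_environment(explicit, parent=None):
    """Return `explicit` plus any FORWARDED variables set in the parent.

    Values given explicitly win over inherited ones.
    """
    parent = os.environ if parent is None else parent
    env = dict(explicit)
    for name in FORWARDED:
        if name in parent and name not in env:
            env[name] = parent[name]
    return env
```

The resulting dictionary could then be passed as {{env=}} to {{subprocess.run}}, so that a child built with the sanitizer sees the same {{detect_leaks=0}} setting as the test binary.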
[jira] [Commented] (MESOS-10196) The task program runs successfully but the task status is failed
[ https://issues.apache.org/jira/browse/MESOS-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321176#comment-17321176 ] Charles Natali commented on MESOS-10196: Hey [~934341445], sorry for the delay. I know it's been a while, but in case it's still an issue, I think the next step would be to run the following command to see exactly what's going on - my guess is that the agent is maybe not starting the right executor or something like that: {code} strace -ttTf -p <AGENT_PID> -o agent.strace {code} (replace {{<AGENT_PID>}} with the PID of the agent process), then attach {{agent.strace}} together with the agent logs. It could also be useful to start the agent with the {{GLOG_v=9}} environment variable set to get detailed logs. > The task program runs successfully but the task status is failed > - > > Key: MESOS-10196 > URL: https://issues.apache.org/jira/browse/MESOS-10196 > Project: Mesos > Issue Type: Bug > Components: executor >Affects Versions: 1.9.0, 1.10.0 > Environment: Ubuntu 16.04 > mesos master 1.10.0 > mesos slave 1.9.0 > python 3.7.3 >Reporter: clancyhuang >Priority: Major > > When testing mesos to execute the task by default executor, I found that the > task status is failed but in fact the task was executed successfully.I tested > two shell scripts, one is very simple > {code:sh} > python -V > /root/test.txt > {code} > ,The other is a script about image processing. > I am sure they are all working properly, but I get an > error:REASON_EXECUTOR_TERMINATED. 
> The stderr of the task has no output, and the stdout is correct,the mesos > agent has such log output > {code:bash} > I1104 11:34:35.337236 35682 slave.cpp:3657] Launching container > 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef for executor 'default' of framework > d915071b-c275-4321-afd5-134b86ebadf3-0002 > I1104 11:34:35.337371 35685 containerizer.cpp:1396] Starting container > 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef > I1104 11:34:35.337563 35685 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from STARTING to > PROVISIONING after 76800ns > I1104 11:34:35.338893 35685 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from PROVISIONING to > PREPARING after 1.321216ms > I1104 11:34:35.340224 35703 switchboard.cpp:316] Container logger module > finished preparing container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef; > IOSwitchboard server is not required > I1104 11:34:35.341944 35707 linux_launcher.cpp:492] Launching container > 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef and cloning with namespaces > I1104 11:34:35.346983 35704 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from PREPARING to ISOLATING > after 8.082944ms > I1104 11:34:35.347719 35704 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from ISOLATING to FETCHING > after 730880ns > I1104 11:34:35.348254 35737 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from FETCHING to RUNNING > after 539136ns > I1104 11:34:58.060906 35680 slave.cpp:7406] Current disk usage 73.86%. Max > allowed age: 1.130070981247558days > I1104 11:35:58.062266 35708 slave.cpp:7406] Current disk usage 73.86%. Max > allowed age: 1.129549109651991days > I1104 11:36:58.062948 35741 slave.cpp:7406] Current disk usage 73.87%. 
Max > allowed age: 1.129005310066273days > I1104 11:37:58.063513 35703 slave.cpp:7406] Current disk usage 73.88%. Max > allowed age: 1.128437717518472days > I1104 11:38:30.242969 35740 containerizer.cpp:3161] Container > 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef has exited > I1104 11:38:30.243052 35740 containerizer.cpp:2620] Destroying container > 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef in RUNNING state > I1104 11:38:30.243072 35740 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from RUNNING to DESTROYING > after 3.9149140821mins > I1104 11:38:30.243252 35672 linux_launcher.cpp:576] Asked to destroy > container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef > I1104 11:38:30.243350 35672 linux_launcher.cpp:618] Destroying cgroup > '/sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef' > I1104 11:38:30.243768 35679 cgroups.cpp:2854] Freezing cgroup > /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef > I1104 11:38:30.243961 35671 cgroups.cpp:1242] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef after > 110848ns > I1104 11:38:30.244160 35683 cgroups.cpp:2872] Thawing cgroup > /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef > I1104 11:38:30.244272 35683 cgroups.cpp:1271] Successfully thawed cgroup >
[jira] [Commented] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"
[ https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321152#comment-17321152 ] Charles Natali commented on MESOS-10131: I think this could possibly happen without a loop in {{/proc/PID/mountinfo}} because reading from {{/proc/PID/mountinfo}} isn't atomic - definitely not if it can't be read in a single {{read}} syscall, which is very likely the case here since it's larger than 30K. Could explain why it happens randomly especially if there are many short-lived tasks being started. Since it didn't re-occur and the potential fix for it would be far from trivial, probably time to close. > Agent frequently dies with error "Cycle found in mount table hierarchy" > --- > > Key: MESOS-10131 > URL: https://issues.apache.org/jira/browse/MESOS-10131 > Project: Mesos > Issue Type: Bug > Components: agent, framework >Affects Versions: 1.9.0 >Reporter: Thomas Plummer >Assignee: Andrei Budnik >Priority: Major > Attachments: log.txt > > > Our mesos agent frequently dies with the follow error in the slave logs: > > {code:java} > F0509 22:10:33.036993 17723 fs.cpp:217] Check failed: > !visitedParents.contains(parentId) Cycle found in mount table hierarchy at > entry '1954': > 18 41 0:18 / /sys rw,nosuid,nodev,noexec,relatime shared:6 - sysfs sysfs > rw,seclabel > 19 41 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw > 20 41 0:5 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs > rw,seclabel,size=65852208k,nr_inodes=16463052,mode=755 > 21 18 0:17 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:7 - > securityfs securityfs rw > 22 20 0:19 / /dev/shm rw,nosuid,nodev,noexec shared:3 - tmpfs tmpfs > rw,seclabel > 23 20 0:12 / /dev/pts rw,nosuid,noexec,relatime shared:4 - devpts devpts > rw,seclabel,gid=5,mode=620,ptmxmode=000 > 24 41 0:20 / /run rw,nosuid,nodev shared:24 - tmpfs tmpfs rw,seclabel,mode=755 > 25 18 0:21 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:8 - tmpfs tmpfs > 
ro,seclabel,mode=755 > 26 25 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:9 > - cgroup cgroup > rw,seclabel,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd > 27 18 0:23 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:20 - > pstore pstore rw > 28 18 0:24 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime > shared:21 - efivarfs efivarfs rw > 29 25 0:25 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime > shared:10 - cgroup cgroup rw,seclabel,perf_event > 30 25 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime > shared:11 - cgroup cgroup rw,seclabel,net_prio,net_cls > 31 25 0:27 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:12 > - cgroup cgroup rw,seclabel,cpuset > 32 25 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:13 - > cgroup cgroup rw,seclabel,blkio > 33 25 0:29 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:14 > - cgroup cgroup rw,seclabel,freezer > 34 25 0:30 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:15 > - cgroup cgroup rw,seclabel,hugetlb > 35 25 0:31 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 > - cgroup cgroup rw,seclabel,devices > 36 25 0:32 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime > shared:17 - cgroup cgroup rw,seclabel,cpuacct,cpu > 37 25 0:33 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:18 > - cgroup cgroup rw,seclabel,memory > 38 25 0:34 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - > cgroup cgroup rw,seclabel,pids > 39 18 0:35 / /sys/kernel/config rw,relatime shared:22 - configfs configfs rw > 41 0 253:0 / / rw,relatime shared:1 - xfs /dev/mapper/vg_system-root > rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota > 42 18 0:16 / /sys/fs/selinux rw,relatime shared:23 - selinuxfs selinuxfs rw > 43 19 0:37 / /proc/sys/fs/binfmt_misc rw,relatime shared:25 - autofs > systemd-1 > 
rw,fd=32,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11414 > 44 18 0:6 / /sys/kernel/debug rw,relatime shared:26 - debugfs debugfs rw > 45 20 0:15 / /dev/mqueue rw,relatime shared:27 - mqueue mqueue rw,seclabel > 46 20 0:38 / /dev/hugepages rw,relatime shared:28 - hugetlbfs hugetlbfs > rw,seclabel > 47 41 8:2 / /boot rw,relatime shared:29 - xfs /dev/sda2 > rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota > 48 47 8:1 / /boot/efi rw,relatime shared:30 - vfat /dev/sda1 > rw,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro > 49 41 253:2 / /var rw,relatime shared:31 - xfs /dev/mapper/vg_system-var > rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota > 50
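The failed check is about a parent ID being visited twice. A rough way to look for such a cycle in a captured table is sketched below, assuming only that the first two fields of each raw {{mountinfo}} line are the mount ID and the parent ID, as in the table quoted above. {{parse_parents}} and {{find_cycle}} are hypothetical names and this is not the Mesos implementation.

```python
def parse_parents(mountinfo_text):
    """Map mount ID -> parent ID from the first two fields of each line."""
    parents = {}
    for line in mountinfo_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit() and fields[1].isdigit():
            parents[int(fields[0])] = int(fields[1])
    return parents


def find_cycle(parents):
    """Return a mount ID that sits on a parent cycle, or None."""
    for start in parents:
        seen = set()
        current = start
        # Walk up until we leave the table (the root's parent is not listed).
        while current in parents:
            if current in seen:
                return current
            seen.add(current)
            current = parents[current]
    return None
```

Running this on the raw contents of {{/proc/PID/mountinfo}} (not the wrapped, quoted copy in this email) would show whether the table itself contains a loop. A torn read that mixes two snapshots of the table, as suggested in the comment, could produce a cycle here even when no single snapshot has one.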
[jira] [Commented] (MESOS-10216) Replicated log key encoding overflows into negative values
[ https://issues.apache.org/jira/browse/MESOS-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321094#comment-17321094 ] Charles Natali commented on MESOS-10216: Yes that'd be really interesting, from memory there's a once in a blue moon bug involving leveldb corruption which could potentially be explained by this. > Replicated log key encoding overflows into negative values > -- > > Key: MESOS-10216 > URL: https://issues.apache.org/jira/browse/MESOS-10216 > Project: Mesos > Issue Type: Bug > Components: replicated log >Affects Versions: 1.11.0 >Reporter: Ilya >Priority: Major > > LevelDB keys used by {{LevelDBStorage}} are {{uint64_t}} log positions > encoded as strings and padded with zeroes up to a certain fixed size. The > {{encode()}} function is incorrect because it uses the {{%d}} formatter that > expects an {{int}}. It also limits the key size to 10 digits which is OK for > {{UINT32_MAX}} but isn't enough for {{UINT64_MAX}}. > Because of this the available key range is reduced, and key overflow can > result in replica's {{METADATA}} record (position 0) being overwritten, which > in turn may cause data loss. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10210) Bundled grpc doesn't compile with glibc 2.30+
[ https://issues.apache.org/jira/browse/MESOS-10210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271740#comment-17271740 ] Charles Natali commented on MESOS-10210: Merged by [~bmahler] so can be closed. > Bundled grpc doesn't compile with glibc 2.30+ > - > > Key: MESOS-10210 > URL: https://issues.apache.org/jira/browse/MESOS-10210 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.11.0 >Reporter: Omer Ozarslan >Priority: Minor > > Bundled grpc failed to link with glibc 2.31 since starting with 2.30 glibc > declares its own gettid function with same signature. Cherry picking two > commits from below two PRs from the upstream fixes the issue: > * [https://github.com/grpc/grpc/pull/20048] > * [https://github.com/grpc/grpc/pull/18950] -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10146) Removing task from slave when framework is disconnected causes master to crash
[ https://issues.apache.org/jira/browse/MESOS-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271739#comment-17271739 ] Charles Natali commented on MESOS-10146: Looking at the 1.9.0 code I think I found what caused this, however looking at master I believe it's been fixed by this commit: [https://github.com/apache/mesos/commit/6be17200b8084ad3524e7d450c411765b3214c0f] for this issue: [https://issues.apache.org/jira/projects/MESOS/issues/MESOS-9609|https://issues.apache.org/jira/projects/MESOS/issues/MESOS-9609?filter=allissues] So I think this can be closed as a duplicate of MESOS-9609. > Removing task from slave when framework is disconnected causes master to crash > -- > > Key: MESOS-10146 > URL: https://issues.apache.org/jira/browse/MESOS-10146 > Project: Mesos > Issue Type: Bug > Components: c++ api, framework >Affects Versions: 1.9.0 > Environment: Mesos master with three master nodes >Reporter: Naveen >Priority: Blocker > > Hello, > we want to report an issue we observed when removing tasks from a slave. > There is a check for a valid framework before tasks can be removed. > There can be several reasons for a framework to be disconnected. When it is, this check fails > and crashes the Mesos master node. > [https://github.com/apache/mesos/blob/1.9.0/src/master/master.cpp#L11842] > There is also an unguarded access to the internal framework state on line 11853. 
> Error logs - > {noformat} > mesos-master[5483]: I0618 14:05:20.859189 5491 master.cpp:9512] Marked agent > 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 (10.160.73.79) unreachable: health > check timed out > mesos-master[5483]: F0618 14:05:20.859347 5491 master.cpp:11842] Check > failed: framework != nullptr Framework > 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067 not found while removing agent > 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 at slave(1)@10.160.73.79:5051 > (10.160.73.79); agent tasks: { 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067: { } > } > mesos-master[5483]: *** Check failure stack trace: *** > mesos-master[5483]: I0618 14:05:20.859781 5490 hierarchical.cpp:1013] Removed > all filters for agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 > mesos-master[5483]: I0618 14:05:20.872217 5490 hierarchical.cpp:890] Removed > agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 > mesos-master[5483]: I0618 14:05:20.859922 5487 replica.cpp:695] Replica > received learned notice for position 42070 from > log-network(1)@10.160.73.212:5050 > mesos-master[5483]: @ 0x7f2fdf6a5b1d google::LogMessage::Fail() > mesos-master[5483]: @ 0x7f2fdf6a7dfd google::LogMessage::SendToLog() > mesos-master[5483]: @ 0x7f2fdf6a56ab google::LogMessage::Flush() > mesos-master[5483]: @ 0x7f2fdf6a8859 > google::LogMessageFatal::~LogMessageFatal() > mesos-master[5483]: @ 0x7f2fde2677f2 > mesos::internal::master::Master::__removeSlave() > mesos-master[5483]: @ 0x7f2fde267ebe > mesos::internal::master::Master::_markUnreachable() > mesos-master[5483]: @ 0x7f2fde268215 > _ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKSsEUlbE_JbclEv > mesos-master[5483]: @ 0x7f2fddf30688 > 
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEclEOS3_ > mesos-master[5483]: @ 0x7f2fdf5e3b91 process::ProcessBase::consume() > mesos-master[5483]: @ 0x7f2fdf608f77 process::ProcessManager::resume() > mesos-master[5483]: @ 0x7f2fdf60cb36 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv > mesos-master[5483]: @ 0x7f2fdf8c34d0 execute_native_thread_routine > mesos-master[5483]: @ 0x7f2fdba02ea5 start_thread > mesos-master[5483]: @ 0x7f2fdb20e8dd __clone > systemd[1]: mesos-master.service: main process exited, code=killed, > status=6/ABRT > systemd[1]: Unit mesos-master.service entered failed state. > systemd[1]: mesos-master.service failed. > systemd[1]: mesos-master.service holdoff time over, scheduling restart. > systemd[1]: Stopped Mesos Master. > systemd[1]: Started Mesos Master. > mesos-master[28757]: I0618 14:05:41.461403 28748 logging.cpp:201] INFO level > logging started! > mesos-master[28757]: I0618 14:05:41.461712 28748 main.cpp:243] Build: > 2020-05-09 10:42:00 by centos > mesos-master[28757]: I0618 14:05:41.461721 28748 main.cpp:244] Version: 1.9.0 > mesos-master[28757]: I0618 14:05:41.461726 28748 main.cpp:247] Git tag: 1.9.0 > mesos-master[28757]: I0618 14:05:41.461730 28748 main.cpp:251] Git SHA: > 5e79a584e6ec3e9e2f96e8bf418411df9dafac2e{noformat} > --
[jira] [Commented] (MESOS-10196) The task program runs successfully but the task status is failed
[ https://issues.apache.org/jira/browse/MESOS-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17235049#comment-17235049 ] Charles Natali commented on MESOS-10196: It's surprising there's no log of the executor registering and sending any task update - do you have the executor's log? Also how do you start the tasks, do you use your own framework code? > The task program runs successfully but the task status is failed > - > > Key: MESOS-10196 > URL: https://issues.apache.org/jira/browse/MESOS-10196 > Project: Mesos > Issue Type: Bug > Components: executor >Affects Versions: 1.9.0, 1.10.0 > Environment: Ubuntu 16.04 > mesos master 1.10.0 > mesos slave 1.9.0 > python 3.7.3 >Reporter: clancyhuang >Priority: Major > > When testing Mesos by running a task with the default executor, I found that the > task status is failed although the task in fact executed successfully. I tested > two shell scripts. One is very simple: > {code:sh} > python -V > /root/test.txt > {code} > The other is an image-processing script. > I am sure both work properly, but I get an > error: REASON_EXECUTOR_TERMINATED. 
> The stderr of the task has no output, and the stdout is correct,the mesos > agent has such log output > {code:bash} > I1104 11:34:35.337236 35682 slave.cpp:3657] Launching container > 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef for executor 'default' of framework > d915071b-c275-4321-afd5-134b86ebadf3-0002 > I1104 11:34:35.337371 35685 containerizer.cpp:1396] Starting container > 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef > I1104 11:34:35.337563 35685 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from STARTING to > PROVISIONING after 76800ns > I1104 11:34:35.338893 35685 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from PROVISIONING to > PREPARING after 1.321216ms > I1104 11:34:35.340224 35703 switchboard.cpp:316] Container logger module > finished preparing container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef; > IOSwitchboard server is not required > I1104 11:34:35.341944 35707 linux_launcher.cpp:492] Launching container > 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef and cloning with namespaces > I1104 11:34:35.346983 35704 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from PREPARING to ISOLATING > after 8.082944ms > I1104 11:34:35.347719 35704 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from ISOLATING to FETCHING > after 730880ns > I1104 11:34:35.348254 35737 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from FETCHING to RUNNING > after 539136ns > I1104 11:34:58.060906 35680 slave.cpp:7406] Current disk usage 73.86%. Max > allowed age: 1.130070981247558days > I1104 11:35:58.062266 35708 slave.cpp:7406] Current disk usage 73.86%. Max > allowed age: 1.129549109651991days > I1104 11:36:58.062948 35741 slave.cpp:7406] Current disk usage 73.87%. 
Max > allowed age: 1.129005310066273days > I1104 11:37:58.063513 35703 slave.cpp:7406] Current disk usage 73.88%. Max > allowed age: 1.128437717518472days > I1104 11:38:30.242969 35740 containerizer.cpp:3161] Container > 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef has exited > I1104 11:38:30.243052 35740 containerizer.cpp:2620] Destroying container > 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef in RUNNING state > I1104 11:38:30.243072 35740 containerizer.cpp:3323] Transitioning the state > of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from RUNNING to DESTROYING > after 3.9149140821mins > I1104 11:38:30.243252 35672 linux_launcher.cpp:576] Asked to destroy > container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef > I1104 11:38:30.243350 35672 linux_launcher.cpp:618] Destroying cgroup > '/sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef' > I1104 11:38:30.243768 35679 cgroups.cpp:2854] Freezing cgroup > /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef > I1104 11:38:30.243961 35671 cgroups.cpp:1242] Successfully froze cgroup > /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef after > 110848ns > I1104 11:38:30.244160 35683 cgroups.cpp:2872] Thawing cgroup > /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef > I1104 11:38:30.244272 35683 cgroups.cpp:1271] Successfully thawed cgroup > /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef after > 67840ns > I1104 11:38:30.244668 35690 linux_launcher.cpp:650] Destroying cgroup > '/sys/fs/cgroup/systemd/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef' > I1104 11:38:30.245975 35726 slave.cpp:6856] Executor 'default' of framework >
[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.
[ https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101162#comment-17101162 ] Charles Natali commented on MESOS-8038: --- The more I think about it the more I think that the current behavior of optimistically releasing the resources is very sub-optimal. We've had cgroup destruction fail for various reasons in our cluster: * kernel bugs - see https://issues.apache.org/jira/browse/MESOS-10107 * tasks stuck in uninterruptible sleep, e.g. NFS I/O When this happens, it triggers at least the following problems: * this issue with GPUs, which causes all subsequent tasks scheduled on the host and trying to use the GPU to fail, effectively turning the host into a black hole * another problem where some tasks stuck in uninterruptible sleep were still consuming memory, so the agent overcommitted memory, causing tasks to run out of memory further down the line "Leaking" CPU is mostly fine because it's a compressible resource and stuck tasks generally don't use it, but it's pretty bad for memory and GPU, causing errors which are hard to diagnose and automatically recover from. > Launching GPU task sporadically fails. > -- > > Key: MESOS-8038 > URL: https://issues.apache.org/jira/browse/MESOS-8038 > Project: Mesos > Issue Type: Bug > Components: containerization, gpu >Affects Versions: 1.4.0 >Reporter: Sai Teja Ranuva >Assignee: Zhitao Li >Priority: Critical > Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, > mesos-slave.INFO.log, mesos_agent.log, start_short_tasks_gpu.py > > > I was running a job which uses GPUs. It runs fine most of the time. > But occasionally I see the following message in the mesos log. > "Collect failed: Requested 1 but only 0 available" > Followed by executor getting killed and the tasks getting lost. This happens > even before the job starts. A little search in the code base points me to > something related to GPU resource being the probable cause. 
> There is no deterministic way that this can be reproduced. It happens > occasionally. > I have attached the slave log for the issue. > Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave. -- This message was sent by Atlassian Jira (v8.3.4#803005)
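The accounting problem raised in the MESOS-8038 comment above (resources released optimistically while container destruction is still pending or has failed) can be illustrated with a toy model. This is a sketch only: the `Agent` class, its fields and its method names are hypothetical and do not correspond to Mesos code. The one thing taken from the thread is the error message format.

```python
from dataclasses import dataclass


@dataclass
class Agent:
    """Toy model of GPU accounting on one agent (hypothetical, not Mesos code)."""

    total_gpus: int
    optimistic: bool      # True: advertise GPUs as soon as the task finishes
    running: int = 0      # GPUs held by running containers
    destroying: int = 0   # GPUs held by containers whose teardown has not completed

    def advertised(self) -> int:
        # What the agent tells the scheduler is available.
        held = self.running if self.optimistic else self.running + self.destroying
        return self.total_gpus - held

    def actually_free(self) -> int:
        # What the GPU isolator can really hand out.
        return self.total_gpus - self.running - self.destroying

    def launch(self, gpus: int) -> None:
        free = self.actually_free()
        if gpus > free:
            raise RuntimeError(f"Requested {gpus} gpus but only {free} available")
        self.running += gpus

    def task_finished(self, gpus: int) -> None:
        # The task is done, but its container (cgroup) still has to be destroyed.
        self.running -= gpus
        self.destroying += gpus

    def destroy_completed(self, gpus: int) -> None:
        self.destroying -= gpus


if __name__ == "__main__":
    agent = Agent(total_gpus=1, optimistic=True)
    agent.launch(1)
    agent.task_finished(1)  # teardown is slow or stuck
    print(agent.advertised(), agent.actually_free())
```

With `optimistic=True` the agent advertises one GPU while none can be handed out, so every launch offered that GPU fails until teardown completes; if teardown never completes, the host behaves like the "black hole" described in the comment. With `optimistic=False` the GPU is simply not advertised in the meantime.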
[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.
[ https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090969#comment-17090969 ] Charles Natali commented on MESOS-8038: --- See log attached ([^mesos_agent.log]). See my interpretation below, keeping in mind that I'm not familiar with the code so might be completely wrong :). Before the error occurs, we can see the following warning in the agent log: {noformat} W0423 22:46:19.277667 20524 containerizer.cpp:2428] Ignoring update for currently being destroyed container 6f446173-2bba-4cc4-bc15-c956bc159d4e {noformat} Looking at the logs, we can see that compared to a successful run, the containerizer's update method is called while the container is being destroyed. Example of a successful task - the slave receives the task status update, forwards it, and then sends back the acknowledgement which cause sthe executor to exit, and the container to be destroyed, therefore after the task status update has been processed: {noformat} I0423 22:43:55.771867 20519 slave.cpp:5950] Handling status update TASK_FINISHED (Status UUID: e135ba0c-6dde-4cce-a1cd-42ea7ba86df5) for task task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework 0142aec2-d0c1-4011-8340-d81107d40fce- from executor(1)@127.0.1.1:35471 I0423 22:43:55.787933 20518 memory.cpp:287] Updated 'memory.soft_limit_in_bytes' to 32MB for container 5a7984a6-cecb-40ea-843c-8ed28cd92330 I0423 22:43:55.788524 20523 cpu.cpp:94] Updated 'cpu.shares' to 102 (cpus 0.1) for container 5a7984a6-cecb-40ea-843c-8ed28cd92330 I0423 22:43:55.794132 20522 task_status_update_manager.cpp:328] Received task status update TASK_FINISHED (Status UUID: e135ba0c-6dde-4cce-a1cd-42ea7ba86df5) for task task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework 0142aec2-d0c1-4011-8340-d81107d40fce- I0423 22:43:55.794495 20522 task_status_update_manager.cpp:383] Forwarding task status update TASK_FINISHED (Status UUID: e135ba0c-6dde-4cce-a1cd-42ea7ba86df5) for task 
task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework 0142aec2-d0c1-4011-8340-d81107d40fce- to the agent I0423 22:43:55.795053 20522 slave.cpp:6496] Forwarding the update TASK_FINISHED (Status UUID: e135ba0c-6dde-4cce-a1cd-42ea7ba86df5) for task task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework 0142aec2-d0c1-4011-8340-d81107d40fce- to master@127.0.0.1:5050 I0423 22:43:55.812129 20522 slave.cpp:6380] Task status update manager successfully handled status update TASK_FINISHED (Status UUID: e135ba0c-6dde-4cce-a1cd-42ea7ba86df5) for task task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework 0142aec2-d0c1-4011-8340-d81107d40fce- I0423 22:43:55.813238 20522 slave.cpp:6407] Sending acknowledgement for status update TASK_FINISHED (Status UUID: e135ba0c-6dde-4cce-a1cd-42ea7ba86df5) for task task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework 0142aec2-d0c1-4011-8340-d81107d40fce- to executor(1)@127.0.1.1:35471 [...] I0423 22:43:57.005844 20521 slave.cpp:6676] Got exited event for executor(1)@127.0.1.1:35471 I0423 22:43:57.205157 20522 containerizer.cpp:3159] Container 5a7984a6-cecb-40ea-843c-8ed28cd92330 has exited I0423 22:43:57.205278 20522 containerizer.cpp:2623] Destroying container 5a7984a6-cecb-40ea-843c-8ed28cd92330 in RUNNING state I0423 22:43:57.205379 20522 containerizer.cpp:3321] Transitioning the state of container 5a7984a6-cecb-40ea-843c-8ed28cd92330 from RUNNING to DESTROYING after 4.612041984secs I0423 22:43:57.206100 20523 linux_launcher.cpp:564] Asked to destroy container 5a7984a6-cecb-40ea-843c-8ed28cd92330 {noformat} Now let's look at what happens when the task right before the task which fails with "Requested 1 gpus but only 0 available" finishes: {noformat} I0423 22:46:16.506460 20519 slave.cpp:5950] Handling status update TASK_FINISHED (Status UUID: 4b7a01c5-15af-47a3-b06b-5ed8f7d65405) for task task-650af3bd-3f5b-4e17-9d34-4642480b4da0 of framework 0142aec2-d0c1-4011-8340-d81107d40fce- from executor(1)@127.0.1.1:36541 I0423 22:46:17.560580 
20521 slave.cpp:6676] Got exited event for executor(1)@127.0.1.1:36541 I0423 22:46:18.701063 20523 linux_launcher.cpp:638] Destroying cgroup '/sys/fs/cgroup/systemd/mesos/8a4e52e5-eab6-43e7-8bd1-9b9248614e69' I0423 22:46:19.236407 20525 slave.cpp:7076] Executor 'task-376cdda6-760b-4d3b-ad7f-2d86916695a3' of framework 0142aec2-d0c1-4011-8340-d81107d40fce- exited with status 0 I0423 22:46:19.237376 20525 slave.cpp:7187] Cleaning up executor 'task-376cdda6-760b-4d3b-ad7f-2d86916695a3' of framework 0142aec2-d0c1-4011-8340-d81107d40fce- at executor(1)@127.0.1.1:41227 I0423 22:46:19.241185 20522 gc.cpp:95] Scheduling '/tmp/mesos_agent/work/slaves/0142aec2-d0c1-4011-8340-d81107d40fce-S0/frameworks/0142aec2-d0c1-4011-8340-d81107d40fce-/executors/task-376cdda6-760b-4d3b-ad7f-2d86916695a3/runs/0c718138-d6f3-42d4-9de7-4dac7d518dc5' for gc 9.9959984512mins in the future I0423
[jira] [Comment Edited] (MESOS-8038) Launching GPU task sporadically fails.
[ https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089005#comment-17089005 ] Charles Natali edited comment on MESOS-8038 at 4/21/20, 7:46 PM: - [~bmahler] I have a way to reproduce it systematically, albeit very contrived: using syscall fault injection. Basically I just continuously start tasks allocating 1 GPU and just do "exit 0" (see attach python framework). Then, I run the following - inject a few seconds delay in all rmdir syscalls made by the agent: {noformat} # strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o /dev/null {noformat} After less than a minute, tasks start failing with this error: {noformat} Failed to launch container: Requested 1 gpus but only 0 available{noformat} I'll try to see if I can find a simpler reproducer, but this seems to fail systematically for me. was (Author: cf.natali): [~bmahler] I have a way to reproduce it systematically, albeit very contrived: using syscall fault injection. Basically I just continuously start tasks allocating 1 GPU and just do "exit 0" (see attach python framework). Then, I run the following - inject a few seconds delay in all rmdir syscalls made by the agent: {noformat} # strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o /dev/null {noformat} After a few minutes, tasks start failing with this error: {noformat} Failed to launch container: Requested 1 gpus but only 0 available{noformat} I'll try to see if I can find a simpler reproducer, but this to fail systematically for me. > Launching GPU task sporadically fails. 
> -- > > Key: MESOS-8038 > URL: https://issues.apache.org/jira/browse/MESOS-8038 > Project: Mesos > Issue Type: Bug > Components: containerization, gpu >Affects Versions: 1.4.0 >Reporter: Sai Teja Ranuva >Assignee: Zhitao Li >Priority: Critical > Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, > mesos-slave.INFO.log, start_short_tasks_gpu.py > > > I was running a job which uses GPUs. It runs fine most of the time. > But occasionally I see the following message in the mesos log. > "Collect failed: Requested 1 but only 0 available" > Followed by executor getting killed and the tasks getting lost. This happens > even before the the job starts. A little search in the code base points me to > something related to GPU resource being the probable cause. > There is no deterministic way that this can be reproduced. It happens > occasionally. > I have attached the slave log for the issue. > Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (MESOS-8038) Launching GPU task sporadically fails.
[ https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089005#comment-17089005 ] Charles Natali edited comment on MESOS-8038 at 4/21/20, 7:32 PM: - [~bmahler] I have a way to reproduce it systematically, albeit very contrived: using syscall fault injection. Basically I just continuously start tasks allocating 1 GPU and just do "exit 0" (see attach python framework). Then, I run the following - inject a few seconds delay in all rmdir syscalls made by the agent: {noformat} # strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o /dev/null {noformat} After a few minutes, tasks start failing with this error: {noformat} Failed to launch container: Requested 1 gpus but only 0 available{noformat} I'll try to see if I can find a simpler reproducer, but this to fail systematically for me. was (Author: cf.natali): [~bmahler] I have a way to reproduce it systematically, albeit very contrived: using syscall fault injection. Basically I just continuously start tasks while allocate 1 GPU and just do "exit 0" (see attach python framework). Then, I run the following - inject a few seconds delay in all rmdir syscalls made by the agent: {noformat} # strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o /dev/null {noformat} After a few minutes, tasks start failing with this error: Failed to launch container: Requested 1 gpus but only 0 available I'll try to see if I can find a simpler reproducer, but this to fail systematically for me. > Launching GPU task sporadically fails. > -- > > Key: MESOS-8038 > URL: https://issues.apache.org/jira/browse/MESOS-8038 > Project: Mesos > Issue Type: Bug > Components: containerization, gpu >Affects Versions: 1.4.0 >Reporter: Sai Teja Ranuva >Assignee: Zhitao Li >Priority: Critical > Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, > mesos-slave.INFO.log, start_short_tasks_gpu.py > > > I was running a job which uses GPUs. 
It runs fine most of the time. > But occasionally I see the following message in the mesos log. > "Collect failed: Requested 1 but only 0 available" > Followed by executor getting killed and the tasks getting lost. This happens > even before the the job starts. A little search in the code base points me to > something related to GPU resource being the probable cause. > There is no deterministic way that this can be reproduced. It happens > occasionally. > I have attached the slave log for the issue. > Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.
[ https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089005#comment-17089005 ] Charles Natali commented on MESOS-8038: --- [~bmahler] I have a way to reproduce it systematically, albeit very contrived: using syscall fault injection. Basically I just continuously start tasks while allocate 1 GPU and just do "exit 0" (see attach python framework). Then, I run the following - inject a few seconds delay in all rmdir syscalls made by the agent: {noformat} # strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o /dev/null {noformat} After a few minutes, tasks start failing with this error: Failed to launch container: Requested 1 gpus but only 0 available I'll try to see if I can find a simpler reproducer, but this to fail systematically for me. > Launching GPU task sporadically fails. > -- > > Key: MESOS-8038 > URL: https://issues.apache.org/jira/browse/MESOS-8038 > Project: Mesos > Issue Type: Bug > Components: containerization, gpu >Affects Versions: 1.4.0 >Reporter: Sai Teja Ranuva >Assignee: Zhitao Li >Priority: Critical > Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, > mesos-slave.INFO.log, start_short_tasks_gpu.py > > > I was running a job which uses GPUs. It runs fine most of the time. > But occasionally I see the following message in the mesos log. > "Collect failed: Requested 1 but only 0 available" > Followed by executor getting killed and the tasks getting lost. This happens > even before the the job starts. A little search in the code base points me to > something related to GPU resource being the probable cause. > There is no deterministic way that this can be reproduced. It happens > occasionally. > I have attached the slave log for the issue. > Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (MESOS-10119) failure to destroy container can cause the agent to "leak" a GPU
[ https://issues.apache.org/jira/browse/MESOS-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088973#comment-17088973 ] Charles Natali commented on MESOS-10119: So for the good news: I couldn't reproduce it - it turned out to be a bug in one of our legacy systems which caused it to remove the agent's cgroups... However I did observe this particular failure as a consequence of the now fixed https://issues.apache.org/jira/browse/MESOS-10107 > Marking as a duplicate of MESOS-8038. Ah, let's close this one then. > failure to destroy container can cause the agent to "leak" a GPU > > > Key: MESOS-10119 > URL: https://issues.apache.org/jira/browse/MESOS-10119 > Project: Mesos > Issue Type: Task > Components: agent, containerization >Reporter: Charles Natali >Priority: Major > > At work we hit the following problem: > # cgroup for a task using the GPU isolation failed to be destroyed on OOM > # the agent continued advertising the GPU as available > # all subsequent attempts to start tasks using a GPU fails with "Requested 1 > gpus but only 0 available" > Problem 1 looks like https://issues.apache.org/jira/browse/MESOS-9950) so can > be tackled separately, however the fact that the agent basically leaks the > GPU is pretty bad, because it basically turns into /dev/null, failing all > subsequent tasks requesting a GPU. 
> > See the logs: > > > {noformat} > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874277 2138 > memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or > directory > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874305 2138 > memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or > directory > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874315 2138 > memory.cpp:686] Failed to read 'memory.stat': No such file or directory > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874701 2136 > memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or > directory > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874734 2136 > memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or > directory > Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874747 2136 > memory.cpp:686] Failed to read 'memory.stat': No such file or directory > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.062358 2152 > slave.cpp:6994] Termination of executor > 'task_0:067b0963-134f-a917-4503-89b6a2a630ac' of framework > c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to clean up an > isolator when destroying container: Failed to destroy cgroups: Failed to get > nested cgroups: Failed to determine canonical path of > '/sys/fs/cgroup/memory/mesos/8ef00748-b640-4620-97dc-f719e9775e88': No such > file or directory > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063295 2150 > containerizer.cpp:2567] Skipping status for container > 8ef00748-b640-4620-97dc-f719e9775e88 because: Container does not exist > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063429 2137 > containerizer.cpp:2428] Ignoring update for currently being destroyed > container 8ef00748-b640-4620-97dc-f719e9775e88 > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.079169 2150 > slave.cpp:6994] Termination of executor > 
'task_1:a00165a1-123a-db09-6b1a-b6c4054b0acd' of framework > c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to kill all > processes in the container: Failed to remove cgroup > 'mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Failed to remove cgroup > '/sys/fs/cgroup/freezer/mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Device > or resource busy > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079537 2140 > containerizer.cpp:2567] Skipping status for container > 5c1418f0-1d4d-47cd-a188-0f4b87e394f2 because: Container does not exist > Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079670 2136 > containerizer.cpp:2428] Ignoring update for currently being destroyed > container 5c1418f0-1d4d-47cd-a188-0f4b87e394f2 > Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.956969 2136 > slave.cpp:6889] Container '87253521-8d39-47ea-b4d1-febe527d230c' for executor > 'task_2:8b129d24-70d2-2cab-b2df-c73911954ec3' of framework > c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed to start: Requested 1 gpus > but only 0 available > Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.957670 2149 > memory.cpp:637] Listening on OOM events failed for container > 87253521-8d39-47ea-b4d1-febe527d230c: Event listener is terminating > Apr 17 17:00:07 engpuc006 mesos-slave[2068]: W0417 17:00:07.966552 2150 > containerizer.cpp:2421] Ignoring
[jira] [Created] (MESOS-10119) failure to destroy container can cause the agent to "leak" a GPU
Charles Natali created MESOS-10119: -- Summary: failure to destroy container can cause the agent to "leak" a GPU Key: MESOS-10119 URL: https://issues.apache.org/jira/browse/MESOS-10119 Project: Mesos Issue Type: Task Components: agent, containerization Reporter: Charles Natali At work we hit the following problem: # cgroup for a task using the GPU isolation failed to be destroyed on OOM # the agent continued advertising the GPU as available # all subsequent attempts to start tasks using a GPU fail with "Requested 1 gpus but only 0 available" Problem 1 looks like https://issues.apache.org/jira/browse/MESOS-9950 so can be tackled separately, however the fact that the agent basically leaks the GPU is pretty bad, because it basically turns into /dev/null, failing all subsequent tasks requesting a GPU. See the logs: {noformat} Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874277 2138 memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or directory Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874305 2138 memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or directory Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874315 2138 memory.cpp:686] Failed to read 'memory.stat': No such file or directory Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874701 2136 memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or directory Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874734 2136 memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or directory Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874747 2136 memory.cpp:686] Failed to read 'memory.stat': No such file or directory Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.062358 2152 slave.cpp:6994] Termination of executor 'task_0:067b0963-134f-a917-4503-89b6a2a630ac' of framework c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to clean up an isolator when 
destroying container: Failed to destroy cgroups: Failed to get nested cgroups: Failed to determine canonical path of '/sys/fs/cgroup/memory/mesos/8ef00748-b640-4620-97dc-f719e9775e88': No such file or directory Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063295 2150 containerizer.cpp:2567] Skipping status for container 8ef00748-b640-4620-97dc-f719e9775e88 because: Container does not exist Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063429 2137 containerizer.cpp:2428] Ignoring update for currently being destroyed container 8ef00748-b640-4620-97dc-f719e9775e88 Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.079169 2150 slave.cpp:6994] Termination of executor 'task_1:a00165a1-123a-db09-6b1a-b6c4054b0acd' of framework c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to kill all processes in the container: Failed to remove cgroup 'mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Failed to remove cgroup '/sys/fs/cgroup/freezer/mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Device or resource busy Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079537 2140 containerizer.cpp:2567] Skipping status for container 5c1418f0-1d4d-47cd-a188-0f4b87e394f2 because: Container does not exist Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079670 2136 containerizer.cpp:2428] Ignoring update for currently being destroyed container 5c1418f0-1d4d-47cd-a188-0f4b87e394f2 Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.956969 2136 slave.cpp:6889] Container '87253521-8d39-47ea-b4d1-febe527d230c' for executor 'task_2:8b129d24-70d2-2cab-b2df-c73911954ec3' of framework c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed to start: Requested 1 gpus but only 0 available Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.957670 2149 memory.cpp:637] Listening on OOM events failed for container 87253521-8d39-47ea-b4d1-febe527d230c: Event listener is terminating Apr 17 17:00:07 engpuc006 mesos-slave[2068]: W0417 
17:00:07.966552 2150 containerizer.cpp:2421] Ignoring update for unknown container 87253521-8d39-47ea-b4d1-febe527d230c Apr 17 17:00:08 engpuc006 mesos-slave[2068]: W0417 17:00:08.109067 2154 process.cpp:1480] Failed to link to '172.16.22.201:34059', connect: Failed connect: connection closed Apr 17 17:00:10 engpuc006 mesos-slave[2068]: E0417 17:00:10.310817 2141 slave.cpp:6889] Container '257b45f1-8582-4cb5-8138-454e9697bfe4' for executor 'task_3:6bdd99ca-7a2b-f19c-bbb3-d9478fe8f81e' of framework c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed to start: Requested 1 gpus but only 0 available Apr 17 17:00:10 engpuc006 mesos-slave[2068]: E0417 17:00:10.311614 2141 memory.cpp:637] Listening on OOM events failed for container 257b45f1-8582-4cb5-8138-454e9697bfe4: Event listener is terminating
[jira] [Commented] (MESOS-10110) Libprocess ignores most protobuf (de)serialisation failure cases.
[ https://issues.apache.org/jira/browse/MESOS-10110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077633#comment-17077633 ]

Charles Natali commented on MESOS-10110:

Hey. It's probably my fault: I just created another account, this one, "cf.natali". That's because my previous account "charle" obviously contained a typo and also didn't match the username I used for [https://reviews.apache.org/], and apparently Jira doesn't support changing usernames. Hope I didn't make too much of a mess!

> Libprocess ignores most protobuf (de)serialisation failure cases.
>
> Key: MESOS-10110
> URL: https://issues.apache.org/jira/browse/MESOS-10110
> Project: Mesos
> Issue Type: Bug
> Components: libprocess
> Reporter: Charles
> Priority: Major
>
> Previously, the code didn't check the return value of
> {{Message::SerializeToString}} at all, although it can fail for various reasons,
> e.g. out-of-memory, message too large, or invalid UTF-8 string.
> Also, checking deserialisation for errors using
> {{Message::IsInitialized}} doesn't detect errors such as the above;
> we need to check the return value of {{Message::ParseFromString}}.
>
> We noticed this at work because our custom executor had a bug causing it to
> send an invalid/non-UTF8 {{mesos.TaskID}}, but it was successfully serialised by
> the executor (driver), and deserialised by the framework, which caused
> the framework to blow up at a later point, far from the original source of the problem.
> More generally we want to catch such invalid messages - which can happen for
> a variety of reasons - as early as possible.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
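
The distinction drawn in the quoted description, that a message can pass an "is initialised" check and still fail to serialise, can be illustrated with a toy model. This is only a minimal sketch in Python and does not use the protobuf or libprocess API: `ToyTaskID`, `is_initialized`, `serialize`, and `send` are hypothetical names invented for the illustration, and returning `None` stands in for a serialisation call that reports failure through its return value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyTaskID:
    """Hypothetical stand-in for a message with one required string field."""

    value: Optional[bytes] = None

    def is_initialized(self) -> bool:
        # Only checks that the required field is set, not that it is valid.
        return self.value is not None

    def serialize(self) -> Optional[bytes]:
        # Returns None on failure, modelling a serialiser whose return
        # value signals an error (assumption made for this sketch).
        if self.value is None:
            return None
        try:
            self.value.decode("utf-8")
        except UnicodeDecodeError:
            return None
        return self.value


def send(message: ToyTaskID) -> bytes:
    """Fail at the send site instead of letting a bad message travel on."""
    data = message.serialize()
    if data is None:
        raise ValueError("serialisation failed; refusing to send message")
    return data
```

In this sketch, `ToyTaskID(b"\xff\xfe")` is "initialised" but cannot be serialised, so `send` raises at the point where the invalid value is introduced rather than far downstream.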