[jira] [Commented] (MESOS-10239) Installing Mesos on Oracle Linux 8.3

2022-09-15 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10239?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17605513#comment-17605513
 ] 

Charles Natali commented on MESOS-10239:


Hi [~Mar_zieh],

You don't need Python to install Mesos unless you use the Python bindings.
If you're building from source, you can just pass {{--disable-python}} as 
described here: 
https://mesos.apache.org/documentation/latest/configuration/autotools/
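
For example, a minimal from-source build with the Python bindings disabled 
might look like this (a sketch assuming a git checkout; adjust paths and the 
make parallelism to your machine):

{noformat}
$ ./bootstrap
$ mkdir build && cd build
$ ../configure --disable-python
$ make -j4
{noformat}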

Could you please give details about the error you're getting?

> Installing Mesos on Oracle Linux 8.3
> 
>
> Key: MESOS-10239
> URL: https://issues.apache.org/jira/browse/MESOS-10239
> Project: Mesos
>  Issue Type: Task
>Reporter: Marzieh
>Priority: Major
>
> Some new versions of Linux, like Oracle Linux 8 and Red Hat 8, no longer 
> support Python 2; however, Mesos needs Python 2. So there is no way to 
> install Mesos in these environments.
> Would you please update Mesos so that it can be installed on newer Linux 
> distributions?





[jira] [Commented] (MESOS-10234) CVE-2021-44228 Log4j vulnerability for apache mesos

2022-08-24 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17584433#comment-17584433
 ] 

Charles Natali commented on MESOS-10234:


Hi Sangita,

If this is an issue for you, you can simply use whatever ZooKeeper version 
you want; you do not need to use the shipped one.
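
For example, when building from source there should be a configure switch to 
use an installed ZooKeeper instead of the bundled one; a minimal sketch, 
assuming autotools and an install under /opt/zookeeper:

{noformat}
$ ../configure --with-zookeeper=/opt/zookeeper
{noformat}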

We could also update the bundled ZooKeeper separately; the shipped version is 
quite old and has some known bugs. [~qianzhang], what do you think?

> CVE-2021-44228 Log4j vulnerability for apache mesos
> ---
>
> Key: MESOS-10234
> URL: https://issues.apache.org/jira/browse/MESOS-10234
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.11.0
>Reporter: Sangita Nalkar
>Priority: Critical
>
> Hi,
> Wanted to know whether the CVE-2021-44228 Log4j vulnerability affects Apache 
> Mesos.
> We see that log4j v1.2.17 is used while building Apache Mesos from source.
> Snippet from build logs:
> std=c++11 -MT jvm/org/apache/libjava_la-log4j.lo -MD -MP -MF 
> jvm/org/apache/.deps/libjava_la-log4j.Tpo -c 
> ../../src/jvm/org/apache/log4j.cpp  -fPIC -DPIC -o 
> jvm/org/apache/.libs/libjava_la-log4j.o
> Thanks,
> Sangita





[jira] [Commented] (MESOS-10237) Mesos-slave issue report

2022-03-24 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17512051#comment-17512051
 ] 

Charles Natali commented on MESOS-10237:


Hi [~feixiachao],

Are you having a specific problem or just wondering about those error messages?
Those errors are benign and can be ignored - they've actually been fixed in 
master: 
https://github.com/apache/mesos/commit/6bc5a5e114077f542f7258adffb78a54849ddf90
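
If you want to check whether a given release contains that fix, something 
like this from a Mesos checkout should tell you (no output means no release 
includes it yet):

{noformat}
$ git tag --contains 6bc5a5e114077f542f7258adffb78a54849ddf90
{noformat}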

> Mesos-slave issue report 
> -
>
> Key: MESOS-10237
> URL: https://issues.apache.org/jira/browse/MESOS-10237
> Project: Mesos
>  Issue Type: Bug
>Reporter: feixiachao
>Priority: Major
>
> We encountered an issue with mesos-slave; the mesos.ERROR log is shown 
> below:
> E0323 22:56:03.278918  2848 memory.cpp:502] Listening on OOM events failed 
> for container ff408971-b610-4f84-bbc3-81b0c6be9499: Event listener is 
> terminating
> E0323 22:58:06.018554  2834 memory.cpp:502] Listening on OOM events failed 
> for container 3afa2056-1976-4857-9121-cfad0f0ba73e: Event listener is 
> terminating
> E0323 23:12:05.261996  2816 memory.cpp:502] Listening on OOM events failed 
> for container 56912877-5733-4050-bce8-0cc179cc0bc8: Event listener is 
> terminating
> Could someone help with this issue?
>  





[jira] [Commented] (MESOS-10234) CVE-2021-44228 Log4j vulnerability for apache mesos

2022-02-15 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17492857#comment-17492857
 ] 

Charles Natali commented on MESOS-10234:


Hi,

I cannot see an explicit dependency on log4j v1.2.17 - are you sure the build 
is not picking up your system's version?

Then again, I'm really not familiar with the Java bindings.

Note that the only log4j shipped with Mesos comes with the bundled ZooKeeper 
version:


{noformat}
./build/3rdparty/zookeeper-3.4.8/lib/slf4j-log4j12-1.6.1.jar
./build/3rdparty/zookeeper-3.4.8/lib/log4j-1.2.16.LICENSE.txt
./build/3rdparty/zookeeper-3.4.8/lib/log4j-1.2.16.jar
./build/3rdparty/zookeeper-3.4.8/src/java/lib/log4j-1.2.16.LICENSE.txt
./build/3rdparty/zookeeper-3.4.8/src/contrib/loggraph/web/org/apache/zookeeper/graph/log4j.properties
./build/3rdparty/zookeeper-3.4.8/src/contrib/rest/conf/log4j.properties
./build/3rdparty/zookeeper-3.4.8/src/contrib/zooinspector/lib/log4j.properties
./build/3rdparty/zookeeper-3.4.8/conf/log4j.properties
./build/3rdparty/zookeeper-3.4.8/contrib/rest/lib/slf4j-log4j12-1.6.1.jar
./build/3rdparty/zookeeper-3.4.8/contrib/rest/lib/log4j-1.2.15.jar
./build/3rdparty/zookeeper-3.4.8/contrib/rest/conf/log4j.properties

{noformat}
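
For reference, a listing like the one above can be regenerated from the build 
tree with something like:

{noformat}
$ find ./build/3rdparty -iname '*log4j*'
{noformat}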


I'm not sure anyone uses the shipped version, but maybe we could update it. 
What do you think, [~asekretenko]?

Note that at work we experienced a ZooKeeper bug following a failover which 
IIRC caused some ephemeral nodes not to be deleted on the promoted leader, 
leading to inconsistencies in the Mesos registry - so updating could also fix 
that for whoever happens to use the bundled version.

> CVE-2021-44228 Log4j vulnerability for apache mesos
> ---
>
> Key: MESOS-10234
> URL: https://issues.apache.org/jira/browse/MESOS-10234
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.11.0
>Reporter: Sangita Nalkar
>Priority: Critical
>
> Hi,
> Wanted to know whether the CVE-2021-44228 Log4j vulnerability affects Apache 
> Mesos.
> We see that log4j v1.2.17 is used while building Apache Mesos from source.
> Snippet from build logs:
> std=c++11 -MT jvm/org/apache/libjava_la-log4j.lo -MD -MP -MF 
> jvm/org/apache/.deps/libjava_la-log4j.Tpo -c 
> ../../src/jvm/org/apache/log4j.cpp  -fPIC -DPIC -o 
> jvm/org/apache/.libs/libjava_la-log4j.o
> Thanks,
> Sangita





[jira] [Commented] (MESOS-10234) CVE-2021-44228 Log4j vulnerability for apache mesos

2021-12-31 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17467229#comment-17467229
 ] 

Charles Natali commented on MESOS-10234:


Hi [~snalkar],

Sorry for the delay; Mesos has very few resources, and the holiday season 
doesn't help.

I've had a quick look, and log4j only seems to be used for tests - Mesos is 
mostly written in C++, so that's not surprising.
It's possible it's used in some of the bundled third-party dependencies, but 
I'd be surprised if it were exploitable.
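
For anyone who wants to double-check in the meantime, a quick way to see 
where log4j is referenced in the tree is something like:

{noformat}
$ git grep -il log4j -- src 3rdparty
{noformat}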

I'll have a more thorough look after the holidays.

Cheers,

> CVE-2021-44228 Log4j vulnerability for apache mesos
> ---
>
> Key: MESOS-10234
> URL: https://issues.apache.org/jira/browse/MESOS-10234
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.11.0
>Reporter: Sangita Nalkar
>Priority: Critical
>
> Hi,
> Wanted to know whether the CVE-2021-44228 Log4j vulnerability affects Apache 
> Mesos.
> We see that log4j v1.2.17 is used while building Apache Mesos from source.
> Snippet from build logs:
> std=c++11 -MT jvm/org/apache/libjava_la-log4j.lo -MD -MP -MF 
> jvm/org/apache/.deps/libjava_la-log4j.Tpo -c 
> ../../src/jvm/org/apache/log4j.cpp  -fPIC -DPIC -o 
> jvm/org/apache/.libs/libjava_la-log4j.o
> Thanks,
> Sangita





[jira] [Assigned] (MESOS-9657) Launching a command task twice can crash the agent

2021-10-16 Thread Charles Natali (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-9657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Natali reassigned MESOS-9657:
-

Fix Version/s: 1.12.0
 Assignee: Charles Natali
   Resolution: Fixed

> Launching a command task twice can crash the agent
> --
>
> Key: MESOS-9657
> URL: https://issues.apache.org/jira/browse/MESOS-9657
> Project: Mesos
>  Issue Type: Bug
>Reporter: Benno Evers
>Assignee: Charles Natali
>Priority: Major
> Fix For: 1.12.0
>
>
> When launching a command task, we verify that the framework has no existing 
> executor for that task:
> {noformat}
>   // We are dealing with command task; a new command executor will be
>   // launched.
>   CHECK(executor == nullptr);
> {noformat}
> and afterwards an executor is created with the same executor id as the task 
> id:
> {noformat}
>   // (slave.cpp)
>   // Either the master explicitly requests launching a new executor
>   // or we are in the legacy case of launching one if there wasn't
>   // one already. Either way, let's launch executor now.
>   if (executor == nullptr) {
> Try added = framework->addExecutor(executorInfo);
>   [...]
> {noformat}
> This means that if we relaunch the task with the same task id before the 
> executor is removed, it will crash the agent:
> {noformat}
> F0315 16:39:32.822818 38112 slave.cpp:2865] Check failed: executor == nullptr 
> *** Check failure stack trace: ***
> @ 0x7feb29a407af  google::LogMessage::Flush()
> @ 0x7feb29a43c3f  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7feb28a5a886  mesos::internal::slave::Slave::__run()
> @ 0x7feb28af4f0e  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSK_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaISU_EERKSK_IbESG_SJ_SO_SS_SY_S11_EEvRKNS1_3PIDIT_EEMS13_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSE_OSH_OSM_OSQ_OSW_OSZ_S3_E_JSE_SH_SM_SQ_SW_SZ_St12_PlaceholderILi1EEclEOS3_
> @ 0x7feb2998a620  process::ProcessBase::consume()
> @ 0x7feb29987675  process::ProcessManager::resume()
> @ 0x7feb299a2d2b  
> _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7process14ProcessManager12init_threadsEvE3$_8E6_M_runEv
> @ 0x7feb2632f523  (unknown)
> @ 0x7feb25e40594  start_thread
> @ 0x7feb25b73e6f  __GI___clone
> Aborted (core dumped)
> {noformat}
> Instead of crashing, the agent should just drop the task with an appropriate 
> error in this case.





[jira] [Commented] (MESOS-10198) Mesos-master service is activating state

2021-09-26 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420368#comment-17420368
 ] 

Charles Natali commented on MESOS-10198:


[~kiranjshetty]

I assume you've since moved on, so unless there is an update to this ticket 
soon, I will close it.

Cheers,


> Mesos-master service is activating state
> 
>
> Key: MESOS-10198
> URL: https://issues.apache.org/jira/browse/MESOS-10198
> Project: Mesos
>  Issue Type: Task
>Affects Versions: 1.9.0
>Reporter: Kiran J Shetty
>Priority: Major
>
> The mesos-master service is showing the activating state on all 3 master 
> nodes, which in turn makes Marathon restart frequently. In the logs I can 
> see the entries below.
>  Mesos-master logs:
> Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a864206a9 
> mesos::internal::log::ReplicaProcess::ReplicaProcess()
>  Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a86420854 
> mesos::internal::log::Replica::Replica()
>  Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a863b6a65 
> mesos::internal::log::LogProcess::LogProcess()
>  Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a863b6e34 
> mesos::log::Log::Log()
>  Nov 12 08:36:29 servername mesos-master[19867]: @ 0x561155a3ec72 main
>  Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a8207 
> __libc_start_main
>  Nov 12 08:36:29 servername mesos-master[19867]: @ 0x561155a40d0a (unknown)
>  Nov 12 08:36:29 servername systemd[1]: mesos-master.service: main process 
> exited, code=killed, status=6/ABRT
>  Nov 12 08:36:29 servername systemd[1]: Unit mesos-master.service entered 
> failed state.
>  Nov 12 08:36:29 servername systemd[1]: mesos-master.service failed.
>  Nov 12 08:36:49 servername systemd[1]: mesos-master.service holdoff time 
> over, scheduling restart.
>  Nov 12 08:36:49 servername systemd[1]: Stopped Mesos Master.
>  Nov 12 08:36:49 servername systemd[1]: Started Mesos Master.
>  Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.633597 20024 
> logging.cpp:201] INFO level logging started!
>  Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634446 20024 
> main.cpp:243] Build: 2019-10-21 12:10:14 by centos
>  Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634460 20024 
> main.cpp:244] Version: 1.9.0
>  Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634466 20024 
> main.cpp:247] Git tag: 1.9.0
>  Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634470 20024 
> main.cpp:251] Git SHA: 5e79a584e6ec3e9e2f96e8bf418411df9dafac2e
>  Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.636653 20024 
> main.cpp:345] Using 'hierarchical' allocator
>  Nov 12 08:36:49 servername mesos-master[20037]: mesos-master: 
> ./db/skiplist.h:344: void leveldb::SkipList<Key, Comparator>::Insert(const 
> Key&) [with Key = const char*; Comparator = 
> leveldb::MemTable::KeyComparator]: Assertion `x == __null || !Equal(key, 
> x->key)' failed.
>  Nov 12 08:36:49 servername mesos-master[20037]: *** Aborted at 1605150409 
> (unix time) try "date -d @1605150409" if you are using GNU date ***
>  Nov 12 08:36:49 servername mesos-master[20037]: PC: @ 0x7fdee16ed387 
> __GI_raise
>  Nov 12 08:36:49 servername mesos-master[20037]: *** SIGABRT (@0x4e38) 
> received by PID 20024 (TID 0x7fdee720ea00) from PID 20024; stack trace: ***
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee1fb2630 (unknown)
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16ed387 __GI_raise
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16eea78 __GI_abort
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16e61a6 
> __assert_fail_base
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16e6252 
> __GI___assert_fail
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5cf3dc2 
> leveldb::SkipList<>::Insert()
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5cf3735 
> leveldb::MemTable::Add()
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5d00168 
> leveldb::WriteBatch::Iterate()
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5d00424 
> leveldb::WriteBatchInternal::InsertInto()
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5ce8575 
> leveldb::DBImpl::RecoverLogFile()
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5cec0fc 
> leveldb::DBImpl::Recover()
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5cec3fa 
> leveldb::DB::Open()
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5a0f877 
> mesos::internal::log::LevelDBStorage::restore()
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5a817a2 
> mesos::internal::log::ReplicaProcess::restore()
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5a846a9 
> mesos::internal::log::ReplicaProcess::ReplicaProcess()

[jira] [Commented] (MESOS-10230) Please update JQuery from 3.2.1 to 3.5.0+

2021-09-26 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420364#comment-17420364
 ] 

Charles Natali commented on MESOS-10230:


[~apeters]
Would you be able to look at this?

I think [~pengels] might be referring to 
https://github.com/apache/mesos/blob/master/src/webui/assets/libs/jquery-3.2.1.min.js

Note, however, that we also ship jquery 1.10.1, which is also affected:
https://github.com/apache/mesos/blob/master/site/source/assets/js/jquery-1.10.1.min.js

and in mesos-site: 
https://github.com/apache/mesos-site/blob/asf-site/content/assets/js/jquery-1.10.1.min.js

I am absolutely not familiar with web development, so even though I could 
probably update it, I wouldn't know how to check whether it broke anything.
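
If someone wants to try, the mechanical part of the update is probably just 
replacing the bundled file and fixing up any references to it. An untested 
sketch - the target version and CDN URL are my assumptions, not something the 
project has decided on:

{noformat}
$ curl -o src/webui/assets/libs/jquery-3.5.1.min.js \
    https://code.jquery.com/jquery-3.5.1.min.js
$ git grep -l 'jquery-3.2.1.min.js' -- src/webui
{noformat}

The hard part, as said above, would be verifying that the web UI still works.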

> Please update JQuery from 3.2.1 to 3.5.0+
> -
>
> Key: MESOS-10230
> URL: https://issues.apache.org/jira/browse/MESOS-10230
> Project: Mesos
>  Issue Type: Improvement
>  Components: security
>Affects Versions: 1.11.0
>Reporter: p engels
>Priority: Minor
>
> JQuery versions between 1.2 and 3.5.0 are vulnerable to multiple 
> cross-site-scripting vulnerabilities. More info can be found on JQuery's 
> website:
> blog.jquery.com: [https://blog.jquery.com/2020/04/10/jquery-3-5-0-released/]
> My organization's vulnerability scanner locates the out-of-date jquery at 
> this url (sanitized for security reasons):
> [http://example.com:5050/assets/libs/jquery-3.2.1.min.js]
>  
> Please remove the old version of JQuery and replace it with version 3.5.0 or 
> greater. If this is already planned for a future release, please comment on 
> this request with the version this will be fixed in.
>  
> Keep up the good work, Apache community <3





[jira] [Commented] (MESOS-10228) My current problem is that after mesos-Agent added the option to support GPU, starting Docker through Marathon cannot succeed

2021-09-26 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420361#comment-17420361
 ] 

Charles Natali commented on MESOS-10228:


Hi [~barrylee],

It's not clear to me if this is linked to the other issue you opened: 
https://issues.apache.org/jira/browse/MESOS-10227

Note that Marathon is a project distinct from Mesos, so you might want to 
report it to them (although I am not sure that project is still active).

> My current problem is that after mesos-Agent added the option to support GPU, 
> starting Docker through Marathon cannot succeed
> -
>
> Key: MESOS-10228
> URL: https://issues.apache.org/jira/browse/MESOS-10228
> Project: Mesos
>  Issue Type: Task
>  Components: agent, framework
>Affects Versions: 1.11.0
>Reporter: barry lee
>Priority: Major
> Fix For: 1.11.0
>
> Attachments: image-2021-08-19-19-22-51-456.png
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> My current problem is that after adding the GPU support option to the 
> mesos-agent, starting Docker through Marathon cannot succeed.
> mesos-agent \
> --master=zk://192.168.10.191:2181,192.168.10.192:2181,192.168.10.193:2181/mesos
>  \
> --log_dir=/var/log/mesos \
> --containerizers=docker,mesos \
> --executor_registration_timeout=5mins \
> --hostname=192.168.10.19 \
> --ip=192.168.10.19 \
> --port=5051 \
> --work_dir=/var/lib/mesos \
> --image_providers=docker \
> --executor_environment_variables="{}" \
> --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia"
>  
> In the mesos-agent GPU option, this is useful when there is no GPU node.
>  
> !image-2021-08-19-19-22-51-456.png!





[jira] [Commented] (MESOS-10227) After mesos-agent starts, mesos-exeute fails to be executed using the GPU

2021-09-26 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10227?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17420360#comment-17420360
 ] 

Charles Natali commented on MESOS-10227:


Hi [~barrylee],

Sorry for the delay.
Is this still a problem?
The log you provided is truncated; it would be useful to get:
- the agent logs from when the task is started
- the executor logs (see below for where to find them)
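
For reference, the executor's stdout/stderr end up in its sandbox under the 
agent work directory; with the flags from this ticket that would be roughly 
(the IDs are placeholders):

{noformat}
/var/lib/mesos/slaves/<agent-id>/frameworks/<framework-id>/executors/gpu-test/runs/latest/stdout
/var/lib/mesos/slaves/<agent-id>/frameworks/<framework-id>/executors/gpu-test/runs/latest/stderr
{noformat}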



> After mesos-agent starts, mesos-exeute fails to be executed using the GPU
> -
>
> Key: MESOS-10227
> URL: https://issues.apache.org/jira/browse/MESOS-10227
> Project: Mesos
>  Issue Type: Task
>  Components: agent
>Affects Versions: 1.11.0
> Environment: mesos-agent \
> --master=zk://192.168.10.191:2181,192.168.10.192:2181,192.168.10.193:2181/mesos
>  \
> --log_dir=/var/log/mesos --containerizers=docker,mesos \
> --executor_registration_timeout=5mins \
> --hostname=192.168.10.19 \
> --ip=192.168.10.19 \
> --port=5051 \
> --work_dir=/var/lib/mesos \
> --image_providers=docker \
> --executor_environment_variables="{}" \
> --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia"
>  
>  
> mesos-execute \
> --master=zk://192.168.10.191:2181,192.168.10.192:2181,192.168.10.193:2181/mesos
>  \
> --name=gpu-test \
> --docker_image=nvidia/cuda \
> --command="nvidia-smi" \
> --framework_capabilities="GPU_RESOURCES" \
> --resources="gpus:1"
>  
>Reporter: barry lee
>Priority: Major
> Fix For: 1.11.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> I0819 18:14:26.088129 9337 containerizer.cpp:3414] Transitioning the state of 
> container fab468e6-bcbd-499c-9c24-ccd572c8317b from PROVISIONING to 
> DESTROYING after 2.207289088secs
> I0819 18:14:26.089609 9339 slave.cpp:7100] Executor 'gpu-test' of framework 
> d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027 has terminated with unknown status
> I0819 18:14:26.091435 9339 slave.cpp:5981] Handling status update TASK_FAILED 
> (Status UUID: 0abd4e4b-59a6-4610-b624-05762ab9fc17) for task gpu-test of 
> framework d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027 from @0.0.0.0:0
> E0819 18:14:26.092530 9346 slave.cpp:6357] Failed to update resources for 
> container fab468e6-bcbd-499c-9c24-ccd572c8317b of executor 'gpu-test' running 
> task gpu-test on status update for terminal task, destroying container: 
> Container not found
> W0819 18:14:26.092737 9341 composing.cpp:614] Attempted to destroy unknown 
> container fab468e6-bcbd-499c-9c24-ccd572c8317b
> I0819 18:14:26.092895 9331 task_status_update_manager.cpp:328] Received task 
> status update TASK_FAILED (Status UUID: 0abd4e4b-59a6-4610-b624-05762ab9fc17) 
> for task gpu-test of framework d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027
> I0819 18:14:26.093626 9333 slave.cpp:6527] Forwarding the update TASK_FAILED 
> (Status UUID: 0abd4e4b-59a6-4610-b624-05762ab9fc17) for task gpu-test of 
> framework d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027 to 
> master@192.168.10.192:5050
> I0819 18:14:26.102195 9342 slave.cpp:4310] Shutting down framework 
> d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027
> I0819 18:14:26.102257 9342 slave.cpp:7218] Cleaning up executor 'gpu-test' of 
> framework d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027
> I0819 18:14:26.102448 9332 gc.cpp:95] Scheduling 
> '/var/lib/mesos/slaves/d5cb56f3-1f2f-49e6-b63b-a401e445104d-S125/frameworks/d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027/executors/gpu-test/runs/fab468e6-bcbd-499c-9c24-ccd572c8317b'
>  for gc 6.988156days in the future
> I0819 18:14:26.102600 9332 gc.cpp:95] Scheduling 
> '/var/lib/mesos/slaves/d5cb56f3-1f2f-49e6-b63b-a401e445104d-S125/frameworks/d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027/executors/gpu-test'
>  for gc 6.9881303111days in the future
> I0819 18:14:26.102725 9342 slave.cpp:7347] Cleaning up framework 
> d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027
> I0819 18:14:26.102805 9335 task_status_update_manager.cpp:289] Closing task 
> status update streams for framework d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027
> I0819 18:14:26.102901 9342 gc.cpp:95] Scheduling 
> '/var/lib/mesos/slaves/d5cb56f3-1f2f-49e6-b63b-a401e445104d-S125/frameworks/d5cb56f3-1f2f-49e6-b63b-a401e445104d-0027'
>  for gc 6.9881020741days in the future
> I0819 18:14:34.385221 9334 http.cpp:1436] HTTP GET for 
> /files/browse?path=%2Fvar%2Flib%2Fmesos%2Fslaves%2Fd5cb56f3-1f2f-49e6-b63b-a401e445104d-S125=angular.callbacks._67
>  from 192.168.110.142:11640 with User-Agent='Mozilla/5.0 (Windows NT 10.0; 
> Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 
> Safari/537.36'
> I0819 18:14:45.385519 9344 http.cpp:1436] HTTP GET for 
> /files/browse?path=%2Fvar%2Flib%2Fmesos%2Fslaves%2Fd5cb56f3-1f2f-49e6-b63b-a401e445104d-S125=angular.callbacks._6a
>  from 192.168.110.142:11690 with User-Agent='Mozilla/5.0 (Windows NT 10.0; 
> Win64; x64) 

[jira] [Commented] (MESOS-10198) Mesos-master service is activating state

2021-08-07 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17395290#comment-17395290
 ] 

Charles Natali commented on MESOS-10198:


Hi [~kiranjshetty], sorry for the delay, I know it's been a while.


{noformat}
Nov 12 08:36:49 servername mesos-master[20037]: mesos-master: 
./db/skiplist.h:344: void leveldb::SkipList<Key, Comparator>::Insert(const 
Key&) [with Key = const char*; Comparator = leveldb::MemTable::KeyComparator]: 
Assertion `x == __null || !Equal(key, x->key)' failed.
{noformat}


This points to corruption of the on-disk leveldb database - it's been a long 
time, but do you remember:
- was this specific error present in all the masters' logs?
- did the hosts maybe crash prior to that?
- I guess it's too late now, but it would have been interesting to see the 
logs from the first time the masters crashed.

Looking at our code, it's not clear to me how we could introduce a leveldb 
corruption - the only possibilities I can think of are a leveldb bug, or maybe 
some unrelated code ending up writing to the leveldb file descriptors under 
specific conditions, which could cause such a corruption.
But having it occur across all masters seems very unlikely.
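
If this ever comes back: one recovery option, assuming the other masters 
still have healthy replicas, is to stop the affected master, move its 
replicated log aside, and let it re-sync on restart. A rough sketch (the 
work_dir path is an assumption, and this must be done on one master at a 
time, never on all of them, or the registry is lost):

{noformat}
$ systemctl stop mesos-master
$ mv /var/lib/mesos/replicated_log /var/lib/mesos/replicated_log.corrupt
$ systemctl start mesos-master
{noformat}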

> Mesos-master service is activating state
> 
>
> Key: MESOS-10198
> URL: https://issues.apache.org/jira/browse/MESOS-10198
> Project: Mesos
>  Issue Type: Task
>Affects Versions: 1.9.0
>Reporter: Kiran J Shetty
>Priority: Major
>
> The mesos-master service is showing the activating state on all 3 master 
> nodes, which in turn makes Marathon restart frequently. In the logs I can 
> see the entries below.
>  Mesos-master logs:
> Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a864206a9 
> mesos::internal::log::ReplicaProcess::ReplicaProcess()
>  Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a86420854 
> mesos::internal::log::Replica::Replica()
>  Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a863b6a65 
> mesos::internal::log::LogProcess::LogProcess()
>  Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a863b6e34 
> mesos::log::Log::Log()
>  Nov 12 08:36:29 servername mesos-master[19867]: @ 0x561155a3ec72 main
>  Nov 12 08:36:29 servername mesos-master[19867]: @ 0x7f1a8207 
> __libc_start_main
>  Nov 12 08:36:29 servername mesos-master[19867]: @ 0x561155a40d0a (unknown)
>  Nov 12 08:36:29 servername systemd[1]: mesos-master.service: main process 
> exited, code=killed, status=6/ABRT
>  Nov 12 08:36:29 servername systemd[1]: Unit mesos-master.service entered 
> failed state.
>  Nov 12 08:36:29 servername systemd[1]: mesos-master.service failed.
>  Nov 12 08:36:49 servername systemd[1]: mesos-master.service holdoff time 
> over, scheduling restart.
>  Nov 12 08:36:49 servername systemd[1]: Stopped Mesos Master.
>  Nov 12 08:36:49 servername systemd[1]: Started Mesos Master.
>  Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.633597 20024 
> logging.cpp:201] INFO level logging started!
>  Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634446 20024 
> main.cpp:243] Build: 2019-10-21 12:10:14 by centos
>  Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634460 20024 
> main.cpp:244] Version: 1.9.0
>  Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634466 20024 
> main.cpp:247] Git tag: 1.9.0
>  Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.634470 20024 
> main.cpp:251] Git SHA: 5e79a584e6ec3e9e2f96e8bf418411df9dafac2e
>  Nov 12 08:36:49 servername mesos-master[20037]: I1112 08:36:49.636653 20024 
> main.cpp:345] Using 'hierarchical' allocator
>  Nov 12 08:36:49 servername mesos-master[20037]: mesos-master: 
> ./db/skiplist.h:344: void leveldb::SkipList<Key, Comparator>::Insert(const 
> Key&) [with Key = const char*; Comparator = 
> leveldb::MemTable::KeyComparator]: Assertion `x == __null || !Equal(key, 
> x->key)' failed.
>  Nov 12 08:36:49 servername mesos-master[20037]: *** Aborted at 1605150409 
> (unix time) try "date -d @1605150409" if you are using GNU date ***
>  Nov 12 08:36:49 servername mesos-master[20037]: PC: @ 0x7fdee16ed387 
> __GI_raise
>  Nov 12 08:36:49 servername mesos-master[20037]: *** SIGABRT (@0x4e38) 
> received by PID 20024 (TID 0x7fdee720ea00) from PID 20024; stack trace: ***
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee1fb2630 (unknown)
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16ed387 __GI_raise
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16eea78 __GI_abort
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16e61a6 
> __assert_fail_base
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee16e6252 
> __GI___assert_fail
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 0x7fdee5cf3dc2 
> leveldb::SkipList<>::Insert()
>  Nov 12 08:36:49 servername mesos-master[20037]: @ 

[jira] [Commented] (MESOS-10200) cmake target "install" not available in 1.10.x branch

2021-08-02 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17391782#comment-17391782
 ] 

Charles Natali commented on MESOS-10200:


[~apeters]
It's not quite clear to me: is this still a problem on master?
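
If you can still reproduce it, the output of something like this from a fresh 
1.10.x build directory would help confirm whether the target is really 
missing (a sketch reusing the same commands as in the report):

{noformat}
$ cmake --build . --target help | grep -i install
{noformat}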

> cmake target "install" not available in 1.10.x branch
> -
>
> Key: MESOS-10200
> URL: https://issues.apache.org/jira/browse/MESOS-10200
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.10.0
> Environment: OS: Mac OS X Catalina (10.15.7).
>Reporter: PRUDHVI RAJ MULAGAPATI
>Priority: Major
> Attachments: 10198.html
>
>
> I am trying to build Mesos on Mac OS X 10.15.7 (Catalina) following the 
> official documentation. In the 1.10.x branch the cmake target "install" is 
> not found; however, I was able to build and install from the 1.11.x and 
> master branches. Listed below are the available targets as shown by 
> {{cmake --build . --target help}}.
>  
> cmake --build . --target install
> make: *** No rule to make target `install'. Stop.
>  
> cmake --build . --target help
> The following are some of the valid targets for this Makefile:
> ... all (the default if no target is provided)
> ... clean
> ... depend
> ... edit_cache
> ... package
> ... package_source
> ... rebuild_cache
> ... test
> ... boost-1.65.0
> ... check
> ... concurrentqueue-7b69a8f
> ... csi_v0-0.2.0
> ... csi_v1-1.1.0
> ... dist
> ... distcheck
> ... elfio-3.2
> ... glog-0.4.0
> ... googletest-1.8.0
> ... grpc-1.10.0
> ... http_parser-2.6.2
> ... leveldb-1.19
> ... libarchive-3.3.2
> ... libev-4.22
> ... make_bin_include_dir
> ... make_bin_java_dir
> ... make_bin_jni_dir
> ... make_bin_src_dir
> ... nvml-352.79
> ... picojson-1.3.0
> ... protobuf-3.5.0
> ... rapidjson-1.1.0
> ... tests
> ... zookeeper-3.4.8
> ... balloon-executor
> ... balloon-framework
> ... benchmarks
> ... disk-full-framework
> ... docker-no-executor-framework
> ... dynamic-reservation-framework
> ... example
> ... examplemodule
> ... fixed_resource_estimator
> ... inverse-offer-framework
> ... libprocess-tests
> ... load-generator-framework
> ... load_qos_controller
> ... logrotate_container_logger
> ... long-lived-executor
> ... long-lived-framework
> ... mesos
> ... mesos-agent
> ... mesos-cli
> ... mesos-cni-port-mapper
> ... mesos-containerizer
> ... mesos-default-executor
> ... mesos-docker-executor
> ... mesos-execute
> ... mesos-executor
> ... mesos-fetcher
> ... mesos-io-switchboard
> ... mesos-local
> ... mesos-log
> ... mesos-logrotate-logger
> ... mesos-master
> ... mesos-protobufs
> ... mesos-resolve
> ... mesos-tcp-connect
> ... mesos-tests
> ... mesos-usage
> ... no-executor-framework
> ... operation-feedback-framework
> ... persistent-volume-framework
> ... process
> ... stout-tests
> ... test-csi-user-framework
> ... test-executor
> ... test-framework
> ... test-helper
> ... test-http-executor
> ... test-http-framework
> ... test-linkee
> ... testallocator
> ... testanonymous
> ... testauthentication
> ... testauthorizer
> ... testcontainer_logger
> ... testhook
> ... testhttpauthenticator
> ... testisolator
> ... testmastercontender
> ... testmasterdetector
> ... testqos_controller
> ... testresource_estimator
> ... uri_disk_profile_adaptor





[jira] [Commented] (MESOS-10226) test suite hangs on ARM64

2021-08-02 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17391736#comment-17391736
 ] 

Charles Natali commented on MESOS-10226:


Hm, it's annoying - the gdb backtrace you posted shows that the test run gets 
stuck in this test, but for some reason running this test on its own isn't 
enough to reproduce it.
It's going to be very difficult to debug without being able to run the tests 
myself.

> test suite hangs on ARM64
> -
>
> Key: MESOS-10226
> URL: https://issues.apache.org/jira/browse/MESOS-10226
> Project: Mesos
>  Issue Type: Bug
>Reporter: Charles Natali
>Assignee: Charles Natali
>Priority: Major
> Attachments: gdb-thread-apply-bt-all-29.07.2021-2.txt, 
> gdb-thread-apply-bt-all-29.07.2021.txt
>
>
> Reported by [~mgrigorov].
>  
> {noformat}
> [ RUN      ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace
> sh: 1: hadoop: not found
> Marked '/' as rslave
> I0726 11:59:17.812630    32 exec.cpp:164] Version: 1.12.0
> I0726 11:59:17.827512    31 exec.cpp:237] Executor registered on agent 
> 9076f44b-846d-4f00-a2dc-11f694cc1900-S0
> I0726 11:59:17.830999    36 executor.cpp:190] Received SUBSCRIBED event
> I0726 11:59:17.832351    36 executor.cpp:194] Subscribed executor on 
> martin-arm64
> I0726 11:59:17.832775    36 executor.cpp:190] Received LAUNCH event
> I0726 11:59:17.834415    36 executor.cpp:722] Starting task 
> d1bbb266-bee7-4c9d-929f-16aa41f4e9cf
> I0726 11:59:17.839910    36 executor.cpp:740] Forked command at 38
> Preparing rootfs at 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791'
> Changing root to 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791
> Failed to execute 'sh': Exec format error
> I0726 11:59:18.113488    33 executor.cpp:1041] Command exited with status 1 
> (pid: 38)
> ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:: 
> Failure
> Mock function called more times than expected - returning directly.
>     Function call: statusUpdate(0xc28527f0, @0xa2cf3a60 136-byte 
> object <08-05 6C-B6 FF-FF 00-00 00-00 00-00 00-00 00-00 BE-A8 00-00 00-00 
> 00-00 A8-F6 C0-B6 FF-FF 00-00 D0-04 05-94 FF-FF 00-00 A0-E6 04-94 FF-FF 00-00 
> A0-F1 05-94 FF-FF 00-00 60-78 04-94 FF-FF 00-00 ... 00-00 00-00 00-00 00-00 
> 20-BD 01-78 FF-FF 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 20-5D 87-61 A5-3F D8-41 00-00 00-00 02-00 00-00 00-00 00-00 
> 03-00 00-00>)
>          Expected: to be called twice
>            Actual: called 3 times - over-saturated and active
> I0726 11:59:19.117401    37 process.cpp:935] Stopped the socket accept 
> loop{noformat}
>  
> I asked him to provide a gdb traceback and we can see the following:
>  
> {noformat}
> Thread 1 (Thread 0xa3bc2c60 (LWP 173475)):
> #0 0xa518db20 in __libc_open64 (file=0xaaab00f342e0 
> "/tmp/7VXP3w/pipe", oflag=) at 
> ../sysdeps/unix/sysv/linux/open64.c:48
> #1 0xa513adb0 in __GI__IO_file_open (fp=fp@entry=0xaaab00e439a0, 
> filename=, posix_mode=, prot=prot@entry=438, 
> read_write=8, is32not64=) at fileops.c:189
> #2 0xa513b0b0 in _IO_new_file_fopen (fp=fp@entry=0xaaab00e439a0, 
> filename=filename@entry=0xaaab00f342e0 "/tmp/7VXP3w/pipe", mode= out>, mode@entry=0xd762f3c8 "r", is32not64=is32not64@e
> ntry=1) at fileops.c:281 
> #3 0xa512e0dc in __fopen_internal (filename=0xaaab00f342e0 
> "/tmp/7VXP3w/pipe", mode=0xd762f3c8 "r", is32=1) at iofopen.c:75
> #4 0xd54f5350 in os::read (path="/tmp/7VXP3w/pipe") at 
> ../../3rdparty/stout/include/stout/os/read.hpp:136
> #5 0xd74f1c1c in 
> mesos::internal::tests::NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_Test::TestBody
>  (this=0xaaab00f88f50) at ../../src/tests/containeri
> zer/nested_mesos_containerizer_tests.cpp:1126
> {noformat}
>  
>  
> Basically the test uses a named pipe to synchronize with the task being 
> started, and if the task fails to start - in this case because we're trying 
> to launch an x86 container on an arm64 host - the test will just hang reading 
> from the pipe.
> I sent Martin a tentative fix for him to test, and I'll open an MR if 
> successful.





[jira] [Comment Edited] (MESOS-10226) test suite hangs on ARM64

2021-07-29 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390152#comment-17390152
 ] 

Charles Natali edited comment on MESOS-10226 at 7/29/21, 8:44 PM:
--

Hm, I can't reproduce it.

I updated the test to run the arm64 alpine image, to make it fail in a 
similar way to how it should be failing for you, and it's not hanging but 
failing:



{noformat}
# ./bin/mesos-tests.sh 
--gtest_filter=*ProvisionerDockerTest.*ROOT_INTERNET_CURL_SimpleCommand*

[ RUN ] ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/0
sh: 1: hadoop: not found
Marked '/' as rslave
I0729 21:40:16.121507 434157 exec.cpp:164] Version: 1.12.0
I0729 21:40:16.136072 434156 exec.cpp:237] Executor registered on agent 
48863f87-f283-42ab-bd93-f301fdfbd73b-S0
I0729 21:40:16.139089 434154 executor.cpp:190] Received SUBSCRIBED event
I0729 21:40:16.139974 434154 executor.cpp:194] Subscribed executor on thinkpad
I0729 21:40:16.140264 434154 executor.cpp:190] Received LAUNCH event
I0729 21:40:16.141703 434154 executor.cpp:722] Starting task 
1461a266-1ead-4bdf-9165-9c0f6c5938b8
I0729 21:40:16.147071 434154 executor.cpp:740] Forked command at 434163
Preparing rootfs at 
'/tmp/ContainerImage_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_0_GxQGxF/provisioner/containers/77c499a5-6d34-46aa-86a4-e993d53aa56a/backends/overlay/rootfses/629e6501-86d4-447e-bf17-412cd1cb6634'
Changing root to 
/tmp/ContainerImage_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_0_GxQGxF/provisioner/containers/77c499a5-6d34-46aa-86a4-e993d53aa56a/backends/overlay/rootfses/629e6501-86d4-447e-bf17-412cd1cb6634
Failed to execute '/bin/ls': Exec format error
I0729 21:40:16.321754 434155 executor.cpp:1041] Command exited with status 1 
(pid: 434163)
../../src/tests/containerizer/provisioner_docker_tests.cpp:785: Failure
 Expected: TASK_FINISHED
To be equal to: statusFinished->state()
 Which is: TASK_FAILED
I0729 21:40:16.333557 434157 exec.cpp:478] Executor asked to shutdown
I0729 21:40:16.334996 434158 executor.cpp:190] Received SHUTDOWN event
I0729 21:40:16.335037 434158 executor.cpp:843] Shutting down
[ FAILED ] 
ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/0, where 
GetParam() = "arm64v8/alpine" (5851 ms)

{noformat}


 

Could you try running


{noformat}
./bin/mesos-tests.sh 
--gtest_filter=*ProvisionerDockerTest.*ROOT_INTERNET_CURL_SimpleCommand* 
--verbose

{noformat}
 

And see if it hangs, and post the result?

 

Worst case we could just ignore the hang and update the test to use the arm64 
image so it passes, but I'd like to understand why it hangs.


was (Author: cf.natali):
Hm, I can't reproduce it.

I updated the test to run the arm64 alpine image, to make it fail in a 
similar way to how it should be failing for you, and it's not hanging but 
failing:

```

# ./bin/mesos-tests.sh 
--gtest_filter=*ProvisionerDockerTest.*ROOT_INTERNET_CURL_SimpleCommand*

[ RUN ] ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/0
sh: 1: hadoop: not found
Marked '/' as rslave
I0729 21:40:16.121507 434157 exec.cpp:164] Version: 1.12.0
I0729 21:40:16.136072 434156 exec.cpp:237] Executor registered on agent 
48863f87-f283-42ab-bd93-f301fdfbd73b-S0
I0729 21:40:16.139089 434154 executor.cpp:190] Received SUBSCRIBED event
I0729 21:40:16.139974 434154 executor.cpp:194] Subscribed executor on thinkpad
I0729 21:40:16.140264 434154 executor.cpp:190] Received LAUNCH event
I0729 21:40:16.141703 434154 executor.cpp:722] Starting task 
1461a266-1ead-4bdf-9165-9c0f6c5938b8
I0729 21:40:16.147071 434154 executor.cpp:740] Forked command at 434163
Preparing rootfs at 
'/tmp/ContainerImage_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_0_GxQGxF/provisioner/containers/77c499a5-6d34-46aa-86a4-e993d53aa56a/backends/overlay/rootfses/629e6501-86d4-447e-bf17-412cd1cb6634'
Changing root to 
/tmp/ContainerImage_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_0_GxQGxF/provisioner/containers/77c499a5-6d34-46aa-86a4-e993d53aa56a/backends/overlay/rootfses/629e6501-86d4-447e-bf17-412cd1cb6634
Failed to execute '/bin/ls': Exec format error
I0729 21:40:16.321754 434155 executor.cpp:1041] Command exited with status 1 
(pid: 434163)
../../src/tests/containerizer/provisioner_docker_tests.cpp:785: Failure
 Expected: TASK_FINISHED
To be equal to: statusFinished->state()
 Which is: TASK_FAILED
I0729 21:40:16.333557 434157 exec.cpp:478] Executor asked to shutdown
I0729 21:40:16.334996 434158 executor.cpp:190] Received SHUTDOWN event
I0729 21:40:16.335037 434158 executor.cpp:843] Shutting down
[ FAILED ] 
ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/0, where 
GetParam() = "arm64v8/alpine" (5851 ms)

```

 

Could you try running

```

./bin/mesos-tests.sh 
--gtest_filter=*ProvisionerDockerTest.*ROOT_INTERNET_CURL_SimpleCommand* 
--verbose

```

 

And see if it hangs, and 

[jira] [Commented] (MESOS-10226) test suite hangs on ARM64

2021-07-29 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390152#comment-17390152
 ] 

Charles Natali commented on MESOS-10226:


Hm, I can't reproduce it.

I updated the test to run the arm64 alpine image, to make it fail in a 
similar way to how it should be failing for you, and it's not hanging but 
failing:

```

# ./bin/mesos-tests.sh 
--gtest_filter=*ProvisionerDockerTest.*ROOT_INTERNET_CURL_SimpleCommand*

[ RUN ] ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/0
sh: 1: hadoop: not found
Marked '/' as rslave
I0729 21:40:16.121507 434157 exec.cpp:164] Version: 1.12.0
I0729 21:40:16.136072 434156 exec.cpp:237] Executor registered on agent 
48863f87-f283-42ab-bd93-f301fdfbd73b-S0
I0729 21:40:16.139089 434154 executor.cpp:190] Received SUBSCRIBED event
I0729 21:40:16.139974 434154 executor.cpp:194] Subscribed executor on thinkpad
I0729 21:40:16.140264 434154 executor.cpp:190] Received LAUNCH event
I0729 21:40:16.141703 434154 executor.cpp:722] Starting task 
1461a266-1ead-4bdf-9165-9c0f6c5938b8
I0729 21:40:16.147071 434154 executor.cpp:740] Forked command at 434163
Preparing rootfs at 
'/tmp/ContainerImage_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_0_GxQGxF/provisioner/containers/77c499a5-6d34-46aa-86a4-e993d53aa56a/backends/overlay/rootfses/629e6501-86d4-447e-bf17-412cd1cb6634'
Changing root to 
/tmp/ContainerImage_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_0_GxQGxF/provisioner/containers/77c499a5-6d34-46aa-86a4-e993d53aa56a/backends/overlay/rootfses/629e6501-86d4-447e-bf17-412cd1cb6634
Failed to execute '/bin/ls': Exec format error
I0729 21:40:16.321754 434155 executor.cpp:1041] Command exited with status 1 
(pid: 434163)
../../src/tests/containerizer/provisioner_docker_tests.cpp:785: Failure
 Expected: TASK_FINISHED
To be equal to: statusFinished->state()
 Which is: TASK_FAILED
I0729 21:40:16.333557 434157 exec.cpp:478] Executor asked to shutdown
I0729 21:40:16.334996 434158 executor.cpp:190] Received SHUTDOWN event
I0729 21:40:16.335037 434158 executor.cpp:843] Shutting down
[ FAILED ] 
ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/0, where 
GetParam() = "arm64v8/alpine" (5851 ms)

```

 

Could you try running

```

./bin/mesos-tests.sh 
--gtest_filter=*ProvisionerDockerTest.*ROOT_INTERNET_CURL_SimpleCommand* 
--verbose

```

 

And see if it hangs, and post the result?

 

Worst case we could just ignore the hang and update the test to use the arm64 
image so it passes, but I'd like to understand why it hangs.

> test suite hangs on ARM64
> -
>
> Key: MESOS-10226
> URL: https://issues.apache.org/jira/browse/MESOS-10226
> Project: Mesos
>  Issue Type: Bug
>Reporter: Charles Natali
>Assignee: Charles Natali
>Priority: Major
> Attachments: gdb-thread-apply-bt-all-29.07.2021-2.txt, 
> gdb-thread-apply-bt-all-29.07.2021.txt
>
>
> Reported by [~mgrigorov].
>  
> {noformat}
> [ RUN      ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace
> sh: 1: hadoop: not found
> Marked '/' as rslave
> I0726 11:59:17.812630    32 exec.cpp:164] Version: 1.12.0
> I0726 11:59:17.827512    31 exec.cpp:237] Executor registered on agent 
> 9076f44b-846d-4f00-a2dc-11f694cc1900-S0
> I0726 11:59:17.830999    36 executor.cpp:190] Received SUBSCRIBED event
> I0726 11:59:17.832351    36 executor.cpp:194] Subscribed executor on 
> martin-arm64
> I0726 11:59:17.832775    36 executor.cpp:190] Received LAUNCH event
> I0726 11:59:17.834415    36 executor.cpp:722] Starting task 
> d1bbb266-bee7-4c9d-929f-16aa41f4e9cf
> I0726 11:59:17.839910    36 executor.cpp:740] Forked command at 38
> Preparing rootfs at 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791'
> Changing root to 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791
> Failed to execute 'sh': Exec format error
> I0726 11:59:18.113488    33 executor.cpp:1041] Command exited with status 1 
> (pid: 38)
> ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:: 
> Failure
> Mock function called more times than expected - returning directly.
>     Function call: statusUpdate(0xc28527f0, @0xa2cf3a60 136-byte 
> object <08-05 6C-B6 FF-FF 00-00 00-00 00-00 00-00 00-00 BE-A8 00-00 00-00 
> 00-00 A8-F6 C0-B6 FF-FF 00-00 D0-04 05-94 FF-FF 00-00 A0-E6 04-94 FF-FF 00-00 
> A0-F1 05-94 FF-FF 00-00 60-78 04-94 FF-FF 00-00 ... 00-00 00-00 00-00 00-00 
> 20-BD 01-78 FF-FF 

[jira] [Comment Edited] (MESOS-10226) test suite hangs on ARM64

2021-07-29 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390058#comment-17390058
 ] 

Charles Natali edited comment on MESOS-10226 at 7/29/21, 6:09 PM:
--

[~mgrigorov] Looking at the code corresponding to the backtrace, I don't think 
it should hang forever, but only for up to 10 minutes:

 
{noformat}
#13 0xb7ca1418 in AwaitAssertReady 
(expr=0xba1c1d58 "statusStarting", actual=..., duration=...) at 
../../3rdparty/libprocess/include/process/gtest.hpp:126
#14 0xb97c588c in 
mesos::internal::tests::ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_Test::TestBody
 (this=0xcd4207a0) at 
../../src/tests/containerizer/provisioner_docker_tests.cpp:782
{noformat}
 

 
{noformat}
AWAIT_READY_FOR(statusStarting, Minutes(10));
{noformat}
 

Are you sure it was stuck indefinitely and not just taking a long time?

 

Also, it would help to have the output of running the tests with {{--verbose}}.


was (Author: cf.natali):
[~mgrigorov] Looking at the code corresponding to the backtrace, I don't think 
it should hang forever, but only for up to 10 minutes:

 
{noformat}
#13 0xb7ca1418 in AwaitAssertReady 
(expr=0xba1c1d58 "statusStarting", actual=..., duration=...) at 
../../3rdparty/libprocess/include/process/gtest.hpp:126
#14 0xb97c588c in 
mesos::internal::tests::ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_Test::TestBody
 (this=0xcd4207a0) at 
../../src/tests/containerizer/provisioner_docker_tests.cpp:782
{noformat}
 

 
{noformat}
AWAIT_READY_FOR(statusStarting, Minutes(10));
{noformat}
 

Are you sure it was stuck indefinitely and not just taking a long time?

> test suite hangs on ARM64
> -
>
> Key: MESOS-10226
> URL: https://issues.apache.org/jira/browse/MESOS-10226
> Project: Mesos
>  Issue Type: Bug
>Reporter: Charles Natali
>Assignee: Charles Natali
>Priority: Major
> Attachments: gdb-thread-apply-bt-all-29.07.2021.txt
>
>
> Reported by [~mgrigorov].
>  
> {noformat}
> [ RUN      ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace
> sh: 1: hadoop: not found
> Marked '/' as rslave
> I0726 11:59:17.812630    32 exec.cpp:164] Version: 1.12.0
> I0726 11:59:17.827512    31 exec.cpp:237] Executor registered on agent 
> 9076f44b-846d-4f00-a2dc-11f694cc1900-S0
> I0726 11:59:17.830999    36 executor.cpp:190] Received SUBSCRIBED event
> I0726 11:59:17.832351    36 executor.cpp:194] Subscribed executor on 
> martin-arm64
> I0726 11:59:17.832775    36 executor.cpp:190] Received LAUNCH event
> I0726 11:59:17.834415    36 executor.cpp:722] Starting task 
> d1bbb266-bee7-4c9d-929f-16aa41f4e9cf
> I0726 11:59:17.839910    36 executor.cpp:740] Forked command at 38
> Preparing rootfs at 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791'
> Changing root to 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791
> Failed to execute 'sh': Exec format error
> I0726 11:59:18.113488    33 executor.cpp:1041] Command exited with status 1 
> (pid: 38)
> ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:: 
> Failure
> Mock function called more times than expected - returning directly.
>     Function call: statusUpdate(0xc28527f0, @0xa2cf3a60 136-byte 
> object <08-05 6C-B6 FF-FF 00-00 00-00 00-00 00-00 00-00 BE-A8 00-00 00-00 
> 00-00 A8-F6 C0-B6 FF-FF 00-00 D0-04 05-94 FF-FF 00-00 A0-E6 04-94 FF-FF 00-00 
> A0-F1 05-94 FF-FF 00-00 60-78 04-94 FF-FF 00-00 ... 00-00 00-00 00-00 00-00 
> 20-BD 01-78 FF-FF 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 20-5D 87-61 A5-3F D8-41 00-00 00-00 02-00 00-00 00-00 00-00 
> 03-00 00-00>)
>          Expected: to be called twice
>            Actual: called 3 times - over-saturated and active
> I0726 11:59:19.117401    37 process.cpp:935] Stopped the socket accept 
> loop{noformat}
>  
> I asked him to provide a gdb traceback and we can see the following:
>  
> {noformat}
> Thread 1 (Thread 0xa3bc2c60 (LWP 173475)):
> #0 0xa518db20 in __libc_open64 (file=0xaaab00f342e0 
> "/tmp/7VXP3w/pipe", oflag=) at 
> ../sysdeps/unix/sysv/linux/open64.c:48
> #1 0xa513adb0 in __GI__IO_file_open (fp=fp@entry=0xaaab00e439a0, 
> filename=, posix_mode=, prot=prot@entry=438, 
> read_write=8, is32not64=) at fileops.c:189
> #2 0xa513b0b0 in _IO_new_file_fopen (fp=fp@entry=0xaaab00e439a0, 
> filename=filename@entry=0xaaab00f342e0 

[jira] [Commented] (MESOS-10226) test suite hangs on ARM64

2021-07-29 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390058#comment-17390058
 ] 

Charles Natali commented on MESOS-10226:


[~mgrigorov] Looking at the code corresponding to the backtrace, I don't think 
it should hang forever, but only for up to 10 minutes:

 
{noformat}
#13 0xb7ca1418 in AwaitAssertReady 
(expr=0xba1c1d58 "statusStarting", actual=..., duration=...) at 
../../3rdparty/libprocess/include/process/gtest.hpp:126
#14 0xb97c588c in 
mesos::internal::tests::ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_Test::TestBody
 (this=0xcd4207a0) at 
../../src/tests/containerizer/provisioner_docker_tests.cpp:782
{noformat}
 

 
{noformat}
AWAIT_READY_FOR(statusStarting, Minutes(10));
{noformat}
 

Are you sure it was stuck indefinitely and not just taking a long time?

> test suite hangs on ARM64
> -
>
> Key: MESOS-10226
> URL: https://issues.apache.org/jira/browse/MESOS-10226
> Project: Mesos
>  Issue Type: Bug
>Reporter: Charles Natali
>Assignee: Charles Natali
>Priority: Major
> Attachments: gdb-thread-apply-bt-all-29.07.2021.txt
>
>
> Reported by [~mgrigorov].
>  
> {noformat}
> [ RUN      ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace
> sh: 1: hadoop: not found
> Marked '/' as rslave
> I0726 11:59:17.812630    32 exec.cpp:164] Version: 1.12.0
> I0726 11:59:17.827512    31 exec.cpp:237] Executor registered on agent 
> 9076f44b-846d-4f00-a2dc-11f694cc1900-S0
> I0726 11:59:17.830999    36 executor.cpp:190] Received SUBSCRIBED event
> I0726 11:59:17.832351    36 executor.cpp:194] Subscribed executor on 
> martin-arm64
> I0726 11:59:17.832775    36 executor.cpp:190] Received LAUNCH event
> I0726 11:59:17.834415    36 executor.cpp:722] Starting task 
> d1bbb266-bee7-4c9d-929f-16aa41f4e9cf
> I0726 11:59:17.839910    36 executor.cpp:740] Forked command at 38
> Preparing rootfs at 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791'
> Changing root to 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791
> Failed to execute 'sh': Exec format error
> I0726 11:59:18.113488    33 executor.cpp:1041] Command exited with status 1 
> (pid: 38)
> ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:: 
> Failure
> Mock function called more times than expected - returning directly.
>     Function call: statusUpdate(0xc28527f0, @0xa2cf3a60 136-byte 
> object <08-05 6C-B6 FF-FF 00-00 00-00 00-00 00-00 00-00 BE-A8 00-00 00-00 
> 00-00 A8-F6 C0-B6 FF-FF 00-00 D0-04 05-94 FF-FF 00-00 A0-E6 04-94 FF-FF 00-00 
> A0-F1 05-94 FF-FF 00-00 60-78 04-94 FF-FF 00-00 ... 00-00 00-00 00-00 00-00 
> 20-BD 01-78 FF-FF 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 20-5D 87-61 A5-3F D8-41 00-00 00-00 02-00 00-00 00-00 00-00 
> 03-00 00-00>)
>          Expected: to be called twice
>            Actual: called 3 times - over-saturated and active
> I0726 11:59:19.117401    37 process.cpp:935] Stopped the socket accept 
> loop{noformat}
>  
> I asked him to provide a gdb traceback and we can see the following:
>  
> {noformat}
> Thread 1 (Thread 0xa3bc2c60 (LWP 173475)):
> #0 0xa518db20 in __libc_open64 (file=0xaaab00f342e0 
> "/tmp/7VXP3w/pipe", oflag=) at 
> ../sysdeps/unix/sysv/linux/open64.c:48
> #1 0xa513adb0 in __GI__IO_file_open (fp=fp@entry=0xaaab00e439a0, 
> filename=, posix_mode=, prot=prot@entry=438, 
> read_write=8, is32not64=) at fileops.c:189
> #2 0xa513b0b0 in _IO_new_file_fopen (fp=fp@entry=0xaaab00e439a0, 
> filename=filename@entry=0xaaab00f342e0 "/tmp/7VXP3w/pipe", mode= out>, mode@entry=0xd762f3c8 "r", is32not64=is32not64@e
> ntry=1) at fileops.c:281 
> #3 0xa512e0dc in __fopen_internal (filename=0xaaab00f342e0 
> "/tmp/7VXP3w/pipe", mode=0xd762f3c8 "r", is32=1) at iofopen.c:75
> #4 0xd54f5350 in os::read (path="/tmp/7VXP3w/pipe") at 
> ../../3rdparty/stout/include/stout/os/read.hpp:136
> #5 0xd74f1c1c in 
> mesos::internal::tests::NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_Test::TestBody
>  (this=0xaaab00f88f50) at ../../src/tests/containeri
> zer/nested_mesos_containerizer_tests.cpp:1126
> {noformat}
>  
>  
> Basically the test uses a named pipe to synchronize with the task being 
> started, and if the task fails to start - in this case because we're trying 
> to launch an x86 container on an arm64 host - the test will just hang reading 
> from the pipe.

[jira] [Commented] (MESOS-10226) test suite hangs on ARM64

2021-07-29 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390055#comment-17390055
 ] 

Charles Natali commented on MESOS-10226:


Thanks, I'll have a look - I hope there won't be too many hanging tests...

> test suite hangs on ARM64
> -
>
> Key: MESOS-10226
> URL: https://issues.apache.org/jira/browse/MESOS-10226
> Project: Mesos
>  Issue Type: Bug
>Reporter: Charles Natali
>Assignee: Charles Natali
>Priority: Major
> Attachments: gdb-thread-apply-bt-all-29.07.2021.txt
>
>
> Reported by [~mgrigorov].
>  
> {noformat}
> [ RUN      ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace
> sh: 1: hadoop: not found
> Marked '/' as rslave
> I0726 11:59:17.812630    32 exec.cpp:164] Version: 1.12.0
> I0726 11:59:17.827512    31 exec.cpp:237] Executor registered on agent 
> 9076f44b-846d-4f00-a2dc-11f694cc1900-S0
> I0726 11:59:17.830999    36 executor.cpp:190] Received SUBSCRIBED event
> I0726 11:59:17.832351    36 executor.cpp:194] Subscribed executor on 
> martin-arm64
> I0726 11:59:17.832775    36 executor.cpp:190] Received LAUNCH event
> I0726 11:59:17.834415    36 executor.cpp:722] Starting task 
> d1bbb266-bee7-4c9d-929f-16aa41f4e9cf
> I0726 11:59:17.839910    36 executor.cpp:740] Forked command at 38
> Preparing rootfs at 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791'
> Changing root to 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791
> Failed to execute 'sh': Exec format error
> I0726 11:59:18.113488    33 executor.cpp:1041] Command exited with status 1 
> (pid: 38)
> ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:: 
> Failure
> Mock function called more times than expected - returning directly.
>     Function call: statusUpdate(0xc28527f0, @0xa2cf3a60 136-byte 
> object <08-05 6C-B6 FF-FF 00-00 00-00 00-00 00-00 00-00 BE-A8 00-00 00-00 
> 00-00 A8-F6 C0-B6 FF-FF 00-00 D0-04 05-94 FF-FF 00-00 A0-E6 04-94 FF-FF 00-00 
> A0-F1 05-94 FF-FF 00-00 60-78 04-94 FF-FF 00-00 ... 00-00 00-00 00-00 00-00 
> 20-BD 01-78 FF-FF 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 20-5D 87-61 A5-3F D8-41 00-00 00-00 02-00 00-00 00-00 00-00 
> 03-00 00-00>)
>          Expected: to be called twice
>            Actual: called 3 times - over-saturated and active
> I0726 11:59:19.117401    37 process.cpp:935] Stopped the socket accept 
> loop{noformat}
>  
> I asked him to provide a gdb traceback and we can see the following:
>  
> {noformat}
> Thread 1 (Thread 0xa3bc2c60 (LWP 173475)):
> #0 0xa518db20 in __libc_open64 (file=0xaaab00f342e0 
> "/tmp/7VXP3w/pipe", oflag=) at 
> ../sysdeps/unix/sysv/linux/open64.c:48
> #1 0xa513adb0 in __GI__IO_file_open (fp=fp@entry=0xaaab00e439a0, 
> filename=<optimized out>, posix_mode=<optimized out>, prot=prot@entry=438, 
> read_write=8, is32not64=<optimized out>) at fileops.c:189
> #2 0xa513b0b0 in _IO_new_file_fopen (fp=fp@entry=0xaaab00e439a0, 
> filename=filename@entry=0xaaab00f342e0 "/tmp/7VXP3w/pipe", mode=<optimized out>, mode@entry=0xd762f3c8 "r", is32not64=is32not64@e
> ntry=1) at fileops.c:281 
> #3 0xa512e0dc in __fopen_internal (filename=0xaaab00f342e0 
> "/tmp/7VXP3w/pipe", mode=0xd762f3c8 "r", is32=1) at iofopen.c:75
> #4 0xd54f5350 in os::read (path="/tmp/7VXP3w/pipe") at 
> ../../3rdparty/stout/include/stout/os/read.hpp:136
> #5 0xd74f1c1c in 
> mesos::internal::tests::NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_Test::TestBody
>  (this=0xaaab00f88f50) at ../../src/tests/containeri
> zer/nested_mesos_containerizer_tests.cpp:1126
> {noformat}
>  
>  
> Basically the test uses a named pipe to synchronize with the task being 
> started, and if the task fails to start - in this case because we're trying 
> to launch an x86 container on an arm64 host - the test will just hang reading 
> from the pipe.
> I sent Martin a tentative fix for him to test, and I'll open an MR if 
> successful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10226) test suite hangs on ARM64

2021-07-28 Thread Charles Natali (Jira)
Charles Natali created MESOS-10226:
--

 Summary: test suite hangs on ARM64
 Key: MESOS-10226
 URL: https://issues.apache.org/jira/browse/MESOS-10226
 Project: Mesos
  Issue Type: Bug
Reporter: Charles Natali
Assignee: Charles Natali


Reported by [~mgrigorov].

 
{noformat}
[ RUN      ] 
NestedMesosContainerizerTest.ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace
sh: 1: hadoop: not found
Marked '/' as rslave
I0726 11:59:17.812630    32 exec.cpp:164] Version: 1.12.0
I0726 11:59:17.827512    31 exec.cpp:237] Executor registered on agent 
9076f44b-846d-4f00-a2dc-11f694cc1900-S0
I0726 11:59:17.830999    36 executor.cpp:190] Received SUBSCRIBED event
I0726 11:59:17.832351    36 executor.cpp:194] Subscribed executor on 
martin-arm64
I0726 11:59:17.832775    36 executor.cpp:190] Received LAUNCH event
I0726 11:59:17.834415    36 executor.cpp:722] Starting task 
d1bbb266-bee7-4c9d-929f-16aa41f4e9cf
I0726 11:59:17.839910    36 executor.cpp:740] Forked command at 38
Preparing rootfs at 
'/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791'
Changing root to 
/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791
Failed to execute 'sh': Exec format error
I0726 11:59:18.113488    33 executor.cpp:1041] Command exited with status 1 
(pid: 38)
../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:: Failure
Mock function called more times than expected - returning directly.
    Function call: statusUpdate(0xc28527f0, @0xa2cf3a60 136-byte object 
<08-05 6C-B6 FF-FF 00-00 00-00 00-00 00-00 00-00 BE-A8 00-00 00-00 00-00 A8-F6 
C0-B6 FF-FF 00-00 D0-04 05-94 FF-FF 00-00 A0-E6 04-94 FF-FF 00-00 A0-F1 05-94 
FF-FF 00-00 60-78 04-94 FF-FF 00-00 ... 00-00 00-00 00-00 00-00 20-BD 01-78 
FF-FF 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
00-00 20-5D 87-61 A5-3F D8-41 00-00 00-00 02-00 00-00 00-00 00-00 03-00 00-00>)
         Expected: to be called twice
           Actual: called 3 times - over-saturated and active
I0726 11:59:19.117401    37 process.cpp:935] Stopped the socket accept 
loop{noformat}
 

I asked him to provide a gdb traceback and we can see the following:

 
{noformat}


Thread 1 (Thread 0xa3bc2c60 (LWP 173475)):
#0 0xa518db20 in __libc_open64 (file=0xaaab00f342e0 "/tmp/7VXP3w/pipe", 
oflag=<optimized out>) at ../sysdeps/unix/sysv/linux/open64.c:48
#1 0xa513adb0 in __GI__IO_file_open (fp=fp@entry=0xaaab00e439a0, 
filename=<optimized out>, posix_mode=<optimized out>, prot=prot@entry=438, 
read_write=8, is32not64=<optimized out>) at fileops.c:189
#2 0xa513b0b0 in _IO_new_file_fopen (fp=fp@entry=0xaaab00e439a0, 
filename=filename@entry=0xaaab00f342e0 "/tmp/7VXP3w/pipe", mode=<optimized out>, mode@entry=0xd762f3c8 "r", is32not64=is32not64@e
ntry=1) at fileops.c:281 
#3 0xa512e0dc in __fopen_internal (filename=0xaaab00f342e0 
"/tmp/7VXP3w/pipe", mode=0xd762f3c8 "r", is32=1) at iofopen.c:75
#4 0xd54f5350 in os::read (path="/tmp/7VXP3w/pipe") at 
../../3rdparty/stout/include/stout/os/read.hpp:136
#5 0xd74f1c1c in 
mesos::internal::tests::NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_Test::TestBody
 (this=0xaaab00f88f50) at ../../src/tests/containeri
zer/nested_mesos_containerizer_tests.cpp:1126
{noformat}
 

 

Basically the test uses a named pipe to synchronize with the task being 
started, and if the task fails to start - in this case because we're trying to 
launch an x86 container on an arm64 host - the test will just hang reading from 
the pipe.

I sent Martin a tentative fix for him to test, and I'll open an MR if 
successful.
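
For the curious: the hang happens because opening a FIFO read-only blocks 
until a writer opens the other end, which never happens when the task fails 
to exec. A minimal sketch of the idea behind the fix, with illustrative names 
(this is not the actual Mesos patch):

{code:c++}
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <chrono>
#include <iostream>
#include <string>
#include <thread>

// Read the first chunk from a FIFO, giving up after 'deadline' instead of
// blocking forever like fopen(path, "r") does when no writer ever appears.
std::string readFifoWithDeadline(
    const std::string& path, std::chrono::milliseconds deadline)
{
  // O_NONBLOCK: open() returns immediately even if no writer exists yet.
  int fd = ::open(path.c_str(), O_RDONLY | O_NONBLOCK);
  if (fd < 0) {
    return "";
  }

  const auto end = std::chrono::steady_clock::now() + deadline;
  std::string result;

  while (std::chrono::steady_clock::now() < end) {
    char buf[256];
    const ssize_t n = ::read(fd, buf, sizeof(buf));
    if (n > 0) {
      result.assign(buf, n);
      break;
    }
    // n == 0: no writer connected (yet); n < 0 with EAGAIN: a writer is
    // connected but hasn't written anything. Either way, retry until the
    // deadline expires.
    std::this_thread::sleep_for(std::chrono::milliseconds(50));
  }

  ::close(fd);
  return result;
}

int main()
{
  ::mkfifo("/tmp/demo_pipe", 0644);

  // With no writer, this returns empty after ~2s instead of hanging forever.
  std::cout << "read: '"
            << readFifoWithDeadline(
                   "/tmp/demo_pipe", std::chrono::milliseconds(2000))
            << "'" << std::endl;

  ::unlink("/tmp/demo_pipe");
  return 0;
}
{code}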



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9352) Data in persistent volume deleted accidentally when using Docker container and Persistent volume

2021-07-20 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17384479#comment-17384479
 ] 

Charles Natali commented on MESOS-9352:
---

If it's fixed feel free to close!

> Data in persistent volume deleted accidentally when using Docker container 
> and Persistent volume
> 
>
> Key: MESOS-9352
> URL: https://issues.apache.org/jira/browse/MESOS-9352
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, containerization, docker
>Affects Versions: 1.5.1, 1.5.2
> Environment: DCOS 1.11.6
> Mesos 1.5.2
>Reporter: David Ko
>Assignee: Joseph Wu
>Priority: Critical
>  Labels: dcos, dcos-1.11.6, mesosphere, persistent-volumes
> Attachments: image-2018-10-24-22-20-51-059.png, 
> image-2018-10-24-22-21-13-399.png
>
>
> Using a Docker image with a persistent volume to start a service can cause 
> data in the persistent volume to be deleted accidentally when the task is 
> killed and restarted; old mount points are also left mounted, even after the 
> service has been deleted. 
> *The expected result is that data in the persistent volume is kept until the 
> task is deleted completely, and that dangling mount points are unmounted 
> correctly.*
>  
> *Step 1:* Use the JSON config below to create a MySQL server using a Docker 
> image and a persistent volume
> {code:javascript}
> {
>   "env": {
> "MYSQL_USER": "wordpress",
> "MYSQL_PASSWORD": "secret",
> "MYSQL_ROOT_PASSWORD": "supersecret",
> "MYSQL_DATABASE": "wordpress"
>   },
>   "id": "/mysqlgc",
>   "backoffFactor": 1.15,
>   "backoffSeconds": 1,
>   "constraints": [
> [
>   "hostname",
>   "IS",
>   "172.27.12.216"
> ]
>   ],
>   "container": {
> "portMappings": [
>   {
> "containerPort": 3306,
> "hostPort": 0,
> "protocol": "tcp",
> "servicePort": 1
>   }
> ],
> "type": "DOCKER",
> "volumes": [
>   {
> "persistent": {
>   "type": "root",
>   "size": 1000,
>   "constraints": []
> },
> "mode": "RW",
> "containerPath": "mysqldata"
>   },
>   {
> "containerPath": "/var/lib/mysql",
> "hostPath": "mysqldata",
> "mode": "RW"
>   }
> ],
> "docker": {
>   "image": "mysql",
>   "forcePullImage": false,
>   "privileged": false,
>   "parameters": []
> }
>   },
>   "cpus": 1,
>   "disk": 0,
>   "instances": 1,
>   "maxLaunchDelaySeconds": 3600,
>   "mem": 512,
>   "gpus": 0,
>   "networks": [
> {
>   "mode": "container/bridge"
> }
>   ],
>   "residency": {
> "relaunchEscalationTimeoutSeconds": 3600,
> "taskLostBehavior": "WAIT_FOREVER"
>   },
>   "requirePorts": false,
>   "upgradeStrategy": {
> "maximumOverCapacity": 0,
> "minimumHealthCapacity": 0
>   },
>   "killSelection": "YOUNGEST_FIRST",
>   "unreachableStrategy": "disabled",
>   "healthChecks": [],
>   "fetch": []
> }
> {code}
> *Step 2:* Kill the mysqld process to force rescheduling of a new MySQL task; 
> there are now 2 mount points to the same persistent volume, which means the 
> old mount point was not unmounted immediately.
> !image-2018-10-24-22-20-51-059.png!
> *Step 3:* After GC, data in the persistent volume was deleted accidentally, 
> but mysqld (the Mesos task) is still running
> !image-2018-10-24-22-21-13-399.png!
> *Step 4:* Delete the MySQL service from Marathon; none of the mount points 
> can be unmounted, even though the service has been deleted.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10223) Crashes on ARM64 due to bad interaction of libunwind with libgcc.

2021-07-01 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17372475#comment-17372475
 ] 

Charles Natali commented on MESOS-10223:


It must be a different issue then.

 

Could you run

 
{noformat}
# ./bin/mesos-tests.sh --verbose > mesos-tests.log 2>&1{noformat}
And post the result?

> Crashes on ARM64 due to bad interaction of libunwind with libgcc. 
> --
>
> Key: MESOS-10223
> URL: https://issues.apache.org/jira/browse/MESOS-10223
> Project: Mesos
>  Issue Type: Bug
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Charles Natali
>Priority: Major
> Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, 
> mesos-on-arm64.tgz, sudo_make_check_output.txt
>
>
> Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors:
>  
> {code:java}
>  [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.NumberFormat
> [   OK ] JsonTest.NumberFormat (0 ms)
> [ RUN  ] JsonTest.Find
> terminate called after throwing an instance of 
> 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::bad_lexical_cast> >'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from 
> PID 2317; stack trace: ***
> @ 0xa80e77fc ([vdso]+0x7fb)
> @ 0xa7b71188 gsignal
> @ 0xa7b5ddac abort
> @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d715b0 __cxa_rethrow
> @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d71544 __cxa_throw
> @ 0xab4ee114 boost::throw_exception<>()
> @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>()
> @ 0xab5c2228 boost::lexical_cast<>()
> @ 0xab5bf89c numify<>()
> @ 0xab5e00e8 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5cdd2c JsonTest_Find_Test::TestBody()
> @ 0xab886fec 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87f1d4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab85a9d0 testing::Test::Run()
> @ 0xab85b258 testing::TestInfo::Run()
> @ 0xab85b8d0 testing::TestCase::Run()
> @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests()
> @ 0xab888440 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87ffd4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab86100c testing::UnitTest::Run()
> @ 0xab630950 RUN_ALL_TESTS()
> @ 0xab630418 main
> @ 0xa7b5e110 __libc_start_main
> @ 0xab4b41d4 (unknown)
> [FAIL]: 8 shard(s) have failed tests
> make[6]: *** [Makefile:2092: check-local] Error 8
> make[6]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[5]: *** [Makefile:1840: check-am] Error 2
> make[5]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[4]: *** [Makefile:1685: check-recursive] Error 1
> make[4]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[3]: *** [Makefile:1842: check] Error 2
> make[3]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[2]: *** [Makefile:1153: check-recursive] Error 1
> make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make[1]: *** [Makefile:1306: check] Error 2
> make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make: *** [Makefile:785: check-recursive] Error 1
> {code}
>  
> {code:java}
> [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.InvalidUTF8
> [   OK ] JsonTest.InvalidUTF8 (0 ms)
> [ RUN  ] JsonTest.ParseError
> terminate called after throwing an instance of 'std::overflow_error'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from 
> PID 2316; stack trace: ***
> @ 0x918dd7fc ([vdso]+0x7fb)
> @ 0x91367188 gsignal
> @ 0x91353dac abort
> @ 0x91569848 __gnu_cxx::__verbose_terminate_handler()
> @ 0x915671ec (unknown)
> @ 0x91567250 std::terminate()
> @ 0x915675b0 __cxa_rethrow
> @ 

[jira] [Commented] (MESOS-10223) Crashes on ARM64 due to bad interaction of libunwind with libgcc.

2021-06-29 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17371617#comment-17371617
 ] 

Charles Natali commented on MESOS-10223:


[~mgrigorov]

The hang should be fixed in master - it'd be great if you could give it a try.


> Crashes on ARM64 due to bad interaction of libunwind with libgcc. 
> --
>
> Key: MESOS-10223
> URL: https://issues.apache.org/jira/browse/MESOS-10223
> Project: Mesos
>  Issue Type: Bug
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Charles Natali
>Priority: Major
> Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, 
> mesos-on-arm64.tgz, sudo_make_check_output.txt
>
>
> Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors:
>  
> {code:java}
>  [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.NumberFormat
> [   OK ] JsonTest.NumberFormat (0 ms)
> [ RUN  ] JsonTest.Find
> terminate called after throwing an instance of 
> 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::bad_lexical_cast> >'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from 
> PID 2317; stack trace: ***
> @ 0xa80e77fc ([vdso]+0x7fb)
> @ 0xa7b71188 gsignal
> @ 0xa7b5ddac abort
> @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d715b0 __cxa_rethrow
> @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d71544 __cxa_throw
> @ 0xab4ee114 boost::throw_exception<>()
> @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>()
> @ 0xab5c2228 boost::lexical_cast<>()
> @ 0xab5bf89c numify<>()
> @ 0xab5e00e8 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5cdd2c JsonTest_Find_Test::TestBody()
> @ 0xab886fec 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87f1d4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab85a9d0 testing::Test::Run()
> @ 0xab85b258 testing::TestInfo::Run()
> @ 0xab85b8d0 testing::TestCase::Run()
> @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests()
> @ 0xab888440 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87ffd4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab86100c testing::UnitTest::Run()
> @ 0xab630950 RUN_ALL_TESTS()
> @ 0xab630418 main
> @ 0xa7b5e110 __libc_start_main
> @ 0xab4b41d4 (unknown)
> [FAIL]: 8 shard(s) have failed tests
> make[6]: *** [Makefile:2092: check-local] Error 8
> make[6]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[5]: *** [Makefile:1840: check-am] Error 2
> make[5]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[4]: *** [Makefile:1685: check-recursive] Error 1
> make[4]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[3]: *** [Makefile:1842: check] Error 2
> make[3]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[2]: *** [Makefile:1153: check-recursive] Error 1
> make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make[1]: *** [Makefile:1306: check] Error 2
> make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make: *** [Makefile:785: check-recursive] Error 1
> {code}
>  
> {code:java}
> [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.InvalidUTF8
> [   OK ] JsonTest.InvalidUTF8 (0 ms)
> [ RUN  ] JsonTest.ParseError
> terminate called after throwing an instance of 'std::overflow_error'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from 
> PID 2316; stack trace: ***
> @ 0x918dd7fc ([vdso]+0x7fb)
> @ 0x91367188 gsignal
> @ 0x91353dac abort
> @ 0x91569848 __gnu_cxx::__verbose_terminate_handler()
> @ 0x915671ec (unknown)
> @ 0x91567250 std::terminate()
> @ 0x915675b0 __cxa_rethrow
> @ 0x915697e4 __gnu_cxx::__verbose_terminate_handler()
> @ 

[jira] [Commented] (MESOS-10225) mention that systemd agent unit should have Delegate=yes

2021-06-28 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370808#comment-17370808
 ] 

Charles Natali commented on MESOS-10225:


Good question - I think having a dedicated section might be better, maybe 
"Interaction with systemd" or something like that?

> mention that systemd agent unit should have Delegate=yes
> 
>
> Key: MESOS-10225
> URL: https://issues.apache.org/jira/browse/MESOS-10225
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Charles Natali
>Assignee: Andreas Peters
>Priority: Major
>
> If managed by systemd, the agent unit should have 
> [Delegate=yes|https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#Delegate=]
>  to prevent systemd from manipulating cgroups created by the agent, which can 
> break things quite badly.
> See for example https://issues.apache.org/jira/browse/MESOS-3488 and 
> https://issues.apache.org/jira/browse/MESOS-3009 for the kind of problems it 
> causes.
> I think it's quite important and should feature prominently in the 
> documentation, maybe in the agent configuration page 
> [http://mesos.apache.org/documentation/latest/configuration/agent/] ?
>  
> [~surahman] or [~apeters] if either one of you wants to have a look at it, I 
> think it's important that at least someone is familiar with the documentation 
> part.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10225) mention that systemd agent unit should have Delegate=yes

2021-06-27 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10225?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17370208#comment-17370208
 ] 

Charles Natali commented on MESOS-10225:


Thanks Andreas, that'd be great - hopefully it will avoid some surprises for users.

> mention that systemd agent unit should have Delegate=yes
> 
>
> Key: MESOS-10225
> URL: https://issues.apache.org/jira/browse/MESOS-10225
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Charles Natali
>Assignee: Andreas Peters
>Priority: Major
>
> If managed by systemd, the agent unit should have 
> [Delegate=yes|https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#Delegate=]
>  to prevent systemd from manipulating cgroups created by the agent, which can 
> break things quite badly.
> See for example https://issues.apache.org/jira/browse/MESOS-3488 and 
> https://issues.apache.org/jira/browse/MESOS-3009 for the kind of problems it 
> causes.
> I think it's quite important and should feature prominently in the 
> documentation, maybe in the agent configuration page 
> [http://mesos.apache.org/documentation/latest/configuration/agent/] ?
>  
> [~surahman] or [~apeters] if either one of you wants to have a look at it, I 
> think it's important that at least someone is familiar with the documentation 
> part.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (MESOS-10225) mention that systemd agent unit should have Delegate=yes

2021-06-25 Thread Charles Natali (Jira)
Charles Natali created MESOS-10225:
--

 Summary: mention that systemd agent unit should have Delegate=yes
 Key: MESOS-10225
 URL: https://issues.apache.org/jira/browse/MESOS-10225
 Project: Mesos
  Issue Type: Documentation
Reporter: Charles Natali


If managed by systemd, the agent unit should have 
[Delegate=yes|https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#Delegate=]
 to prevent systemd from manipulating cgroups created by the agent, which can 
break things quite badly.

See for example https://issues.apache.org/jira/browse/MESOS-3488 and 
https://issues.apache.org/jira/browse/MESOS-3009 for the kind of problems it 
causes.


I think it's quite important and should feature prominently in the 
documentation, maybe in the agent configuration page 
[http://mesos.apache.org/documentation/latest/configuration/agent/] ?

 

[~surahman] or [~apeters] if either one of you wants to have a look at it, I 
think it's important that at least someone is familiar with the documentation 
part.
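
For anyone hitting this before the docs are updated, the unit change itself is 
one line (the paths below are illustrative):

{noformat}
# /etc/systemd/system/mesos-agent.service (illustrative path)
[Service]
ExecStart=/usr/sbin/mesos-agent --work_dir=/var/lib/mesos
# Tell systemd this service manages its own cgroup sub-hierarchy, so it
# won't move processes around or delete cgroups created by the agent.
Delegate=yes
{noformat}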

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10129) Build fails on Maven javadoc generation when using JDK11

2021-06-24 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369024#comment-17369024
 ] 

Charles Natali commented on MESOS-10129:


Hey [~csaltos], sorry for the long delay.

 

[~surahman] could you maybe have a look at this?

I usually build with {{--disable-java}} but since [~csaltos] provided an MR and 
the fix is one-line, it'd be good to merge.

Maybe just try and reproduce the problem on your machine?

> Build fails on Maven javadoc generation when using JDK11
> 
>
> Key: MESOS-10129
> URL: https://issues.apache.org/jira/browse/MESOS-10129
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: master, 1.10.0
> Environment: Debian 10 Buster (2020-04-29) with OpenJdk 11.0.7 
> (2020-04-14)
>Reporter: Carlos Saltos
>Priority: Major
>  Labels: Java11, beginner, build, java11, jdk11
> Attachments: mesos.10.0.maven.javadoc.fix.patch
>
>
> h3. CURRENT BEHAVIOR:
> When using Java 11 (or newer versions) the Javadoc generation step fails with 
> the error:
> {{[ERROR] Failed to execute goal 
> org.apache.maven.plugins:maven-javadoc-plugin:2.8.1:jar 
> (build-and-attach-javadocs) on project mesos: MavenReportException: Error 
> while creating archive:}}
> {{[ERROR] Exit code: 1 - javadoc: error - The code being documented uses 
> modules but the packages defined in 
> http://download.oracle.com/javase/6/docs/api/ are in the unnamed module.}}
> {{[ERROR]}}
> {{[ERROR] Command line was: /usr/lib/jvm/java-11-openjdk-amd64/bin/javadoc 
> @options}}
> {{[ERROR]}}
> {{[ERROR] Refer to the generated Javadoc files in 
> '/home/admin/mesos-deb-packaging/mesos-repo/build/src/java/target/apidocs' 
> dir.}}
> {{[ERROR] -> [Help 1]}}
> {{[ERROR]}}
> {{[ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
> switch.}}
> {{[ERROR] Re-run Maven using the -X switch to enable full debug logging.}}
> {{[ERROR]}}
> {{[ERROR] For more information about the errors and possible solutions, 
> please read the following articles:}}
> {{[ERROR] [Help 1] 
> http://cwiki.apache.org/confluence/display/MAVEN/MojoExecutionException}}
> {{make[1]: *** [Makefile:17533: java/target/mesos-1.11.0.jar] Error 1}}
> {{make[1]: Leaving directory 
> '/home/admin/mesos-deb-packaging/mesos-repo/build/src'}}
> {{make: *** [Makefile:785: all-recursive] Error 1}}
> *NOTE:* The error is at the Maven javadoc plugin call when it tries to 
> include references to the non-existent old Java 6 documentation.
> h3. POSSIBLE SOLUTION:
> Just remove the old reference by adding 
> {{<detectJavaApiLink>false</detectJavaApiLink>}} to the javadoc maven plugin 
> configuration section
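
For reference - and assuming {{detectJavaApiLink}} is the option whose name 
was stripped by the mailer above - the plugin section would look roughly like 
this:

{code:xml}
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-javadoc-plugin</artifactId>
  <configuration>
    <!-- Stop javadoc from linking against the long-gone Java 6 API docs. -->
    <detectJavaApiLink>false</detectJavaApiLink>
  </configuration>
</plugin>
{code}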



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9950) memory cgroup gone before isolator cleaning up

2021-06-24 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17369023#comment-17369023
 ] 

Charles Natali commented on MESOS-9950:
---

[~subhajitpalit] 

 

So did you check the systemd configuration?

> memory cgroup gone before isolator cleaning up
> --
>
> Key: MESOS-9950
> URL: https://issues.apache.org/jira/browse/MESOS-9950
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: longfei
>Priority: Major
>
> The memcg created by Mesos may have been deleted before the cgroup/memory 
> isolator cleans up.
> This makes the termination fail and loses the information from the original 
> termination (before the failure). 
> {code:java}
> I0821 15:16:03.025796 3354800 paths.cpp:745] Creating sandbox 
> '/opt/tiger/mesos_deploy_videoarch/mesos_zeus/slave/slaves/fb5c1a5b-e106-47c1-9fe3-6ebd311b30ee-S628/frameworks/8e4967e5-736e-4a22-90c3-7b32d526914d-/executors/mt:z03584687:1/runs/a0706ca0-fe2c-4477-8161-329b26ea5d89'
>  for user 'tiger'
> I0821 15:16:03.026199 3354800 paths.cpp:748] Creating sandbox 
> '/opt/tiger/mesos_deploy_videoarch/mesos_zeus/slave/meta/slaves/fb5c1a5b-e106-47c1-9fe3-6ebd311b30ee-S628/frameworks/8e4967e5-736e-4a22-90c3-7b32d526914d-/executors/mt:z03584687:1/runs/a0706ca0-fe2c-4477-8161-329b26ea5d89'
> I0821 15:16:03.026304 3354800 slave.cpp:9064] Launching executor 
> 'mt:z03584687:1' of framework 
> 8e4967e5-736e-4a22-90c3-7b32d526914d- with resources 
> [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}]
>  in work directory 
> '/opt/tiger/mesos_deploy_videoarch/mesos_zeus/slave/slaves/fb5c1a5b-e106-47c1-9fe3-6ebd311b30ee-S628/frameworks/8e4967e5-736e-4a22-90c3-7b32d526914d-/executors/mt:z03584687:1/runs/a0706ca0-fe2c-4477-8161-329b26ea5d89'
> I0821 15:16:03.051795 3354800 slave.cpp:3520] Launching container 
> a0706ca0-fe2c-4477-8161-329b26ea5d89 for executor 
> 'mt:z03584687:1' of framework 
> 8e4967e5-736e-4a22-90c3-7b32d526914d-
> I0821 15:16:03.076608 3354807 containerizer.cpp:1325] Starting container 
> a0706ca0-fe2c-4477-8161-329b26ea5d89
> I0821 15:16:03.076911 3354807 containerizer.cpp:3185] Transitioning the state 
> of container a0706ca0-fe2c-4477-8161-329b26ea5d89 from PROVISIONING to 
> PREPARING
> I0821 15:16:03.077906 3354802 memory.cpp:478] Started listening for OOM 
> events for container a0706ca0-fe2c-4477-8161-329b26ea5d89
> I0821 15:16:03.079540 3354804 memory.cpp:198] Updated 
> 'memory.soft_limit_in_bytes' to 4032MB for container 
> a0706ca0-fe2c-4477-8161-329b26ea5d89
> I0821 15:16:03.079587 3354820 cpu.cpp:92] Updated 'cpu.shares' to 1126 (cpus 
> 1.1) for container a0706ca0-fe2c-4477-8161-329b26ea5d89
> I0821 15:16:03.079589 3354804 memory.cpp:227] Updated 'memory.limit_in_bytes' 
> to 4032MB for container a0706ca0-fe2c-4477-8161-329b26ea5d89
> I0821 15:16:03.080901 3354802 switchboard.cpp:316] Container logger module 
> finished preparing container a0706ca0-fe2c-4477-8161-329b26ea5d89; 
> IOSwitchboard server is not required
> I0821 15:16:03.081593 3354801 linux_launcher.cpp:492] Launching container 
> a0706ca0-fe2c-4477-8161-329b26ea5d89 and cloning with namespaces
> I0821 15:16:03.083823 3354808 containerizer.cpp:2107] Checkpointing 
> container's forked pid 1857418 to 
> '/opt/tiger/mesos_deploy_videoarch/mesos_zeus/slave/meta/slaves/fb5c1a5b-e106-47c1-9fe3-6ebd311b30ee-S628/frameworks/8e4967e5-736e-4a22-90c3-7b32d526914d-/executors/mt:z03584687:1/runs/a0706ca0-fe2c-4477-8161-329b26ea5d89/pids/forked.pid'
> I0821 15:16:03.084156 3354808 containerizer.cpp:3185] Transitioning the state 
> of container a0706ca0-fe2c-4477-8161-329b26ea5d89 from PREPARING to ISOLATING
> I0821 15:16:03.091468 3354808 containerizer.cpp:3185] Transitioning the state 
> of container a0706ca0-fe2c-4477-8161-329b26ea5d89 from ISOLATING to FETCHING
> I0821 15:16:03.094933 3354808 containerizer.cpp:3185] Transitioning the state 
> of container a0706ca0-fe2c-4477-8161-329b26ea5d89 from FETCHING to RUNNING
> I0821 15:16:03.197753 3354808 memory.cpp:198] Updated 
> 'memory.soft_limit_in_bytes' to 4032MB for container 
> a0706ca0-fe2c-4477-8161-329b26ea5d89
> I0821 15:16:03.197757 3354801 cpu.cpp:92] Updated 'cpu.shares' to 1126 (cpus 
> 1.1) for container a0706ca0-fe2c-4477-8161-329b26ea5d89
> I0821 15:21:39.692978 3354814 memory.cpp:515] OOM detected for container 
> a0706ca0-fe2c-4477-8161-329b26ea5d89
> I0821 15:21:39.693182 3354805 containerizer.cpp:3044] Container 
> a0706ca0-fe2c-4477-8161-329b26ea5d89 has reached its limit for resource [] 
> and will be terminated
> I0821 15:21:39.693192 3354805 containerizer.cpp:2518] Destroying 

[jira] [Assigned] (MESOS-10223) Crashes on ARM64 due to bad interaction of libunwind with libgcc.

2021-06-24 Thread Charles Natali (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Natali reassigned MESOS-10223:
--

Assignee: Charles Natali
 Summary: Crashes on ARM64 due to bad interaction of libunwind with libgcc. 
  (was: Test failures on Linux ARM64)

Thanks [~mgrigorov], unfortunately those logs aren't really helpful because 
they just show that the test hangs, but don't show which test.

The actual log for the tests can be obtained by running e.g.:

 
{noformat}
# ./bin/mesos-tests.sh --verbose{noformat}
 

 

Note that I can actually reproduce this hang with master on my machine, so it 
is very likely unrelated to this problem and not ARM64-specific.

I'll try to address it in a separate issue.

I've created a PR for the original crash: 
https://github.com/apache/mesos/pull/395

 

> Crashes on ARM64 due to bad interaction of libunwind with libgcc. 
> --
>
> Key: MESOS-10223
> URL: https://issues.apache.org/jira/browse/MESOS-10223
> Project: Mesos
>  Issue Type: Bug
>Reporter: Martin Tzvetanov Grigorov
>Assignee: Charles Natali
>Priority: Major
> Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, 
> mesos-on-arm64.tgz, sudo_make_check_output.txt
>
>
> Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors:
>  
> {code:java}
>  [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.NumberFormat
> [   OK ] JsonTest.NumberFormat (0 ms)
> [ RUN  ] JsonTest.Find
> terminate called after throwing an instance of 
> 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::bad_lexical_cast> >'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from 
> PID 2317; stack trace: ***
> @ 0xa80e77fc ([vdso]+0x7fb)
> @ 0xa7b71188 gsignal
> @ 0xa7b5ddac abort
> @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d715b0 __cxa_rethrow
> @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d71544 __cxa_throw
> @ 0xab4ee114 boost::throw_exception<>()
> @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>()
> @ 0xab5c2228 boost::lexical_cast<>()
> @ 0xab5bf89c numify<>()
> @ 0xab5e00e8 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5cdd2c JsonTest_Find_Test::TestBody()
> @ 0xab886fec 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87f1d4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab85a9d0 testing::Test::Run()
> @ 0xab85b258 testing::TestInfo::Run()
> @ 0xab85b8d0 testing::TestCase::Run()
> @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests()
> @ 0xab888440 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87ffd4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab86100c testing::UnitTest::Run()
> @ 0xab630950 RUN_ALL_TESTS()
> @ 0xab630418 main
> @ 0xa7b5e110 __libc_start_main
> @ 0xab4b41d4 (unknown)
> [FAIL]: 8 shard(s) have failed tests
> make[6]: *** [Makefile:2092: check-local] Error 8
> make[6]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[5]: *** [Makefile:1840: check-am] Error 2
> make[5]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[4]: *** [Makefile:1685: check-recursive] Error 1
> make[4]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[3]: *** [Makefile:1842: check] Error 2
> make[3]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[2]: *** [Makefile:1153: check-recursive] Error 1
> make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make[1]: *** [Makefile:1306: check] Error 2
> make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make: *** [Makefile:785: check-recursive] Error 1
> {code}
>  
> {code:java}
> [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.InvalidUTF8
> [   OK ] JsonTest.InvalidUTF8 (0 ms)
> [ RUN  ] JsonTest.ParseError
> terminate called after throwing an instance of 'std::overflow_error'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d 

[jira] [Comment Edited] (MESOS-10223) Test failures on Linux ARM64

2021-06-23 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368370#comment-17368370
 ] 

Charles Natali edited comment on MESOS-10223 at 6/23/21, 6:00 PM:
--

bq. After running for 3 hours make check failed on two shards with:

Yeah this error is fine and unrelated.

Did the root one finish?





was (Author: cf.natali):
Yeah this error is fine and unrelated.

Did the root one finish?

On Wed, 23 Jun 2021, 14:23 Martin Tzvetanov Grigorov (Jira), <



> Test failures on Linux ARM64
> 
>
> Key: MESOS-10223
> URL: https://issues.apache.org/jira/browse/MESOS-10223
> Project: Mesos
>  Issue Type: Bug
>Reporter: Martin Tzvetanov Grigorov
>Priority: Major
> Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, 
> mesos-on-arm64.tgz
>
>
> Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors:
>  
> {code:java}
>  [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.NumberFormat
> [   OK ] JsonTest.NumberFormat (0 ms)
> [ RUN  ] JsonTest.Find
> terminate called after throwing an instance of 
> 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::bad_lexical_cast> >'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from 
> PID 2317; stack trace: ***
> @ 0xa80e77fc ([vdso]+0x7fb)
> @ 0xa7b71188 gsignal
> @ 0xa7b5ddac abort
> @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d715b0 __cxa_rethrow
> @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d71544 __cxa_throw
> @ 0xab4ee114 boost::throw_exception<>()
> @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>()
> @ 0xab5c2228 boost::lexical_cast<>()
> @ 0xab5bf89c numify<>()
> @ 0xab5e00e8 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5cdd2c JsonTest_Find_Test::TestBody()
> @ 0xab886fec 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87f1d4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab85a9d0 testing::Test::Run()
> @ 0xab85b258 testing::TestInfo::Run()
> @ 0xab85b8d0 testing::TestCase::Run()
> @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests()
> @ 0xab888440 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87ffd4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab86100c testing::UnitTest::Run()
> @ 0xab630950 RUN_ALL_TESTS()
> @ 0xab630418 main
> @ 0xa7b5e110 __libc_start_main
> @ 0xab4b41d4 (unknown)
> [FAIL]: 8 shard(s) have failed tests
> make[6]: *** [Makefile:2092: check-local] Error 8
> make[6]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[5]: *** [Makefile:1840: check-am] Error 2
> make[5]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[4]: *** [Makefile:1685: check-recursive] Error 1
> make[4]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[3]: *** [Makefile:1842: check] Error 2
> make[3]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[2]: *** [Makefile:1153: check-recursive] Error 1
> make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make[1]: *** [Makefile:1306: check] Error 2
> make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make: *** [Makefile:785: check-recursive] Error 1
> {code}
>  
> {code:java}
> [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.InvalidUTF8
> [   OK ] JsonTest.InvalidUTF8 (0 ms)
> [ RUN  ] JsonTest.ParseError
> terminate called after throwing an instance of 'std::overflow_error'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from 
> PID 2316; stack trace: ***
> @ 0x918dd7fc ([vdso]+0x7fb)
> @ 0x91367188 gsignal
> @ 0x91353dac abort
> @ 0x91569848 __gnu_cxx::__verbose_terminate_handler()
> @ 0x915671ec (unknown)
> @ 0x91567250 std::terminate()
> @ 

[jira] [Comment Edited] (MESOS-10223) Test failures on Linux ARM64

2021-06-23 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368371#comment-17368371
 ] 

Charles Natali edited comment on MESOS-10223 at 6/23/21, 6:00 PM:
--

{quote}I may have found another issue today.

I tried to build Mesos with *make -j $(nproc)*, i.e. 8, and it failed with:
{quote}
 

You're probably running out of memory when the build parallelism is too
high - the compilation is quite memory-intensive.


was (Author: cf.natali):
You're probably running out of memory when the build parallelism is too
high - the compilation is quite memory-intensive.

On Wed, 23 Jun 2021, 11:32 Martin Tzvetanov Grigorov (Jira), <



> Test failures on Linux ARM64
> 
>
> Key: MESOS-10223
> URL: https://issues.apache.org/jira/browse/MESOS-10223
> Project: Mesos
>  Issue Type: Bug
>Reporter: Martin Tzvetanov Grigorov
>Priority: Major
> Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, 
> mesos-on-arm64.tgz
>
>
> Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors:
>  
> {code:java}
>  [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.NumberFormat
> [   OK ] JsonTest.NumberFormat (0 ms)
> [ RUN  ] JsonTest.Find
> terminate called after throwing an instance of 
> 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::bad_lexical_cast> >'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from 
> PID 2317; stack trace: ***
> @ 0xa80e77fc ([vdso]+0x7fb)
> @ 0xa7b71188 gsignal
> @ 0xa7b5ddac abort
> @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d715b0 __cxa_rethrow
> @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d71544 __cxa_throw
> @ 0xab4ee114 boost::throw_exception<>()
> @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>()
> @ 0xab5c2228 boost::lexical_cast<>()
> @ 0xab5bf89c numify<>()
> @ 0xab5e00e8 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5cdd2c JsonTest_Find_Test::TestBody()
> @ 0xab886fec 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87f1d4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab85a9d0 testing::Test::Run()
> @ 0xab85b258 testing::TestInfo::Run()
> @ 0xab85b8d0 testing::TestCase::Run()
> @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests()
> @ 0xab888440 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87ffd4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab86100c testing::UnitTest::Run()
> @ 0xab630950 RUN_ALL_TESTS()
> @ 0xab630418 main
> @ 0xa7b5e110 __libc_start_main
> @ 0xab4b41d4 (unknown)
> [FAIL]: 8 shard(s) have failed tests
> make[6]: *** [Makefile:2092: check-local] Error 8
> make[6]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[5]: *** [Makefile:1840: check-am] Error 2
> make[5]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[4]: *** [Makefile:1685: check-recursive] Error 1
> make[4]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[3]: *** [Makefile:1842: check] Error 2
> make[3]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[2]: *** [Makefile:1153: check-recursive] Error 1
> make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make[1]: *** [Makefile:1306: check] Error 2
> make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make: *** [Makefile:785: check-recursive] Error 1
> {code}
>  
> {code:java}
> [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.InvalidUTF8
> [   OK ] JsonTest.InvalidUTF8 (0 ms)
> [ RUN  ] JsonTest.ParseError
> terminate called after throwing an instance of 'std::overflow_error'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from 
> PID 2316; stack trace: ***
> @ 0x918dd7fc ([vdso]+0x7fb)
> @ 0x91367188 gsignal
> @ 

[jira] [Commented] (MESOS-10224) [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.

2021-06-23 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368386#comment-17368386
 ] 

Charles Natali commented on MESOS-10224:


I'd go for the last option, i.e. return an error only if the data pointer is 
past the end of the buffer.

> What are your thoughts? All of the above are quick adjustments but they 
> weaken the original checks.

Yes but it's fine, since the original check doesn't work anymore :).

> [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.
> -
>
> Key: MESOS-10224
> URL: https://issues.apache.org/jira/browse/MESOS-10224
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.11.0
>Reporter: Saad Ur Rahman
>Priority: Major
> Attachments: ld.so.cache
>
>
> *OS:* Ubuntu 21.04
> *Command:*
> {code:java}
> make -j 6 V=0 check{code}
> Fails during the build and test suite run on two different machines with the 
> same OS.
> {code:java}
> 3: [   OK ] CSIVersion/StorageLocalResourceProviderTest.Update/v0 (479 ms)
> 3: [--] 14 tests from CSIVersion/StorageLocalResourceProviderTest 
> (27011 ms total)
> 3: 
> 3: [--] Global test environment tear-down
> 3: [==] 575 tests from 178 test cases ran. (202572 ms total)
> 3: [  PASSED  ] 573 tests.
> 3: [  FAILED  ] 2 tests, listed below:
> 3: [  FAILED  ] LdcacheTest.Parse
> 3: [  FAILED  ] 
> CSIVersion/StorageLocalResourceProviderTest.OperationUpdate/v0, where 
> GetParam() = "v0"
> 3: 
> 3:  2 FAILED TESTS
> 3:   YOU HAVE 34 DISABLED TESTS
> 3: 
> 3: 
> 3: 
> 3: [FAIL]: 4 shard(s) have failed tests
> 3/3 Test #3: MesosTests ...***Failed  1173.43 sec
> {code}
> Are there any pre-requisites required to get the build/tests to pass? I am 
> trying to get all the tests to pass to make sure my build environment is 
> set up correctly for development.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10223) Test failures on Linux ARM64

2021-06-23 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368370#comment-17368370
 ] 

Charles Natali commented on MESOS-10223:


Yeah this error is fine and unrelated.

Did the root one finish?




> Test failures on Linux ARM64
> 
>
> Key: MESOS-10223
> URL: https://issues.apache.org/jira/browse/MESOS-10223
> Project: Mesos
>  Issue Type: Bug
>Reporter: Martin Tzvetanov Grigorov
>Priority: Major
> Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, 
> mesos-on-arm64.tgz
>
>
> Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors:
>  
> {code:java}
>  [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.NumberFormat
> [   OK ] JsonTest.NumberFormat (0 ms)
> [ RUN  ] JsonTest.Find
> terminate called after throwing an instance of 
> 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::bad_lexical_cast> >'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from 
> PID 2317; stack trace: ***
> @ 0xa80e77fc ([vdso]+0x7fb)
> @ 0xa7b71188 gsignal
> @ 0xa7b5ddac abort
> @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d715b0 __cxa_rethrow
> @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d71544 __cxa_throw
> @ 0xab4ee114 boost::throw_exception<>()
> @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>()
> @ 0xab5c2228 boost::lexical_cast<>()
> @ 0xab5bf89c numify<>()
> @ 0xab5e00e8 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5cdd2c JsonTest_Find_Test::TestBody()
> @ 0xab886fec 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87f1d4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab85a9d0 testing::Test::Run()
> @ 0xab85b258 testing::TestInfo::Run()
> @ 0xab85b8d0 testing::TestCase::Run()
> @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests()
> @ 0xab888440 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87ffd4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab86100c testing::UnitTest::Run()
> @ 0xab630950 RUN_ALL_TESTS()
> @ 0xab630418 main
> @ 0xa7b5e110 __libc_start_main
> @ 0xab4b41d4 (unknown)
> [FAIL]: 8 shard(s) have failed tests
> make[6]: *** [Makefile:2092: check-local] Error 8
> make[6]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[5]: *** [Makefile:1840: check-am] Error 2
> make[5]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[4]: *** [Makefile:1685: check-recursive] Error 1
> make[4]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[3]: *** [Makefile:1842: check] Error 2
> make[3]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[2]: *** [Makefile:1153: check-recursive] Error 1
> make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make[1]: *** [Makefile:1306: check] Error 2
> make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make: *** [Makefile:785: check-recursive] Error 1
> {code}
>  
> {code:java}
> [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.InvalidUTF8
> [   OK ] JsonTest.InvalidUTF8 (0 ms)
> [ RUN  ] JsonTest.ParseError
> terminate called after throwing an instance of 'std::overflow_error'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from 
> PID 2316; stack trace: ***
> @ 0x918dd7fc ([vdso]+0x7fb)
> @ 0x91367188 gsignal
> @ 0x91353dac abort
> @ 0x91569848 __gnu_cxx::__verbose_terminate_handler()
> @ 0x915671ec (unknown)
> @ 0x91567250 std::terminate()
> @ 0x915675b0 __cxa_rethrow
> @ 0x915697e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0x915671ec (unknown)
> @ 0x91567250 std::terminate()
> @ 0x91567544 

[jira] [Commented] (MESOS-10223) Test failures on Linux ARM64

2021-06-23 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17368371#comment-17368371
 ] 

Charles Natali commented on MESOS-10223:


You're probably running out of memory when the build parallelism is too
high - the compilation is quite memory-intensive.
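
If that's the case, capping the parallelism (or adding some swap) should get 
the build through, e.g.:

{noformat}
make -j2 V=0
{noformat}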




> Test failures on Linux ARM64
> 
>
> Key: MESOS-10223
> URL: https://issues.apache.org/jira/browse/MESOS-10223
> Project: Mesos
>  Issue Type: Bug
>Reporter: Martin Tzvetanov Grigorov
>Priority: Major
> Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, 
> mesos-on-arm64.tgz
>
>
> Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors:
>  
> {code:java}
>  [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.NumberFormat
> [   OK ] JsonTest.NumberFormat (0 ms)
> [ RUN  ] JsonTest.Find
> terminate called after throwing an instance of 
> 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<boost::bad_lexical_cast> >'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from 
> PID 2317; stack trace: ***
> @ 0xa80e77fc ([vdso]+0x7fb)
> @ 0xa7b71188 gsignal
> @ 0xa7b5ddac abort
> @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d715b0 __cxa_rethrow
> @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d71544 __cxa_throw
> @ 0xab4ee114 boost::throw_exception<>()
> @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>()
> @ 0xab5c2228 boost::lexical_cast<>()
> @ 0xab5bf89c numify<>()
> @ 0xab5e00e8 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5cdd2c JsonTest_Find_Test::TestBody()
> @ 0xab886fec 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87f1d4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab85a9d0 testing::Test::Run()
> @ 0xab85b258 testing::TestInfo::Run()
> @ 0xab85b8d0 testing::TestCase::Run()
> @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests()
> @ 0xab888440 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87ffd4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab86100c testing::UnitTest::Run()
> @ 0xab630950 RUN_ALL_TESTS()
> @ 0xab630418 main
> @ 0xa7b5e110 __libc_start_main
> @ 0xab4b41d4 (unknown)
> [FAIL]: 8 shard(s) have failed tests
> make[6]: *** [Makefile:2092: check-local] Error 8
> make[6]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[5]: *** [Makefile:1840: check-am] Error 2
> make[5]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[4]: *** [Makefile:1685: check-recursive] Error 1
> make[4]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[3]: *** [Makefile:1842: check] Error 2
> make[3]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[2]: *** [Makefile:1153: check-recursive] Error 1
> make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make[1]: *** [Makefile:1306: check] Error 2
> make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make: *** [Makefile:785: check-recursive] Error 1
> {code}
>  
> {code:java}
> [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.InvalidUTF8
> [   OK ] JsonTest.InvalidUTF8 (0 ms)
> [ RUN  ] JsonTest.ParseError
> terminate called after throwing an instance of 'std::overflow_error'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from 
> PID 2316; stack trace: ***
> @ 0x918dd7fc ([vdso]+0x7fb)
> @ 0x91367188 gsignal
> @ 0x91353dac abort
> @ 0x91569848 __gnu_cxx::__verbose_terminate_handler()
> @ 0x915671ec (unknown)
> @ 0x91567250 std::terminate()
> @ 0x915675b0 __cxa_rethrow
> @ 0x915697e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0x915671ec (unknown)
> @ 

[jira] [Commented] (MESOS-10224) [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.

2021-06-22 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367602#comment-17367602
 ] 

Charles Natali commented on MESOS-10224:


Actually [~surahman] you should go ahead, it's a nice and easy fix!
The problematic code is here: 
https://github.com/apache/mesos/blob/master/src/linux/ldcache.cpp#L227

The code expects the file to end after the last entry, whereas in your case it 
doesn't, since there's a description string at the end of the file.
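
For illustration, here's a rough sketch of a more tolerant parser (my own 
sketch - struct layouts approximated from glibc's {{cache_file_new}} / 
{{file_entry_new}}, not copied from the Mesos source). The point is to iterate 
over exactly the number of entries declared in the header and ignore anything 
after them, such as Ubuntu's trailing ldconfig description:

{noformat}
// Sketch only: layouts are approximations of glibc's cache_file_new /
// file_entry_new; real code must validate the magic/version strings
// and bounds-check every offset against the file size.
#include <cstdint>
#include <cstring>
#include <string>
#include <utility>
#include <vector>

struct CacheHeader
{
  char magic[17];        // "glibc-ld.so.cache"
  char version[3];       // "1.1"
  uint32_t nlibs;        // number of entries that follow
  uint32_t len_strings;  // size of the string table
  uint32_t unused[5];
};

struct CacheEntry
{
  int32_t flags;
  uint32_t key;    // offset of the soname, relative to the file start
  uint32_t value;  // offset of the path, relative to the file start
  uint32_t osversion;
  uint64_t hwcap;
};

// `data` holds the raw contents of /etc/ld.so.cache.
std::vector<std::pair<std::string, std::string>> parse(const std::string& data)
{
  CacheHeader header;
  std::memcpy(&header, data.data(), sizeof(header));

  std::vector<std::pair<std::string, std::string>> entries;

  // Read exactly `nlibs` entries instead of assuming that the entries
  // and strings fill the whole file: any trailing data - like the
  // "ldconfig (Ubuntu GLIBC ...)" description - is simply ignored.
  for (uint32_t i = 0; i < header.nlibs; i++) {
    CacheEntry entry;
    std::memcpy(
        &entry,
        data.data() + sizeof(header) + i * sizeof(CacheEntry),
        sizeof(entry));

    entries.emplace_back(
        std::string(data.data() + entry.key),
        std::string(data.data() + entry.value));
  }

  return entries;
}
{noformat}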


> [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.
> -
>
> Key: MESOS-10224
> URL: https://issues.apache.org/jira/browse/MESOS-10224
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.11.0
>Reporter: Saad Ur Rahman
>Priority: Major
> Attachments: ld.so.cache
>
>
> *OS:* Ubuntu 21.04
> *Command:*
> {code:java}
> make -j 6 V=0 check{code}
> Fails during the build and test suite run on two different machines with the 
> same OS.
> {code:java}
> 3: [   OK ] CSIVersion/StorageLocalResourceProviderTest.Update/v0 (479 ms)
> 3: [--] 14 tests from CSIVersion/StorageLocalResourceProviderTest 
> (27011 ms total)
> 3: 
> 3: [--] Global test environment tear-down
> 3: [==] 575 tests from 178 test cases ran. (202572 ms total)
> 3: [  PASSED  ] 573 tests.
> 3: [  FAILED  ] 2 tests, listed below:
> 3: [  FAILED  ] LdcacheTest.Parse
> 3: [  FAILED  ] 
> CSIVersion/StorageLocalResourceProviderTest.OperationUpdate/v0, where 
> GetParam() = "v0"
> 3: 
> 3:  2 FAILED TESTS
> 3:   YOU HAVE 34 DISABLED TESTS
> 3: 
> 3: 
> 3: 
> 3: [FAIL]: 4 shard(s) have failed tests
> 3/3 Test #3: MesosTests ...***Failed  1173.43 sec
> {code}
> Are there any prerequisites required to get the build/tests to pass? I am 
> trying to get all the tests to pass to make sure my build environment is 
> set up correctly for development.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10224) [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.

2021-06-22 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367584#comment-17367584
 ] 

Charles Natali commented on MESOS-10224:


Ah, here's the problem: it looks like Ubuntu appends some extra data at the end 
of the cache.
Let's look at the end of the file.

Mine - Debian - ends with an entry and then the NUL byte:

{noformat}
cf@thinkpad:~/src/mesos$ hexdump -C /etc/ld.so.cache.default | tail
00014d20  36 00 2f 75 73 72 2f 6c  69 62 2f 6c 69 62 42 4c  |6./usr/lib/libBL|
00014d30  54 6c 69 74 65 2e 32 2e  35 2e 73 6f 2e 38 2e 36  |Tlite.2.5.so.8.6|
00014d40  00 6c 69 62 42 4c 54 2e  32 2e 35 2e 73 6f 2e 38  |.libBLT.2.5.so.8|
00014d50  2e 36 00 2f 75 73 72 2f  6c 69 62 2f 6c 69 62 42  |.6./usr/lib/libB|
00014d60  4c 54 2e 32 2e 35 2e 73  6f 2e 38 2e 36 00 6c 64  |LT.2.5.so.8.6.ld|
00014d70  2d 6c 69 6e 75 78 2d 78  38 36 2d 36 34 2e 73 6f  |-linux-x86-64.so|
00014d80  2e 32 00 2f 6c 69 62 2f  78 38 36 5f 36 34 2d 6c  |.2./lib/x86_64-l|
00014d90  69 6e 75 78 2d 67 6e 75  2f 6c 64 2d 6c 69 6e 75  |inux-gnu/ld-linu|
00014da0  78 2d 78 38 36 2d 36 34  2e 73 6f 2e 32 00|x-x86-64.so.2.|
00014dae
{noformat}


Yours, on the other hand, ends with a description string appended by ldconfig:


{noformat}
cf@thinkpad:~/src/mesos$ hexdump -C /etc/ld.so.cache | tail
000130c0  6f 2e 30 2e 30 00 2f 6c  69 62 2f 78 38 36 5f 36  |o.0.0./lib/x86_6|
000130d0  34 2d 6c 69 6e 75 78 2d  67 6e 75 2f 6c 69 62 67  |4-linux-gnu/libg|
000130e0  63 69 2d 31 2e 73 6f 2e  30 2e 30 2e 30 00 00 00  |ci-1.so.0.0.0...|
000130f0  74 21 a4 ea 01 00 00 00  00 00 00 00 00 00 00 00  |t!..|
00013100  08 31 01 00 42 00 00 00  6c 64 63 6f 6e 66 69 67  |.1..B...ldconfig|
00013110  20 28 55 62 75 6e 74 75  20 47 4c 49 42 43 20 32  | (Ubuntu GLIBC 2|
00013120  2e 33 33 2d 30 75 62 75  6e 74 75 35 29 20 72 65  |.33-0ubuntu5) re|
00013130  6c 65 61 73 65 20 72 65  6c 65 61 73 65 20 76 65  |lease release ve|
00013140  72 73 69 6f 6e 20 32 2e  33 33|rsion 2.33|
0001314a
{noformat}

Trivial to fix, give me a minute...

> [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.
> -
>
> Key: MESOS-10224
> URL: https://issues.apache.org/jira/browse/MESOS-10224
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.11.0
>Reporter: Saad Ur Rahman
>Priority: Major
> Attachments: ld.so.cache
>
>
> *OS:* Ubuntu 21.04
> *Command:*
> {code:java}
> make -j 6 V=0 check{code}
> Fails during the build and test suite run on two different machines with the 
> same OS.
> {code:java}
> 3: [   OK ] CSIVersion/StorageLocalResourceProviderTest.Update/v0 (479 ms)
> 3: [--] 14 tests from CSIVersion/StorageLocalResourceProviderTest 
> (27011 ms total)
> 3: 
> 3: [--] Global test environment tear-down
> 3: [==] 575 tests from 178 test cases ran. (202572 ms total)
> 3: [  PASSED  ] 573 tests.
> 3: [  FAILED  ] 2 tests, listed below:
> 3: [  FAILED  ] LdcacheTest.Parse
> 3: [  FAILED  ] 
> CSIVersion/StorageLocalResourceProviderTest.OperationUpdate/v0, where 
> GetParam() = "v0"
> 3: 
> 3:  2 FAILED TESTS
> 3:   YOU HAVE 34 DISABLED TESTS
> 3: 
> 3: 
> 3: 
> 3: [FAIL]: 4 shard(s) have failed tests
> 3/3 Test #3: MesosTests ...***Failed  1173.43 sec
> {code}
> Are there any prerequisites required to get the build/tests to pass? I am 
> trying to get all the tests to pass to make sure my build environment is 
> set up correctly for development.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10224) [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.

2021-06-22 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367579#comment-17367579
 ] 

Charles Natali commented on MESOS-10224:


No, it should really work; it's a bit strange.
Possibly there's something special about your cache, but it looks valid since I 
can parse it using {{ldconfig -p}}.

Shouldn't be too difficult to fix, hopefully.

> [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.
> -
>
> Key: MESOS-10224
> URL: https://issues.apache.org/jira/browse/MESOS-10224
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.11.0
>Reporter: Saad Ur Rahman
>Priority: Major
> Attachments: ld.so.cache
>
>
> *OS:* Ubuntu 21.04
> *Command:*
> {code:java}
> make -j 6 V=0 check{code}
> Fails during the build and test suite run on two different machines with the 
> same OS.
> {code:java}
> 3: [   OK ] CSIVersion/StorageLocalResourceProviderTest.Update/v0 (479 ms)
> 3: [--] 14 tests from CSIVersion/StorageLocalResourceProviderTest 
> (27011 ms total)
> 3: 
> 3: [--] Global test environment tear-down
> 3: [==] 575 tests from 178 test cases ran. (202572 ms total)
> 3: [  PASSED  ] 573 tests.
> 3: [  FAILED  ] 2 tests, listed below:
> 3: [  FAILED  ] LdcacheTest.Parse
> 3: [  FAILED  ] 
> CSIVersion/StorageLocalResourceProviderTest.OperationUpdate/v0, where 
> GetParam() = "v0"
> 3: 
> 3:  2 FAILED TESTS
> 3:   YOU HAVE 34 DISABLED TESTS
> 3: 
> 3: 
> 3: 
> 3: [FAIL]: 4 shard(s) have failed tests
> 3/3 Test #3: MesosTests ...***Failed  1173.43 sec
> {code}
> Are there any prerequisites required to get the build/tests to pass? I am 
> trying to get all the tests to pass to make sure my build environment is 
> set up correctly for development.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10224) [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.

2021-06-22 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367572#comment-17367572
 ] 

Charles Natali commented on MESOS-10224:


Interesting, I can reproduce it - I'll have a look.

> [test] CSIVersion/StorageLocalResourceProviderTest.OperationUpdate fails.
> -
>
> Key: MESOS-10224
> URL: https://issues.apache.org/jira/browse/MESOS-10224
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 1.11.0
>Reporter: Saad Ur Rahman
>Priority: Major
> Attachments: ld.so.cache
>
>
> *OS:* Ubuntu 21.04
> *Command:*
> {code:java}
> make -j 6 V=0 check{code}
> Fails during the build and test suite run on two different machines with the 
> same OS.
> {code:java}
> 3: [   OK ] CSIVersion/StorageLocalResourceProviderTest.Update/v0 (479 ms)
> 3: [--] 14 tests from CSIVersion/StorageLocalResourceProviderTest 
> (27011 ms total)
> 3: 
> 3: [--] Global test environment tear-down
> 3: [==] 575 tests from 178 test cases ran. (202572 ms total)
> 3: [  PASSED  ] 573 tests.
> 3: [  FAILED  ] 2 tests, listed below:
> 3: [  FAILED  ] LdcacheTest.Parse
> 3: [  FAILED  ] 
> CSIVersion/StorageLocalResourceProviderTest.OperationUpdate/v0, where 
> GetParam() = "v0"
> 3: 
> 3:  2 FAILED TESTS
> 3:   YOU HAVE 34 DISABLED TESTS
> 3: 
> 3: 
> 3: 
> 3: [FAIL]: 4 shard(s) have failed tests
> 3/3 Test #3: MesosTests ...***Failed  1173.43 sec
> {code}
> Are there any prerequisites required to get the build/tests to pass? I am 
> trying to get all the tests to pass to make sure my build environment is 
> set up correctly for development.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10223) Test failures on Linux ARM64

2021-06-22 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367559#comment-17367559
 ] 

Charles Natali commented on MESOS-10223:


By the way, the reason for running it as root is that many tests are only run 
as root (e.g. tests which need cgroups etc), so it'd be nice to make sure they 
pass.

> Test failures on Linux ARM64
> 
>
> Key: MESOS-10223
> URL: https://issues.apache.org/jira/browse/MESOS-10223
> Project: Mesos
>  Issue Type: Bug
>Reporter: Martin Tzvetanov Grigorov
>Priority: Major
> Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, 
> mesos-on-arm64.tgz
>
>
> Running `make check` on Ubuntu 20.04.2 aarch64 fails with errors such as:
>  
> {code:java}
>  [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.NumberFormat
> [   OK ] JsonTest.NumberFormat (0 ms)
> [ RUN  ] JsonTest.Find
> terminate called after throwing an instance of 
> 'boost::exception_detail::clone_impl
>  >'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from 
> PID 2317; stack trace: ***
> @ 0xa80e77fc ([vdso]+0x7fb)
> @ 0xa7b71188 gsignal
> @ 0xa7b5ddac abort
> @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d715b0 __cxa_rethrow
> @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d71544 __cxa_throw
> @ 0xab4ee114 boost::throw_exception<>()
> @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>()
> @ 0xab5c2228 boost::lexical_cast<>()
> @ 0xab5bf89c numify<>()
> @ 0xab5e00e8 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5cdd2c JsonTest_Find_Test::TestBody()
> @ 0xab886fec 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87f1d4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab85a9d0 testing::Test::Run()
> @ 0xab85b258 testing::TestInfo::Run()
> @ 0xab85b8d0 testing::TestCase::Run()
> @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests()
> @ 0xab888440 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87ffd4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab86100c testing::UnitTest::Run()
> @ 0xab630950 RUN_ALL_TESTS()
> @ 0xab630418 main
> @ 0xa7b5e110 __libc_start_main
> @ 0xab4b41d4 (unknown)
> [FAIL]: 8 shard(s) have failed tests
> make[6]: *** [Makefile:2092: check-local] Error 8
> make[6]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[5]: *** [Makefile:1840: check-am] Error 2
> make[5]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[4]: *** [Makefile:1685: check-recursive] Error 1
> make[4]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[3]: *** [Makefile:1842: check] Error 2
> make[3]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[2]: *** [Makefile:1153: check-recursive] Error 1
> make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make[1]: *** [Makefile:1306: check] Error 2
> make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make: *** [Makefile:785: check-recursive] Error 1
> {code}
>  
> {code:java}
> [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.InvalidUTF8
> [   OK ] JsonTest.InvalidUTF8 (0 ms)
> [ RUN  ] JsonTest.ParseError
> terminate called after throwing an instance of 'std::overflow_error'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from 
> PID 2316; stack trace: ***
> @ 0x918dd7fc ([vdso]+0x7fb)
> @ 0x91367188 gsignal
> @ 0x91353dac abort
> @ 0x91569848 __gnu_cxx::__verbose_terminate_handler()
> @ 0x915671ec (unknown)
> @ 0x91567250 std::terminate()
> @ 0x915675b0 __cxa_rethrow
> @ 0x915697e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0x915671ec (unknown)
> @ 0x91567250 std::terminate()
>   

[jira] [Commented] (MESOS-10223) Test failures on Linux ARM64

2021-06-22 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367554#comment-17367554
 ] 

Charles Natali commented on MESOS-10223:


Hey [~mgrigorov]

The attached patch should fix the issue - I ran the whole test suite and it 
pretty much passed. However, it would be great if you could run it as root with 
the attached patch, just to make sure: 
[^0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch] 

There might be some unrelated/transient errors, but I'm mainly interested in 
seeing that this problem is fixed.

Thanks!

> Test failures on Linux ARM64
> 
>
> Key: MESOS-10223
> URL: https://issues.apache.org/jira/browse/MESOS-10223
> Project: Mesos
>  Issue Type: Bug
>Reporter: Martin Tzvetanov Grigorov
>Priority: Major
> Attachments: 0001-Fixed-crashes-on-ARM64-due-to-libunwind.patch, 
> mesos-on-arm64.tgz
>
>
> Running `make check` on Ubuntu 20.04.2 aarch64 fails with errors such as:
>  
> {code:java}
>  [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.NumberFormat
> [   OK ] JsonTest.NumberFormat (0 ms)
> [ RUN  ] JsonTest.Find
> terminate called after throwing an instance of 
> 'boost::exception_detail::clone_impl
>  >'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from 
> PID 2317; stack trace: ***
> @ 0xa80e77fc ([vdso]+0x7fb)
> @ 0xa7b71188 gsignal
> @ 0xa7b5ddac abort
> @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d715b0 __cxa_rethrow
> @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d71544 __cxa_throw
> @ 0xab4ee114 boost::throw_exception<>()
> @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>()
> @ 0xab5c2228 boost::lexical_cast<>()
> @ 0xab5bf89c numify<>()
> @ 0xab5e00e8 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5cdd2c JsonTest_Find_Test::TestBody()
> @ 0xab886fec 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87f1d4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab85a9d0 testing::Test::Run()
> @ 0xab85b258 testing::TestInfo::Run()
> @ 0xab85b8d0 testing::TestCase::Run()
> @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests()
> @ 0xab888440 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87ffd4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab86100c testing::UnitTest::Run()
> @ 0xab630950 RUN_ALL_TESTS()
> @ 0xab630418 main
> @ 0xa7b5e110 __libc_start_main
> @ 0xab4b41d4 (unknown)
> [FAIL]: 8 shard(s) have failed tests
> make[6]: *** [Makefile:2092: check-local] Error 8
> make[6]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[5]: *** [Makefile:1840: check-am] Error 2
> make[5]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[4]: *** [Makefile:1685: check-recursive] Error 1
> make[4]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[3]: *** [Makefile:1842: check] Error 2
> make[3]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[2]: *** [Makefile:1153: check-recursive] Error 1
> make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make[1]: *** [Makefile:1306: check] Error 2
> make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make: *** [Makefile:785: check-recursive] Error 1
> {code}
>  
> {code:java}
> [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.InvalidUTF8
> [   OK ] JsonTest.InvalidUTF8 (0 ms)
> [ RUN  ] JsonTest.ParseError
> terminate called after throwing an instance of 'std::overflow_error'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from 
> PID 2316; stack trace: ***
> @ 0x918dd7fc ([vdso]+0x7fb)
> @ 0x91367188 gsignal
> @ 0x91353dac abort
> @ 0x91569848 __gnu_cxx::__verbose_terminate_handler()
> @ 0x915671ec (unknown)
> @ 

[jira] [Comment Edited] (MESOS-10223) Test failures on Linux ARM64

2021-06-22 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367092#comment-17367092
 ] 

Charles Natali edited comment on MESOS-10223 at 6/22/21, 7:51 AM:
--

{quote}I experienced the errors both on real ARM64 host and with Docker. The 
problem with strace is not related to QEMU. It is a Docker thingy. You need to 
add a capability for it:

 

{{docker run --cap-add=SYS_PTRACE}} ...
{quote}
 

Hm, I don't think so, it's not a capability/seccomp issue: if you look at the 
error it's {{ENOSYS}} not {{EPERM}}.

 

Just to show you:
{noformat}
root@thinkpad:/home/cf/mesos-on-arm64# docker run --cap-add=SYS_PTRACE --rm -it 
-v $PWD/mesos:/mesos bui
ld-mesos-on-arm64 bash 
WARNING: The requested image's platform (linux/arm64) does not match the 
detected host platform (linux/amd64) and no specific platform was requested
root@4d45b9e91754:/mesos# apt install strace
Reading package lists... Done
Building dependency tree 
Reading state information... Done
The following NEW packages will be installed:
 strace 
0 upgraded, 1 newly installed, 0 to remove and 6 not upgraded.
Need to get 297 kB of archives.
After this operation, 1336 kB of additional disk space will be used.
Get:1 http://ports.ubuntu.com/ubuntu-ports focal-updates/main arm64 strace 
arm64 5.5-3ubuntu1 [297 kB]
Fetched 297 kB in 1s (327 kB/s)
Selecting previously unselected package strace.
(Reading database ... 18530 files and directories currently installed.)
Preparing to unpack .../strace_5.5-3ubuntu1_arm64.deb ...
Unpacking strace (5.5-3ubuntu1) ...
Setting up strace (5.5-3ubuntu1) ...
root@4d45b9e91754:/mesos# strace ls 
/usr/bin/strace: test_ptrace_get_syscall_info: PTRACE_TRACEME: Function not 
implemented
/usr/bin/strace: ptrace(PTRACE_TRACEME, ...): Function not implemented
/usr/bin/strace: PTRACE_SETOPTIONS: Function not implemented
/usr/bin/strace: detach: waitpid(115): No child processes
/usr/bin/strace: Process 115 detached
{noformat}
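
As a quick aside, here's a trivial standalone check (my own sketch, not from 
the Mesos tree) to tell the two cases apart: QEMU's user-mode emulation returns 
{{ENOSYS}} because the syscall isn't implemented at all, whereas a 
seccomp/capability restriction would typically show up as {{EPERM}}:

{noformat}
// Probe whether ptrace() works in this environment.
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/ptrace.h>

int main(void)
{
  // PTRACE_TRACEME asks to be traced by our parent; it's a cheap way
  // to exercise ptrace() without needing a child process.
  if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1) {
    if (errno == ENOSYS) {
      puts("ptrace not implemented (e.g. QEMU user-mode emulation)");
    } else if (errno == EPERM) {
      puts("ptrace blocked (e.g. seccomp or missing CAP_SYS_PTRACE)");
    } else {
      printf("ptrace failed: %s\n", strerror(errno));
    }
    return 1;
  }
  puts("ptrace available");
  return 0;
}
{noformat}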


Did you test this docker+qemu image from a non-AMD64 host?


> I will send you privately credentials to my ARM64 VM where you can debug it 
> without Docker!

Thanks, that'd be much easier.


was (Author: cf.natali):
{quote}I experienced the errors both on real ARM64 host and with Docker. The 
problem with strace is not related to QEMU. It is a Docker thingy. You need to 
add a capability for it:

 

{{docker run --cap-add=SYS_PTRACE}} ...
{quote}
 

Hm, I don't think so, it's not a capability/seccomp issue: if you look at the 
error it's {{ENOSYS}} not {{EPERM}}.

 

Just to show you:
{noformat}
root@thinkpad:/home/cf/mesos-on-arm64# docker run --cap-add=SYS_PTRACE --rm -it 
-v $PWD/mesos:/mesos bui
ld-mesos-on-arm64 bash 
WARNING: The requested image's platform (linux/arm64) does not match the 
detected host platform (linux/amd64) and no specific platform was requested
root@4d45b9e91754:/mesos# apt install strace
Reading package lists... Done
Building dependency tree 
Reading state information... Done
The following NEW packages will be installed:
 strace 
0 upgraded, 1 newly installed, 0 to remove and 6 not upgraded.
Need to get 297 kB of archives.
After this operation, 1336 kB of additional disk space will be used.
Get:1 http://ports.ubuntu.com/ubuntu-ports focal-updates/main arm64 strace 
arm64 5.5-3ubuntu1 [297 kB]
Fetched 297 kB in 1s (327 kB/s)
Selecting previously unselected package strace.
(Reading database ... 18530 files and directories currently installed.)
Preparing to unpack .../strace_5.5-3ubuntu1_arm64.deb ...
Unpacking strace (5.5-3ubuntu1) ...
Setting up strace (5.5-3ubuntu1) ...
root@4d45b9e91754:/mesos# strace ls 
/usr/bin/strace: test_ptrace_get_syscall_info: PTRACE_TRACEME: Function not 
implemented
/usr/bin/strace: ptrace(PTRACE_TRACEME, ...): Function not implemented
/usr/bin/strace: PTRACE_SETOPTIONS: Function not implemented
/usr/bin/strace: detach: waitpid(115): No child processes
/usr/bin/strace: Process 115 detached
{noformat}


Did you test this docker+qemu image from a non-AMD64 host?


> I will send you privately credentials to my ARM64 VM where you can debug it 
> without Docker!

> Test failures on Linux ARM64
> 
>
> Key: MESOS-10223
> URL: https://issues.apache.org/jira/browse/MESOS-10223
> Project: Mesos
>  Issue Type: Bug
>Reporter: Martin Tzvetanov Grigorov
>Priority: Major
> Attachments: mesos-on-arm64.tgz
>
>
> Running `make check` on Ubuntu 20.04.2 aarch64 fails with errors such as:
>  
> {code:java}
>  [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.NumberFormat
> [   OK ] JsonTest.NumberFormat (0 ms)
> [ RUN  ] JsonTest.Find
> terminate called after throwing an instance of 
> 'boost::exception_detail::clone_impl
>  >'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d 

[jira] [Commented] (MESOS-10223) Test failures on Linux ARM64

2021-06-22 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17367092#comment-17367092
 ] 

Charles Natali commented on MESOS-10223:


{quote}I experienced the errors both on real ARM64 host and with Docker. The 
problem with strace is not related to QEMU. It is a Docker thingy. You need to 
add a capability for it:

 

{{docker run --cap-add=SYS_PTRACE}} ...
{quote}
 

Hm, I don't think so, it's not a capability/seccomp issue: if you look at the 
error it's {{ENOSYS}} not {{EPERM}}.

 

Just to show you:
{noformat}
root@thinkpad:/home/cf/mesos-on-arm64# docker run --cap-add=SYS_PTRACE --rm -it 
-v $PWD/mesos:/mesos bui
ld-mesos-on-arm64 bash 
WARNING: The requested image's platform (linux/arm64) does not match the 
detected host platform (linux/amd64) and no specific platform was requested
root@4d45b9e91754:/mesos# apt install strace
Reading package lists... Done
Building dependency tree 
Reading state information... Done
The following NEW packages will be installed:
 strace 
0 upgraded, 1 newly installed, 0 to remove and 6 not upgraded.
Need to get 297 kB of archives.
After this operation, 1336 kB of additional disk space will be used.
Get:1 http://ports.ubuntu.com/ubuntu-ports focal-updates/main arm64 strace 
arm64 5.5-3ubuntu1 [297 kB]
Fetched 297 kB in 1s (327 kB/s)
Selecting previously unselected package strace.
(Reading database ... 18530 files and directories currently installed.)
Preparing to unpack .../strace_5.5-3ubuntu1_arm64.deb ...
Unpacking strace (5.5-3ubuntu1) ...
Setting up strace (5.5-3ubuntu1) ...
root@4d45b9e91754:/mesos# strace ls 
/usr/bin/strace: test_ptrace_get_syscall_info: PTRACE_TRACEME: Function not 
implemented
/usr/bin/strace: ptrace(PTRACE_TRACEME, ...): Function not implemented
/usr/bin/strace: PTRACE_SETOPTIONS: Function not implemented
/usr/bin/strace: detach: waitpid(115): No child processes
/usr/bin/strace: Process 115 detached
{noformat}


Did you test this docker+qemu image from a non-AMD64 host?


> I will send you privately credentials to my ARM64 VM where you can debug it 
> without Docker!

> Test failures on Linux ARM64
> 
>
> Key: MESOS-10223
> URL: https://issues.apache.org/jira/browse/MESOS-10223
> Project: Mesos
>  Issue Type: Bug
>Reporter: Martin Tzvetanov Grigorov
>Priority: Major
> Attachments: mesos-on-arm64.tgz
>
>
> Running `make check` on Ubuntu 20.04.2 aarch64 fails with errors such as:
>  
> {code:java}
>  [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.NumberFormat
> [   OK ] JsonTest.NumberFormat (0 ms)
> [ RUN  ] JsonTest.Find
> terminate called after throwing an instance of 
> 'boost::exception_detail::clone_impl
>  >'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from 
> PID 2317; stack trace: ***
> @ 0xa80e77fc ([vdso]+0x7fb)
> @ 0xa7b71188 gsignal
> @ 0xa7b5ddac abort
> @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d715b0 __cxa_rethrow
> @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d71544 __cxa_throw
> @ 0xab4ee114 boost::throw_exception<>()
> @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>()
> @ 0xab5c2228 boost::lexical_cast<>()
> @ 0xab5bf89c numify<>()
> @ 0xab5e00e8 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5cdd2c JsonTest_Find_Test::TestBody()
> @ 0xab886fec 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87f1d4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab85a9d0 testing::Test::Run()
> @ 0xab85b258 testing::TestInfo::Run()
> @ 0xab85b8d0 testing::TestCase::Run()
> @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests()
> @ 0xab888440 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87ffd4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab86100c testing::UnitTest::Run()
> @ 0xab630950 RUN_ALL_TESTS()
> @ 0xab630418 main
> @ 0xa7b5e110 __libc_start_main
> @ 0xab4b41d4 (unknown)
> [FAIL]: 8 shard(s) have failed tests
> make[6]: *** [Makefile:2092: check-local] Error 8
> make[6]: Leaving directory 
> 

[jira] [Commented] (MESOS-10223) Test failures on Linux ARM64

2021-06-21 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17366756#comment-17366756
 ] 

Charles Natali commented on MESOS-10223:


[~mgrigorov]

Finally had time to have a look.

 

Using your docker image I can't reproduce it:

 
{noformat}
[ RUN ] JsonTest.Find 
[ OK ] JsonTest.Find (21 ms) 
{noformat}
 

I am however seeing other failures like:

 

 
{noformat}
[ RUN ] ProcessTest.Processes 
../../../3rdparty/stout/tests/os/process_tests.cpp:139: Failure 
 Expected: getppid() 
 Which is: 1 
To be equal to: process.parent 
 Which is: 0 
../../../3rdparty/stout/tests/os/process_tests.cpp:144: Failure 
 Expected: getsid(0) 
 Which is: 1 
To be equal to: process.session.get() 
 Which is: 0 
../../../3rdparty/stout/tests/os/process_tests.cpp:148: Failure 
Expected: (process.rss.get()) > (0), actual: 0B vs 0 
[ FAILED ] ProcessTest.Processes (9 ms)
 
{noformat}
 

However, they could be due to the fact that it's running inside docker/qemu.

 

Qemu has to do syscall ABI translation, and for example it doesn't support 
ptrace:

 
{noformat}
root@4d45b9e91754:/mesos# strace /bin/true 
/usr/bin/strace: test_ptrace_get_syscall_info: PTRACE_TRACEME: Function not 
implemented
/usr/bin/strace: ptrace(PTRACE_TRACEME, ...): Function not implemented
/usr/bin/strace: PTRACE_SETOPTIONS: Function not implemented
/usr/bin/strace: detach: waitpid(124): No child processes
/usr/bin/strace: Process 124 detached
root@4d45b9e91754:/mesos#
{noformat}
 

So it can very well cause the other failures above (which I can't debug without 
strace or gdb...).

 

Regarding your original problem, is it on a real ARM64 host or within 
docker/qemu etc?

> Test failures on Linux ARM64
> 
>
> Key: MESOS-10223
> URL: https://issues.apache.org/jira/browse/MESOS-10223
> Project: Mesos
>  Issue Type: Bug
>Reporter: Martin Tzvetanov Grigorov
>Priority: Major
> Attachments: mesos-on-arm64.tgz
>
>
> Running `make check` on Ubuntu 20.04.2 aarch64 fails with errors such as:
>  
> {code:java}
>  [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.NumberFormat
> [   OK ] JsonTest.NumberFormat (0 ms)
> [ RUN  ] JsonTest.Find
> terminate called after throwing an instance of 
> 'boost::exception_detail::clone_impl
>  >'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from 
> PID 2317; stack trace: ***
> @ 0xa80e77fc ([vdso]+0x7fb)
> @ 0xa7b71188 gsignal
> @ 0xa7b5ddac abort
> @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d715b0 __cxa_rethrow
> @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d71544 __cxa_throw
> @ 0xab4ee114 boost::throw_exception<>()
> @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>()
> @ 0xab5c2228 boost::lexical_cast<>()
> @ 0xab5bf89c numify<>()
> @ 0xab5e00e8 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5cdd2c JsonTest_Find_Test::TestBody()
> @ 0xab886fec 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87f1d4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab85a9d0 testing::Test::Run()
> @ 0xab85b258 testing::TestInfo::Run()
> @ 0xab85b8d0 testing::TestCase::Run()
> @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests()
> @ 0xab888440 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87ffd4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab86100c testing::UnitTest::Run()
> @ 0xab630950 RUN_ALL_TESTS()
> @ 0xab630418 main
> @ 0xa7b5e110 __libc_start_main
> @ 0xab4b41d4 (unknown)
> [FAIL]: 8 shard(s) have failed tests
> make[6]: *** [Makefile:2092: check-local] Error 8
> make[6]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[5]: *** [Makefile:1840: check-am] Error 2
> make[5]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[4]: *** [Makefile:1685: check-recursive] Error 1
> make[4]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[3]: *** [Makefile:1842: check] Error 2
> make[3]: Leaving directory 
> 

[jira] [Commented] (MESOS-10159) Running unit test command hangs

2021-06-17 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10159?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365141#comment-17365141
 ] 

Charles Natali commented on MESOS-10159:


[~jineshpatel]

 

I know it's been a while but if you're still experiencing the issue please let 
us know, otherwise I'll close this ticket.

 

Cheers,

> Running unit test command hangs
> ---
>
> Key: MESOS-10159
> URL: https://issues.apache.org/jira/browse/MESOS-10159
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: OS: Ubuntu 20.04
> Arch: Intel
>Reporter: Jinesh Patel
>Priority: Minor
>  Labels: test
>
> Running the `make check` command to execute mesos test cases hangs after 
> printing failed test results. The process doesn't hang if all test cases pass.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9713) Support specifying output file name for URI fetcher

2021-06-17 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9713?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365096#comment-17365096
 ] 

Charles Natali commented on MESOS-9713:
---

Great!

 

Can't find any way to send a DM on here so here's my email: cf.natali _at_ 
gmail.com

Shoot me an email and we can chat a bit about what you could do!

> Support specifying output file name for URI fetcher
> ---
>
> Key: MESOS-9713
> URL: https://issues.apache.org/jira/browse/MESOS-9713
> Project: Mesos
>  Issue Type: Improvement
>  Components: fetcher
>Reporter: Qian Zhang
>Priority: Major
>  Labels: newbie
>
> Currently URI fetcher's `fetch` method is defined like below:
> {code:java}
>   process::Future<Nothing> fetch(
>       const URI& uri,
>       const std::string& directory,
>       const Option<std::string>& data = None()) const;
> {code}
> So the caller can only specify the directory that the URI will be downloaded 
> to, but not the name of the output file, which has to be the same as the base 
> name of the URI path. We should introduce an output file name parameter so 
> that the caller can customize the output file name.
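
A hypothetical signature along those lines (a sketch of the proposal, not an 
actual Mesos API - the {{outputFileName}} parameter is the suggested addition) 
could look like:
{code:java}
process::Future<Nothing> fetch(
    const URI& uri,
    const std::string& directory,
    // Suggested addition: when None(), fall back to the current
    // behaviour of using the basename of the URI path.
    const Option<std::string>& outputFileName = None(),
    const Option<std::string>& data = None()) const;
{code}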



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10211) mesos agent crashes every time when launched tensorboard in a horovod image with mesos container

2021-06-17 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365087#comment-17365087
 ] 

Charles Natali commented on MESOS-10211:


[~ggmmggmm2], could you give more details?

> mesos agent crashes every time when launched tensorboard in a horovod image 
> with mesos container
> 
>
> Key: MESOS-10211
> URL: https://issues.apache.org/jira/browse/MESOS-10211
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.11.0
> Environment: agent:ubuntu18.04
>Reporter: YZ sun
>Priority: Critical
>
> When launching a task using image 
> "horovod/horovod:0.20.0-tf2.3.0-torch1.6.0-mxnet1.6.0.post0-py3.7-cuda10.1",
> if tensorboard in this image is started,
> the agent node will immediately crash every time.
> If tensorboard is not started by the command, mesos works as expected.
> agent log looks like below:
> {code:java}
> //agent crash
> I0127 16:07:21.860065 30960 slave.cpp:3181] Launching task 
> 'baseEnvSingle_gpunode1' for framework baseDevEnv_root_1611734806
> F0127 16:07:21.860143 30960 slave.cpp:3194] Check failed: executor == nullptr
> *** Check failure stack trace: ***
> @ 0x7f2bcc4221fc  google::LogMessage::Fail()
> @ 0x7f2bcc422145  google::LogMessage::SendToLog()
> @ 0x7f2bcc421ad1  google::LogMessage::Flush()
> @ 0x7f2bcc4251e8  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f2bca4cb10b  mesos::internal::slave::Slave::__run()
> @ 0x7f2bca570ac6  
> _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSB_INS1_13TaskGroupInfoEERKSt6vectorINS2_19ResourceVersionUUIDESaISL_EERKSB_IbEbS7_SA_SF_SJ_SP_SS_bEEvRKNS_3PIDIT_EEMSU_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_ENKUlOS5_OS8_OSD_OSH_OSN_OSQ_ObPNS_11ProcessBaseEE_clES1L_S1M_S1N_S1O_S1P_S1Q_S1R_S1T_
> @ 0x7f2bca663b01  
> _ZN5cpp176invokeIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS3_13FrameworkInfoERKNS3_12ExecutorInfoERK6OptionINS3_8TaskInfoEERKSD_INS3_13TaskGroupInfoEERKSt6vectorINS4_19ResourceVersionUUIDESaISN_EERKSD_IbEbS9_SC_SH_SL_SR_SU_bEEvRKNS1_3PIDIT_EEMSW_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS7_OSA_OSF_OSJ_OSP_OSS_ObPNS1_11ProcessBaseEE_JS7_SA_SF_SJ_SP_SS_bS1V_EEEDTclcl7forwardISW_Efp_Espcl7forwardIT0_Efp0_EEEOSW_DpOS1X_
> @ 0x7f2bca6555dc  
> _ZN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS4_13FrameworkInfoERKNS4_12ExecutorInfoERK6OptionINS4_8TaskInfoEERKSE_INS4_13TaskGroupInfoEERKSt6vectorINS5_19ResourceVersionUUIDESaISO_EERKSE_IbEbSA_SD_SI_SM_SS_SV_bEEvRKNS2_3PIDIT_EEMSX_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS8_OSB_OSG_OSK_OSQ_OST_ObPNS2_11ProcessBaseEE_JS8_SB_SG_SK_SQ_ST_bSt12_PlaceholderILi113invoke_expandIS1X_St5tupleIJS8_SB_SG_SK_SQ_ST_bS1Z_EES22_IJOS1W_EEJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7DTcl6invokecl7forwardISX_Efp_Espcl6expandcl3getIXT2_EEcl7forwardIS11_Efp0_EEcl7forwardIS12_Efp2_OSX_OS11_N5cpp1416integer_sequenceImJXspT2_OS12_
> @ 0x7f2bca64da94  
> _ZNO6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS4_13FrameworkInfoERKNS4_12ExecutorInfoERK6OptionINS4_8TaskInfoEERKSE_INS4_13TaskGroupInfoEERKSt6vectorINS5_19ResourceVersionUUIDESaISO_EERKSE_IbEbSA_SD_SI_SM_SS_SV_bEEvRKNS2_3PIDIT_EEMSX_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOS8_OSB_OSG_OSK_OSQ_OST_ObPNS2_11ProcessBaseEE_JS8_SB_SG_SK_SQ_ST_bSt12_PlaceholderILi1clIJS1W_EEEDTcl13invoke_expandcl4movedtdefpT1fEcl4movedtdefpT10bound_argsEcvN5cpp1416integer_sequenceImJLm0ELm1ELm2ELm3ELm4ELm5ELm6ELm7_Ecl16forward_as_tuplespcl7forwardIT_Efp_DpOS25_
> @ 0x7f2bca647e56  
> _ZN5cpp176invokeIN6lambda8internal7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS6_13FrameworkInfoERKNS6_12ExecutorInfoERK6OptionINS6_8TaskInfoEERKSG_INS6_13TaskGroupInfoEERKSt6vectorINS7_19ResourceVersionUUIDESaISQ_EERKSG_IbEbSC_SF_SK_SO_SU_SX_bEEvRKNS4_3PIDIT_EEMSZ_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSA_OSD_OSI_OSM_OSS_OSV_ObPNS4_11ProcessBaseEE_JSA_SD_SI_SM_SS_SV_bSt12_PlaceholderILi1EJS1Y_EEEDTclcl7forwardISZ_Efp_Espcl7forwardIT0_Efp0_EEEOSZ_DpOS23_
> @ 0x7f2bca645145  
> _ZN6lambda8internal6InvokeIvEclINS0_7PartialIZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS7_13FrameworkInfoERKNS7_12ExecutorInfoERK6OptionINS7_8TaskInfoEERKSH_INS7_13TaskGroupInfoEERKSt6vectorINS8_19ResourceVersionUUIDESaISR_EERKSH_IbEbSD_SG_SL_SP_SV_SY_bEEvRKNS5_3PIDIT_EEMS10_FvT0_T1_T2_T3_T4_T5_T6_EOT7_OT8_OT9_OT10_OT11_OT12_OT13_EUlOSB_OSE_OSJ_OSN_OST_OSW_ObPNS5_11ProcessBaseEE_JSB_SE_SJ_SN_ST_SW_bSt12_PlaceholderILi1EJS1Z_EEEvOS10_DpOT0_
> @ 

[jira] [Commented] (MESOS-10216) Replicated log key encoding overflows into negative values

2021-06-17 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365085#comment-17365085
 ] 

Charles Natali commented on MESOS-10216:


[~asekretenko]

So, what do you think: can we close this?

> Replicated log key encoding overflows into negative values
> --
>
> Key: MESOS-10216
> URL: https://issues.apache.org/jira/browse/MESOS-10216
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Affects Versions: 1.7.3, 1.8.1, 1.9.1, 1.11.0, 1.10.1, 1.12.0
>Reporter: Ilya
>Assignee: Charles Natali
>Priority: Major
> Fix For: 1.12.0
>
>
> LevelDB keys used by {{LevelDBStorage}} are {{uint64_t}} log positions 
> encoded as strings and padded with zeroes up to a certain fixed size. The 
> {{encode()}} function is incorrect because it uses the {{%d}} formatter that 
> expects an {{int}}. It also limits the key size to 10 digits which is OK for 
> {{UINT32_MAX}} but isn't enough for {{UINT64_MAX}}.
> Because of this the available key range is reduced, and key overflow can 
> result in replica's {{METADATA}} record (position 0) being overwritten, which 
> in turn may cause data loss.
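
For illustration, a correct encoding (my sketch, not the actual patch) pads to 
20 digits - the width of {{UINT64_MAX}} - and uses a 64-bit formatter:
{code:java}
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

std::string encode(uint64_t position)
{
  // "%020" PRIu64 pads to 20 digits and treats the argument as an
  // unsigned 64-bit value; "%d" expects an int, so positions above
  // INT_MAX would be rendered as negative numbers and break the
  // lexicographic key ordering.
  char key[21];
  snprintf(key, sizeof(key), "%020" PRIu64, position);
  return std::string(key);
}
{code}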



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10219) 1.11.0 does not build on Windows

2021-06-17 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17365076#comment-17365076
 ] 

Charles Natali commented on MESOS-10219:


[~acecile555]

 

As [~apeters] mentioned, the project is currently very short-staffed.

It was actually quite close to shutting down just a few weeks ago, see 
[https://lists.apache.org/thread.html/rab2a820507f7c846e54a847398ab20f47698ec5bce0c8e182bfe51ba%40%3Cdev.mesos.apache.org%3E]

 

Right now it's just Andreas and me actively contributing, and a couple other 
committers who do reviews.

 

Unfortunately, I don't have any experience with Windows. To be honest, I've 
managed to avoid working with Windows in my 15 years of experience, so I'm not 
really motivated to spend hours debugging Windows-specific issues, sorry.

 

So I think your options are:
 # Continue the work you've been doing. You've been making progress; hopefully 
you'll get there eventually, and learn in the process. As a bonus, it lets you 
familiarize yourself with the code base, in case you'd be interested in 
contributing to the project on an ongoing basis.
 # Ask for help on the users and developers mailing lists 
([http://mesos.apache.org/community/#mailing-lists]) - maybe someone who knows 
Windows will be willing to help.
 # Give up.

 

From my point of view, while I'm not willing to spend days learning about 
Windows and its various cryptic APIs, I'm willing to review your changes and 
help get them merged.

 

Good luck!

> 1.11.0 does not build on Windows
> 
>
> Key: MESOS-10219
> URL: https://issues.apache.org/jira/browse/MESOS-10219
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, build, cmake
>Affects Versions: 1.11.0
>Reporter: acecile555
>Priority: Major
> Attachments: mesos_slave_windows_longpath.png, 
> patch_1.10.0_windows_build.diff
>
>
> Hello,
>  
> I just tried building Mesos 1.11.0 on Windows and this is not working.
>  
> The first issue is libarchive compilation, which can be easily worked around by 
> adding the following hunk to 3rdparty/libarchive-3.3.2.patch:
> {noformat}
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -137,7 +137,7 @@
># This is added into CMAKE_C_FLAGS when CMAKE_BUILD_TYPE is "Debug"
># Enable level 4 C4061: The enumerate has no associated handler in a switch
>#   statement.
> -  SET(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /we4061")
> +  #SET(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /we4061")
># Enable level 4 C4254: A larger bit field was assigned to a smaller bit
>#   field.
>SET(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /we4254")
> {noformat}
> Sadly it is failing later with an issue I cannot solve myself:
> {noformat}
> C:\Users\earthlab\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot 
> open include file: 'csi/state.pb.h': No such file or directory (compiling 
> source file C:\Users\earthlab\mesos\src\slave\csi_server.cpp) 
> [C:\Users\earthlab\mesos\build\src\mesos.vcxproj]
>   qos_controller.cpp
>   resource_estimator.cpp
>   slave.cpp
>   state.cpp
>   task_status_update_manager.cpp
>   sandbox.cpp
> C:\Users\earthlab\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot 
> open include file: 'csi/state.pb.h': No such file or directory (compiling 
> source file C:\Users\earthlab\mesos\src\slave\slave.cpp) 
> [C:\Users\earthlab\mesos\build\src\mesos.vcxproj]
>   composing.cpp
>   isolator.cpp
> C:\Users\earthlab\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot 
> open include file: 'csi/state.pb.h': No such file or directory (compiling 
> source file C:\Users\earthlab\mesos\src\slave\task_status_update_manager.cpp) 
> [C:\Users\earthlab\mesos\build\src\mesos.vcxproj]
>   isolator_tracker.cpp
>   launch.cpp
> C:\Users\earthlab\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot 
> open include file: 'csi/state.pb.h': No such file or directory (compiling 
> source file C:\Users\earthlab\mesos\src\slave\containerizer\composing.cpp) 
> [C:\Users\earthlab\mesos\build\src\mesos.vcxproj]
>   launcher.cpp
> C:\Users\earthlab\mesos\src\slave\containerizer\mesos\launch.cpp(524,34): 
> error C2668: 'os::spawn': ambiguous call to overloaded function 
> [C:\Users\earthlab\mesos\build\src\mesos.vcxproj]
> C:\Users\earthlab\mesos\3rdparty\stout\include\stout/os/exec.hpp(52,20): 
> message : could be 'Option os::spawn(const std::string &,const 
> std::vector> &)' 
> [C:\Users\earthlab\mesos\build\src\mesos.vcxproj]
>   with
>   [
>   T=int
>   ] (compiling source file 
> C:\Users\earthlab\mesos\src\slave\containerizer\mesos\launch.cpp)
> C:\Users\earthlab\mesos\3rdparty\stout\include\stout/os/windows/exec.hpp(412,20):
>  message : or   'Option os::spawn(const 

[jira] [Commented] (MESOS-10219) 1.11.0 does not build on Windows

2021-06-17 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17364769#comment-17364769
 ] 

Charles Natali commented on MESOS-10219:


Hey,

 

I'm fairly proficient with POSIX and C++; however, I don't know anything about 
Windows.

I can try to have a look; however, it's not clear to me whether this builds or 
not on Windows. When you say

 
{quote}From src/slave/containerizer/mesos/isolators/filesystem/posix.cpp

It's raising the exception saying the file does not exist.
{quote}
 

I assume you mean that the agent logs an error at runtime?

If yes, then I don't think it's supposed to work at all on Windows: you 
shouldn't be using the posix isolator on Windows. The doc 
([https://mesos.apache.org/documentation/latest/configuration/agent/]) mentions:

 
{noformat}
default: windows/cpu,windows/mem on Windows; posix/cpu,posix/mem on other 
platforms) {noformat}
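
For example, starting the agent with the Windows isolators explicitly (a 
hypothetical invocation - the master address and work directory are 
placeholders; {{--isolation}} is the relevant agent flag) would look something 
like:

{noformat}
mesos-agent.exe --master=<master-host>:5050 ^
    --work_dir=C:\mesos\work ^
    --isolation=windows/cpu,windows/mem
{noformat}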
 

 

 

> 1.11.0 does not build on Windows
> 
>
> Key: MESOS-10219
> URL: https://issues.apache.org/jira/browse/MESOS-10219
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, build, cmake
>Affects Versions: 1.11.0
>Reporter: acecile555
>Priority: Major
> Attachments: patch_1.10.0_windows_build.diff
>
>
> Hello,
>  
> I just tried building Mesos 1.11.0 on Windows and this is not working.
>  
> The first issue is libarchive compilation, which can be easily worked around by 
> adding the following hunk to 3rdparty/libarchive-3.3.2.patch:
> {noformat}
> --- a/CMakeLists.txt
> +++ b/CMakeLists.txt
> @@ -137,7 +137,7 @@
># This is added into CMAKE_C_FLAGS when CMAKE_BUILD_TYPE is "Debug"
># Enable level 4 C4061: The enumerate has no associated handler in a switch
>#   statement.
> -  SET(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /we4061")
> +  #SET(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /we4061")
># Enable level 4 C4254: A larger bit field was assigned to a smaller bit
>#   field.
>SET(CMAKE_C_FLAGS_DEBUG "${CMAKE_C_FLAGS_DEBUG} /we4254")
> {noformat}
> Sadly it is failing later with an issue I cannot solve myself:
> {noformat}
> C:\Users\earthlab\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot 
> open include file: 'csi/state.pb.h': No such file or directory (compiling 
> source file C:\Users\earthlab\mesos\src\slave\csi_server.cpp) 
> [C:\Users\earthlab\mesos\build\src\mesos.vcxproj]
>   qos_controller.cpp
>   resource_estimator.cpp
>   slave.cpp
>   state.cpp
>   task_status_update_manager.cpp
>   sandbox.cpp
> C:\Users\earthlab\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot 
> open include file: 'csi/state.pb.h': No such file or directory (compiling 
> source file C:\Users\earthlab\mesos\src\slave\slave.cpp) 
> [C:\Users\earthlab\mesos\build\src\mesos.vcxproj]
>   composing.cpp
>   isolator.cpp
> C:\Users\earthlab\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot 
> open include file: 'csi/state.pb.h': No such file or directory (compiling 
> source file C:\Users\earthlab\mesos\src\slave\task_status_update_manager.cpp) 
> [C:\Users\earthlab\mesos\build\src\mesos.vcxproj]
>   isolator_tracker.cpp
>   launch.cpp
> C:\Users\earthlab\mesos\src\csi/state.hpp(22,10): fatal error C1083: Cannot 
> open include file: 'csi/state.pb.h': No such file or directory (compiling 
> source file C:\Users\earthlab\mesos\src\slave\containerizer\composing.cpp) 
> [C:\Users\earthlab\mesos\build\src\mesos.vcxproj]
>   launcher.cpp
> C:\Users\earthlab\mesos\src\slave\containerizer\mesos\launch.cpp(524,34): 
> error C2668: 'os::spawn': ambiguous call to overloaded function 
> [C:\Users\earthlab\mesos\build\src\mesos.vcxproj]
> C:\Users\earthlab\mesos\3rdparty\stout\include\stout/os/exec.hpp(52,20): 
> message : could be 'Option os::spawn(const std::string &,const 
> std::vector> &)' 
> [C:\Users\earthlab\mesos\build\src\mesos.vcxproj]
>   with
>   [
>   T=int
>   ] (compiling source file 
> C:\Users\earthlab\mesos\src\slave\containerizer\mesos\launch.cpp)
> C:\Users\earthlab\mesos\3rdparty\stout\include\stout/os/windows/exec.hpp(412,20):
>  message : or   'Option os::spawn(const std::string &,const 
> std::vector> &,const 
> Option,std::allocator  std::string,std::string &)' 
> [C:\Users\earthlab\mesos\build\src\mesos.vcxproj]
>   with
>   [
>   T=int
>   ] (compiling source file 
> C:\Users\earthlab\mesos\src\slave\containerizer\mesos\launch.cpp)
> C:\Users\earthlab\mesos\src\slave\containerizer\mesos\launch.cpp(525,75): 
> message : while trying to match the argument list '(const char [3], 
> initializer list)' [C:\Users\earthlab\mesos\build\src\mesos.vcxproj]
> C:\Users\earthlab\mesos\src\slave\containerizer\mesos\launch.cpp(893,47): 
> error C2668: 

[jira] [Commented] (MESOS-9950) memory cgroup gone before isolator cleaning up

2021-06-14 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17363126#comment-17363126
 ] 

Charles Natali commented on MESOS-9950:
---

[~subhajitpalit]

Is the agent started via systemd?

If yes, could you post the output of:


{noformat}
# systemctl show <service> | grep Delegate
{noformat}


 

 

 

> memory cgroup gone before isolator cleaning up
> --
>
> Key: MESOS-9950
> URL: https://issues.apache.org/jira/browse/MESOS-9950
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: longfei
>Priority: Major
>
> The memcg created by mesos may have been deleted before the cgroup/memory 
> isolator cleans up.
> This would make the termination fail and lose information from the old 
> termination (before the failure). 
> {code:java}
> I0821 15:16:03.025796 3354800 paths.cpp:745] Creating sandbox 
> '/opt/tiger/mesos_deploy_videoarch/mesos_zeus/slave/slaves/fb5c1a5b-e106-47c1-9fe3-6ebd311b30ee-S628/frameworks/8e4967e5-736e-4a22-90c3-7b32d526914d-/executors/mt:z03584687:1/runs/a0706ca0-fe2c-4477-8161-329b26ea5d89'
>  for user 'tiger'
> I0821 15:16:03.026199 3354800 paths.cpp:748] Creating sandbox 
> '/opt/tiger/mesos_deploy_videoarch/mesos_zeus/slave/meta/slaves/fb5c1a5b-e106-47c1-9fe3-6ebd311b30ee-S628/frameworks/8e4967e5-736e-4a22-90c3-7b32d526914d-/executors/mt:z03584687:1/runs/a0706ca0-fe2c-4477-8161-329b26ea5d89'
> I0821 15:16:03.026304 3354800 slave.cpp:9064] Launching executor 
> 'mt:z03584687:1' of framework 
> 8e4967e5-736e-4a22-90c3-7b32d526914d- with resources 
> [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}]
>  in work directory 
> '/opt/tiger/mesos_deploy_videoarch/mesos_zeus/slave/slaves/fb5c1a5b-e106-47c1-9fe3-6ebd311b30ee-S628/frameworks/8e4967e5-736e-4a22-90c3-7b32d526914d-/executors/mt:z03584687:1/runs/a0706ca0-fe2c-4477-8161-329b26ea5d89'
> I0821 15:16:03.051795 3354800 slave.cpp:3520] Launching container 
> a0706ca0-fe2c-4477-8161-329b26ea5d89 for executor 
> 'mt:z03584687:1' of framework 
> 8e4967e5-736e-4a22-90c3-7b32d526914d-
> I0821 15:16:03.076608 3354807 containerizer.cpp:1325] Starting container 
> a0706ca0-fe2c-4477-8161-329b26ea5d89
> I0821 15:16:03.076911 3354807 containerizer.cpp:3185] Transitioning the state 
> of container a0706ca0-fe2c-4477-8161-329b26ea5d89 from PROVISIONING to 
> PREPARING
> I0821 15:16:03.077906 3354802 memory.cpp:478] Started listening for OOM 
> events for container a0706ca0-fe2c-4477-8161-329b26ea5d89
> I0821 15:16:03.079540 3354804 memory.cpp:198] Updated 
> 'memory.soft_limit_in_bytes' to 4032MB for container 
> a0706ca0-fe2c-4477-8161-329b26ea5d89
> I0821 15:16:03.079587 3354820 cpu.cpp:92] Updated 'cpu.shares' to 1126 (cpus 
> 1.1) for container a0706ca0-fe2c-4477-8161-329b26ea5d89
> I0821 15:16:03.079589 3354804 memory.cpp:227] Updated 'memory.limit_in_bytes' 
> to 4032MB for container a0706ca0-fe2c-4477-8161-329b26ea5d89
> I0821 15:16:03.080901 3354802 switchboard.cpp:316] Container logger module 
> finished preparing container a0706ca0-fe2c-4477-8161-329b26ea5d89; 
> IOSwitchboard server is not required
> I0821 15:16:03.081593 3354801 linux_launcher.cpp:492] Launching container 
> a0706ca0-fe2c-4477-8161-329b26ea5d89 and cloning with namespaces
> I0821 15:16:03.083823 3354808 containerizer.cpp:2107] Checkpointing 
> container's forked pid 1857418 to 
> '/opt/tiger/mesos_deploy_videoarch/mesos_zeus/slave/meta/slaves/fb5c1a5b-e106-47c1-9fe3-6ebd311b30ee-S628/frameworks/8e4967e5-736e-4a22-90c3-7b32d526914d-/executors/mt:z03584687:1/runs/a0706ca0-fe2c-4477-8161-329b26ea5d89/pids/forked.pid'
> I0821 15:16:03.084156 3354808 containerizer.cpp:3185] Transitioning the state 
> of container a0706ca0-fe2c-4477-8161-329b26ea5d89 from PREPARING to ISOLATING
> I0821 15:16:03.091468 3354808 containerizer.cpp:3185] Transitioning the state 
> of container a0706ca0-fe2c-4477-8161-329b26ea5d89 from ISOLATING to FETCHING
> I0821 15:16:03.094933 3354808 containerizer.cpp:3185] Transitioning the state 
> of container a0706ca0-fe2c-4477-8161-329b26ea5d89 from FETCHING to RUNNING
> I0821 15:16:03.197753 3354808 memory.cpp:198] Updated 
> 'memory.soft_limit_in_bytes' to 4032MB for container 
> a0706ca0-fe2c-4477-8161-329b26ea5d89
> I0821 15:16:03.197757 3354801 cpu.cpp:92] Updated 'cpu.shares' to 1126 (cpus 
> 1.1) for container a0706ca0-fe2c-4477-8161-329b26ea5d89
> I0821 15:21:39.692978 3354814 memory.cpp:515] OOM detected for container 
> a0706ca0-fe2c-4477-8161-329b26ea5d89
> I0821 15:21:39.693182 3354805 containerizer.cpp:3044] Container 
> a0706ca0-fe2c-4477-8161-329b26ea5d89 has reached its limit for resource [] 
> 

[jira] [Comment Edited] (MESOS-10223) Test failures on Linux ARM64

2021-06-07 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358829#comment-17358829
 ] 

Charles Natali edited comment on MESOS-10223 at 6/7/21, 7:47 PM:
-

Ah, I misinterpreted:

 
{quote}If you don't have access to Linux ARM64 to reproduce it and to test 
potential fixes I've attached [^mesos-on-arm64.tgz]
{quote}
 

OK, I'll try to have a look at it sometime this week.
 
From a quick look, this part of the traceback for one of the failing tests is 
interesting:

 
{noformat}
 @ 0xb0040544 __cxa_throw
 @ 0xaddee114 boost::throw_exception<>()
 @ 0xadec512c boost::conversion::detail::throw_bad_cast<>()
 @ 0xadec2228 boost::lexical_cast<>()
 @ 0xadebf89c numify<>()
 @ 0xadf71e3c proc::pids()
 @ 0xadf73594 os::pids()
 @ 0xadf73fb4 os::processes()
{noformat}
 

Looks like it's failing to parse PIDs under {{/proc}}.

However looking at the code - 
https://github.com/apache/mesos/blob/7841fcc848ebaac5de43cd4cccf1c243a3cdff56/3rdparty/stout/include/stout/numify.hpp#L49
 - {{boost::bad_lexical_cast}} should be handled, so it's a bit strange.

Looks like a problem during building/linking, might be missing some unwinder 
symbols.
Did you cross-compile this?
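
For reference, the linked {{numify<>()}} boils down to something like this 
(paraphrased sketch, not the verbatim stout source), which is why an exception 
escaping all the way to {{std::terminate}} points at broken unwind info rather 
than a logic bug:

{noformat}
// Paraphrased sketch of stout's numify<>() - see the link above for
// the real code.
#include <string>

#include <boost/lexical_cast.hpp>

#include <stout/try.hpp>  // Try<T> and Error.

template <typename T>
Try<T> numify(const std::string& s)
{
  try {
    return boost::lexical_cast<T>(s);
  } catch (const boost::bad_lexical_cast&) {
    // If unwind tables are broken (e.g. a libunwind mismatch on
    // ARM64), the runtime never reaches this handler and calls
    // std::terminate instead - matching the "terminate called after
    // throwing" in the logs above.
    return Error("Failed to convert '" + s + "' to a number");
  }
}
{noformat}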
 


was (Author: cf.natali):
Ah, I misinterpreted:

 
{quote}If you don't have access to Linux ARM64 to reproduce it and to test 
potential fixes I've attached [^mesos-on-arm64.tgz]
{quote}
 

OK, I'll try to have a look at it sometime this week.
 
From a quick look, this part of the traceback for one of the failing tests is 
interesting:

 
{noformat}
 @ 0xb0040544 __cxa_throw
 @ 0xaddee114 boost::throw_exception<>()
 @ 0xadec512c boost::conversion::detail::throw_bad_cast<>()
 @ 0xadec2228 boost::lexical_cast<>()
 @ 0xadebf89c numify<>()
 @ 0xadf71e3c proc::pids()
 @ 0xadf73594 os::pids()
 @ 0xadf73fb4 os::processes()
{noformat}
 

Looks like it's failing to parse PIDs under {{/proc}}.

However looking at the code 
[https://github.com/apache/mesos/blob/7841fcc848ebaac5de43cd4cccf1c243a3cdff56/3rdparty/stout/include/stout/numify.hpp#L49]
 - {{boost::bad_lexical_cast}} should be handled, so it's a bit strange.

Looks like a problem during building/linking; it might be missing some 
unwinder symbols.
Did you cross-compile this?
 

> Test failures on Linux ARM64
> 
>
> Key: MESOS-10223
> URL: https://issues.apache.org/jira/browse/MESOS-10223
> Project: Mesos
>  Issue Type: Bug
>Reporter: Martin Tzvetanov Grigorov
>Priority: Major
> Attachments: mesos-on-arm64.tgz
>
>
> Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors:
>  
> {code:java}
>  [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.NumberFormat
> [   OK ] JsonTest.NumberFormat (0 ms)
> [ RUN  ] JsonTest.Find
> terminate called after throwing an instance of 
> 'boost::exception_detail::clone_impl
>  >'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from 
> PID 2317; stack trace: ***
> @ 0xa80e77fc ([vdso]+0x7fb)
> @ 0xa7b71188 gsignal
> @ 0xa7b5ddac abort
> @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d715b0 __cxa_rethrow
> @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d71544 __cxa_throw
> @ 0xab4ee114 boost::throw_exception<>()
> @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>()
> @ 0xab5c2228 boost::lexical_cast<>()
> @ 0xab5bf89c numify<>()
> @ 0xab5e00e8 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5cdd2c JsonTest_Find_Test::TestBody()
> @ 0xab886fec 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87f1d4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab85a9d0 testing::Test::Run()
> @ 0xab85b258 testing::TestInfo::Run()
> @ 0xab85b8d0 testing::TestCase::Run()
> @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests()
> @ 0xab888440 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87ffd4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab86100c testing::UnitTest::Run()
> @ 0xab630950 RUN_ALL_TESTS()
> @ 0xab630418 

[jira] [Commented] (MESOS-10223) Test failures on Linux ARM64

2021-06-07 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358829#comment-17358829
 ] 

Charles Natali commented on MESOS-10223:


Ah, I misinterpreted this:

 
{quote}If you don't have access to Linux ARM64 to reproduce it and to test 
potential fixes I've attached [^mesos-on-arm64.tgz]
{quote}
 

OK, I'll try to have a look at it sometime this week.
 
From a quick look, this part of the traceback for one of the failing tests is 
interesting:

 
{noformat}
 @ 0xb0040544 __cxa_throw
 @ 0xaddee114 boost::throw_exception<>()
 @ 0xadec512c boost::conversion::detail::throw_bad_cast<>()
 @ 0xadec2228 boost::lexical_cast<>()
 @ 0xadebf89c numify<>()
 @ 0xadf71e3c proc::pids()
 @ 0xadf73594 os::pids()
 @ 0xadf73fb4 os::processes()
{noformat}
 

Looks like it's failing to parse PIDs under {{/proc}}.

However looking at the code 
[https://github.com/apache/mesos/blob/7841fcc848ebaac5de43cd4cccf1c243a3cdff56/3rdparty/stout/include/stout/numify.hpp#L49]
 - {{boost::bad_lexical_cast}} should be handled, so it's a bit strange.

Looks like a problem during building/linking; it might be missing some 
unwinder symbols.
Did you cross-compile this?
 

> Test failures on Linux ARM64
> 
>
> Key: MESOS-10223
> URL: https://issues.apache.org/jira/browse/MESOS-10223
> Project: Mesos
>  Issue Type: Bug
>Reporter: Martin Tzvetanov Grigorov
>Priority: Major
> Attachments: mesos-on-arm64.tgz
>
>
> Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors:
>  
> {code:java}
>  [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.NumberFormat
> [   OK ] JsonTest.NumberFormat (0 ms)
> [ RUN  ] JsonTest.Find
> terminate called after throwing an instance of 
> 'boost::exception_detail::clone_impl
>  >'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from 
> PID 2317; stack trace: ***
> @ 0xa80e77fc ([vdso]+0x7fb)
> @ 0xa7b71188 gsignal
> @ 0xa7b5ddac abort
> @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d715b0 __cxa_rethrow
> @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d71544 __cxa_throw
> @ 0xab4ee114 boost::throw_exception<>()
> @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>()
> @ 0xab5c2228 boost::lexical_cast<>()
> @ 0xab5bf89c numify<>()
> @ 0xab5e00e8 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5cdd2c JsonTest_Find_Test::TestBody()
> @ 0xab886fec 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87f1d4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab85a9d0 testing::Test::Run()
> @ 0xab85b258 testing::TestInfo::Run()
> @ 0xab85b8d0 testing::TestCase::Run()
> @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests()
> @ 0xab888440 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87ffd4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab86100c testing::UnitTest::Run()
> @ 0xab630950 RUN_ALL_TESTS()
> @ 0xab630418 main
> @ 0xa7b5e110 __libc_start_main
> @ 0xab4b41d4 (unknown)
> [FAIL]: 8 shard(s) have failed tests
> make[6]: *** [Makefile:2092: check-local] Error 8
> make[6]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[5]: *** [Makefile:1840: check-am] Error 2
> make[5]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[4]: *** [Makefile:1685: check-recursive] Error 1
> make[4]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[3]: *** [Makefile:1842: check] Error 2
> make[3]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[2]: *** [Makefile:1153: check-recursive] Error 1
> make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make[1]: *** [Makefile:1306: check] Error 2
> make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make: *** [Makefile:785: check-recursive] Error 1
> {code}
>  
> {code:java}
> [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.InvalidUTF8
> [   OK ] JsonTest.InvalidUTF8 (0 ms)
> [ 

[jira] [Commented] (MESOS-10223) Test failures on Linux ARM64

2021-06-07 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358806#comment-17358806
 ] 

Charles Natali commented on MESOS-10223:


Hi [~mgrigorov],

Thanks!

Would it be possible for you to open a PR 
(https://github.com/apache/mesos/pulls) for the changes? It would be easier to 
review than a {{tar}} archive.

Cheers,

> Test failures on Linux ARM64
> 
>
> Key: MESOS-10223
> URL: https://issues.apache.org/jira/browse/MESOS-10223
> Project: Mesos
>  Issue Type: Bug
>Reporter: Martin Tzvetanov Grigorov
>Priority: Major
> Attachments: mesos-on-arm64.tgz
>
>
> Running `make check` on Ubuntu 20.04.2 aarch64 fails with such errors:
>  
> {code:java}
>  [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.NumberFormat
> [   OK ] JsonTest.NumberFormat (0 ms)
> [ RUN  ] JsonTest.Find
> terminate called after throwing an instance of 
> 'boost::exception_detail::clone_impl
>  >'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090d) received by PID 2317 (TID 0xa80d9010) from 
> PID 2317; stack trace: ***
> @ 0xa80e77fc ([vdso]+0x7fb)
> @ 0xa7b71188 gsignal
> @ 0xa7b5ddac abort
> @ 0xa7d73848 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d715b0 __cxa_rethrow
> @ 0xa7d737e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0xa7d711ec (unknown)
> @ 0xa7d71250 std::terminate()
> @ 0xa7d71544 __cxa_throw
> @ 0xab4ee114 boost::throw_exception<>()
> @ 0xab5c512c boost::conversion::detail::throw_bad_cast<>()
> @ 0xab5c2228 boost::lexical_cast<>()
> @ 0xab5bf89c numify<>()
> @ 0xab5e00e8 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5e0584 JSON::Object::find<>()
> @ 0xab5cdd2c JsonTest_Find_Test::TestBody()
> @ 0xab886fec 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87f1d4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab85a9d0 testing::Test::Run()
> @ 0xab85b258 testing::TestInfo::Run()
> @ 0xab85b8d0 testing::TestCase::Run()
> @ 0xab862344 testing::internal::UnitTestImpl::RunAllTests()
> @ 0xab888440 
> testing::internal::HandleSehExceptionsInMethodIfSupported<>()
> @ 0xab87ffd4 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @ 0xab86100c testing::UnitTest::Run()
> @ 0xab630950 RUN_ALL_TESTS()
> @ 0xab630418 main
> @ 0xa7b5e110 __libc_start_main
> @ 0xab4b41d4 (unknown)
> [FAIL]: 8 shard(s) have failed tests
> make[6]: *** [Makefile:2092: check-local] Error 8
> make[6]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[5]: *** [Makefile:1840: check-am] Error 2
> make[5]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[4]: *** [Makefile:1685: check-recursive] Error 1
> make[4]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[3]: *** [Makefile:1842: check] Error 2
> make[3]: Leaving directory 
> '/home/ubuntu/git/apache/mesos/build/3rdparty/stout'
> make[2]: *** [Makefile:1153: check-recursive] Error 1
> make[2]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make[1]: *** [Makefile:1306: check] Error 2
> make[1]: Leaving directory '/home/ubuntu/git/apache/mesos/build/3rdparty'
> make: *** [Makefile:785: check-recursive] Error 1
> {code}
>  
> {code:java}
> [--] 3 tests from JsonTest
> [ RUN  ] JsonTest.InvalidUTF8
> [   OK ] JsonTest.InvalidUTF8 (0 ms)
> [ RUN  ] JsonTest.ParseError
> terminate called after throwing an instance of 'std::overflow_error'
> terminate called recursively
> *** Aborted at 1622796321 (unix time) try "date -d @1622796321" if you are 
> using GNU date ***
> PC: @0x0 (unknown)
> *** SIGABRT (@0x3e8090c) received by PID 2316 (TID 0x918cf010) from 
> PID 2316; stack trace: ***
> @ 0x918dd7fc ([vdso]+0x7fb)
> @ 0x91367188 gsignal
> @ 0x91353dac abort
> @ 0x91569848 __gnu_cxx::__verbose_terminate_handler()
> @ 0x915671ec (unknown)
> @ 0x91567250 std::terminate()
> @ 0x915675b0 __cxa_rethrow
> @ 0x915697e4 __gnu_cxx::__verbose_terminate_handler()
> @ 0x915671ec (unknown)
> @ 0x91567250 std::terminate()
> @ 0x91567544 __cxa_throw
> 

[jira] [Commented] (MESOS-10222) Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses

2021-06-06 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358052#comment-17358052
 ] 

Charles Natali commented on MESOS-10222:


[https://github.com/apache/mesos/pull/393]

 
Together with the previous PR, this allows building with {{-Werror}} using:

{noformat}
gcc (Debian 10.2.1-6) 10.2.1 20210110
{noformat}


> Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses
> ---
>
> Key: MESOS-10222
> URL: https://issues.apache.org/jira/browse/MESOS-10222
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Martin Tzvetanov Grigorov
>Priority: Minor
> Attachments: config.log
>
>
> I am trying to build Mesos master but it fails with:
>  
> {code:java}
>  In file included from 
> ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23,
>  from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25,
>  from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14,
>  from ../3rdparty/boost-1.65.0/boost/uuid/seed_rng.hpp:38,
>  from 
> ../3rdparty/boost-1.65.0/boost/uuid/random_generator.hpp:12,
>  from ../../3rdparty/stout/include/stout/uuid.hpp:21,
>  from ../../include/mesos/type_utils.hpp:36,
>  from ../../src/master/flags.cpp:18:
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary 
> parentheses in declaration of ‘assert_arg’ [-Werror=parentheses]
>   188 | failed  (Pred::
>   | ^
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary 
> parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses]
>   193 | failed  (boost::mpl::not_::
>   | ^
> In file included from 
> ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23,
>  from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25,
>  from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14,
>  from 
> ../3rdparty/boost-1.65.0/boost/range/iterator_range_core.hpp:27,
>  from ../3rdparty/boost-1.65.0/boost/lexical_cast.hpp:30,
>  from ../../3rdparty/stout/include/stout/numify.hpp:19,
>  from ../../3rdparty/stout/include/stout/duration.hpp:29,
>  from ../../3rdparty/libprocess/include/process/time.hpp:18,
>  from ../../3rdparty/libprocess/include/process/clock.hpp:18,
>  from ../../3rdparty/libprocess/include/process/future.hpp:29,
>  from 
> ../../include/mesos/authentication/secret_generator.hpp:22,
>  from ../../src/local/local.cpp:24:
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary 
> parentheses in declaration of ‘assert_arg’ [-Werror=parentheses]
>   188 | failed  (Pred::
>   | ^
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary 
> parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses]
>   193 | failed  (boost::mpl::not_::
>   | ^
> In file included from 
> ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23,
>  from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25,
>  from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_adaptor.hpp:14,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/indirect_iterator.hpp:11,
>  from ../../include/mesos/resources.hpp:27,
>  from ../../src/master/master.hpp:31,
>  from ../../src/master/framework.cpp:17:
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary 
> parentheses in declaration of ‘assert_arg’ [-Werror=parentheses]
>   188 | failed  (Pred::
>   | ^
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary 
> parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses]
>   193 | failed  (boost::mpl::not_::
>   | ^
> In file included from 
> 

[jira] [Commented] (MESOS-10222) Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses

2021-06-06 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17358045#comment-17358045
 ] 

Charles Natali commented on MESOS-10222:


Thanks.

 

I created a PR to fix some compilation warnings in picojson: 
[https://github.com/apache/mesos/pull/392]

 

I'll have a look at boost next.

> Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses
> ---
>
> Key: MESOS-10222
> URL: https://issues.apache.org/jira/browse/MESOS-10222
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Martin Tzvetanov Grigorov
>Priority: Minor
> Attachments: config.log
>
>
> I am trying to build Mesos master but it fails with:
>  
> {code:java}
>  In file included from 
> ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23,
>  from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25,
>  from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14,
>  from ../3rdparty/boost-1.65.0/boost/uuid/seed_rng.hpp:38,
>  from 
> ../3rdparty/boost-1.65.0/boost/uuid/random_generator.hpp:12,
>  from ../../3rdparty/stout/include/stout/uuid.hpp:21,
>  from ../../include/mesos/type_utils.hpp:36,
>  from ../../src/master/flags.cpp:18:
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary 
> parentheses in declaration of ‘assert_arg’ [-Werror=parentheses]
>   188 | failed  (Pred::
>   | ^
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary 
> parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses]
>   193 | failed  (boost::mpl::not_::
>   | ^
> In file included from 
> ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23,
>  from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25,
>  from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14,
>  from 
> ../3rdparty/boost-1.65.0/boost/range/iterator_range_core.hpp:27,
>  from ../3rdparty/boost-1.65.0/boost/lexical_cast.hpp:30,
>  from ../../3rdparty/stout/include/stout/numify.hpp:19,
>  from ../../3rdparty/stout/include/stout/duration.hpp:29,
>  from ../../3rdparty/libprocess/include/process/time.hpp:18,
>  from ../../3rdparty/libprocess/include/process/clock.hpp:18,
>  from ../../3rdparty/libprocess/include/process/future.hpp:29,
>  from 
> ../../include/mesos/authentication/secret_generator.hpp:22,
>  from ../../src/local/local.cpp:24:
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary 
> parentheses in declaration of ‘assert_arg’ [-Werror=parentheses]
>   188 | failed  (Pred::
>   | ^
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary 
> parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses]
>   193 | failed  (boost::mpl::not_::
>   | ^
> In file included from 
> ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23,
>  from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25,
>  from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_adaptor.hpp:14,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/indirect_iterator.hpp:11,
>  from ../../include/mesos/resources.hpp:27,
>  from ../../src/master/master.hpp:31,
>  from ../../src/master/framework.cpp:17:
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary 
> parentheses in declaration of ‘assert_arg’ [-Werror=parentheses]
>   188 | failed  (Pred::
>   | ^
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary 
> parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses]
>   193 | failed  (boost::mpl::not_::
>   | ^
> In file included from 
> ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23,
>   

[jira] [Commented] (MESOS-10222) Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses

2021-06-03 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17356602#comment-17356602
 ] 

Charles Natali commented on MESOS-10222:


Yes, I'm seeing a similar error on my Debian bullseye:

 
{noformat}
cf@thinkpad:~$ gcc --version
gcc (Debian 10.2.1-6) 10.2.1 20210110 
{noformat}
 

I'm also seeing warnings in one of our JSON libraries.

 

[~asekretenko]

Looking at this is next on my list; however, assuming that the warnings have 
been fixed upstream, will we want to:
 * update to the version fixing them
 * or cherry-pick individual fixes?

 

More generally, what's the policy for updating third-party dependencies?

> Build failure in 3rdparty/boost-1.65.0 with -Werror=parentheses
> ---
>
> Key: MESOS-10222
> URL: https://issues.apache.org/jira/browse/MESOS-10222
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Martin Tzvetanov Grigorov
>Priority: Minor
> Attachments: config.log
>
>
> I am trying to build Mesos master but it fails with:
>  
> {code:java}
>  In file included from 
> ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23,
>  from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25,
>  from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14,
>  from ../3rdparty/boost-1.65.0/boost/uuid/seed_rng.hpp:38,
>  from 
> ../3rdparty/boost-1.65.0/boost/uuid/random_generator.hpp:12,
>  from ../../3rdparty/stout/include/stout/uuid.hpp:21,
>  from ../../include/mesos/type_utils.hpp:36,
>  from ../../src/master/flags.cpp:18:
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary 
> parentheses in declaration of ‘assert_arg’ [-Werror=parentheses]
>   188 | failed  (Pred::
>   | ^
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary 
> parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses]
>   193 | failed  (boost::mpl::not_::
>   | ^
> In file included from 
> ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23,
>  from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25,
>  from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_facade.hpp:14,
>  from 
> ../3rdparty/boost-1.65.0/boost/range/iterator_range_core.hpp:27,
>  from ../3rdparty/boost-1.65.0/boost/lexical_cast.hpp:30,
>  from ../../3rdparty/stout/include/stout/numify.hpp:19,
>  from ../../3rdparty/stout/include/stout/duration.hpp:29,
>  from ../../3rdparty/libprocess/include/process/time.hpp:18,
>  from ../../3rdparty/libprocess/include/process/clock.hpp:18,
>  from ../../3rdparty/libprocess/include/process/future.hpp:29,
>  from 
> ../../include/mesos/authentication/secret_generator.hpp:22,
>  from ../../src/local/local.cpp:24:
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary 
> parentheses in declaration of ‘assert_arg’ [-Werror=parentheses]
>   188 | failed  (Pred::
>   | ^
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:193:21: error: unnecessary 
> parentheses in declaration of ‘assert_not_arg’ [-Werror=parentheses]
>   193 | failed  (boost::mpl::not_::
>   | ^
> In file included from 
> ../3rdparty/boost-1.65.0/boost/mpl/aux_/na_assert.hpp:23,
>  from ../3rdparty/boost-1.65.0/boost/mpl/arg.hpp:25,
>  from ../3rdparty/boost-1.65.0/boost/mpl/placeholders.hpp:24,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_categories.hpp:17,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/iterator_adaptor.hpp:14,
>  from 
> ../3rdparty/boost-1.65.0/boost/iterator/indirect_iterator.hpp:11,
>  from ../../include/mesos/resources.hpp:27,
>  from ../../src/master/master.hpp:31,
>  from ../../src/master/framework.cpp:17:
> ../3rdparty/boost-1.65.0/boost/mpl/assert.hpp:188:21: error: unnecessary 
> parentheses in declaration of ‘assert_arg’ [-Werror=parentheses]
>   188 | failed  (Pred::
>   |

[jira] [Commented] (MESOS-10216) Replicated log key encoding overflows into negative values

2021-06-02 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17355896#comment-17355896
 ] 

Charles Natali commented on MESOS-10216:


Tough one, although I would err on the side of safety and not backport, since 
it's been present since basically forever and realistically only affects a few 
users.

> Replicated log key encoding overflows into negative values
> --
>
> Key: MESOS-10216
> URL: https://issues.apache.org/jira/browse/MESOS-10216
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Affects Versions: 1.7.3, 1.8.1, 1.9.1, 1.11.0, 1.10.1, 1.12.0
>Reporter: Ilya
>Assignee: Charles Natali
>Priority: Major
> Fix For: 1.12.0
>
>
> LevelDB keys used by {{LevelDBStorage}} are {{uint64_t}} log positions 
> encoded as strings and padded with zeroes up to a certain fixed size. The 
> {{encode()}} function is incorrect because it uses the {{%d}} formatter that 
> expects an {{int}}. It also limits the key size to 10 digits which is OK for 
> {{UINT32_MAX}} but isn't enough for {{UINT64_MAX}}.
> Because of this the available key range is reduced, and key overflow can 
> result in replica's {{METADATA}} record (position 0) being overwritten, which 
> in turn may cause data loss.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10221) A large number of TASK_LOST causes the task to be unable to run

2021-05-30 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17354096#comment-17354096
 ] 

Charles Natali commented on MESOS-10221:


> In addition, according to the framework's log, the accept is sent immediately 
> after the offer is received, but in the master log the accept arrives far 
> behind the offer being sent. Has the accept not been processed immediately, 
> or do I have a wrong understanding of when the offer was sent?

 

Yeah, that looks suspicious - it'd be good to have the full logs of the master 
and framework so we can compare the timestamps of:
 * the offer being sent by the master
 * the offer being received by the framework
 * the accept being sent by the framework
 * the accept being received by the master

 

> A large number of TASK_LOST causes the task to be unable to run
> ---
>
> Key: MESOS-10221
> URL: https://issues.apache.org/jira/browse/MESOS-10221
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.9.0, 1.11.0
> Environment: Ubuntu 16.04
>Reporter: clancyhuang
>Priority: Major
>
> Recently, we found that the mesos master frequently generates Task lost 
> exceptions after task submission, and retrying in a short period of time is 
> not feasible, and it is becoming more and more frequent.
>  We selected two abnormal logs
> {code:java}
> I0528 15:09:55.367336   964 master.cpp:9579] Sending offers [ 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13236, 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] to framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:10:25.369561   969 master.cpp:11878] Removing offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237
> I0528 15:10:43.383028   959 http.cpp:1436] HTTP POST for 
> /master/api/v1/scheduler from 10.118.28.66:50484 with 
> User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> I0528 15:10:43.383656   959 master.cpp:5434] Processing DECLINE call for 
> offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] for framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) with 5 
> seconds filter
> I0528 15:10:03.385080   971 master.cpp:9579] Sending offers [ 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ] to framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:10:33.386322   972 master.cpp:11878] Removing offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238
> I0528 15:10:57.181581   967 http.cpp:1436] HTTP POST for 
> /master/api/v1/scheduler from 10.118.28.66:50484 with 
> User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> W0528 15:10:57.183194   967 master.cpp:3959] Ignoring accept of offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 since it is no longer valid
> W0528 15:10:57.183265   967 master.cpp:3964] ACCEPT call used invalid offers 
> '[ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ]': Offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid
> I0528 15:10:57.184392   967 master.cpp:8212] Sending status update TASK_LOST 
> for task data_rename-ebad5d27-df72-4106-96ab-ba6432befba9 of framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 'Task launched with invalid offers: 
> Offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid'
> {code}
> The following is a log of normal execution
> {code:java}
> I0528 15:17:03.690855   959 master.cpp:9579] Sending offers [ 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529, 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13530 ] to framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:17:03.742848   970 http.cpp:1436] HTTP POST for 
> /master/api/v1/scheduler from 10.118.28.66:50484 with 
> User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> I0528 15:17:03.745221   970 master.cpp:4356] Processing ACCEPT call for 
> offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529 ] on agent 
> cbe540a8-c894-4655-a899-cec7463d00c9-S2 at slave(1)@ip:5053 (ip) for 
> framework 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:17:03.745889   970 master.cpp:11878] Removing offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13529
> {code}
> We found that the offer was cancelled before accept when the exception 
> occurred,and the interval time is just the configured offer-timeout. Our 
> framework communicates with mesos based on http, I am sure that he sends the 
> accept message immediately after receiving the offer and the request is 
> successful.
>  The question is why sometimes the master processes the accept message after 
> the offer times out. In addition, we tried to increase the offer-timeout, but 
> the problem was not resolved



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10221) A large number of TASK_LOST causes the task to be unable to run

2021-05-29 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17353855#comment-17353855
 ] 

Charles Natali commented on MESOS-10221:


Hey [~934341445],

Receiving {{TASK_LOST}} upon a stale offer is perfectly fine and can occur in a 
normal, healthy cluster, so it should be handled by the framework.

Here's an annotated log:
{noformat}
# the master sends an offer to the framework
I0528 15:10:03.385080   971 master.cpp:9579] Sending offers [ 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ] to framework 
24b62b35-26d6-4a13-ba75-
d84ce5fed64e-0005 (Test HTTP Framework)
# the master removes the offer: from that point, it is not valid, and any task 
submitted against it will be rejected with TASK_LOST
I0528 15:10:33.386322   972 master.cpp:11878] Removing offer 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238
# here the master receives an ACCEPT from the framework using this offer, which 
isn't valid anymore
I0528 15:10:57.181581   967 http.cpp:1436] HTTP POST for 
/master/api/v1/scheduler from 10.118.28.66:50484 with 
User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
# and therefore rejects it
W0528 15:10:57.183265   967 master.cpp:3964] ACCEPT call used invalid offers '[ 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ]': Offer 
2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid
I0528 15:10:57.184392   967 master.cpp:8212] Sending status update TASK_LOST 
for task data_rename-ebad5d27-df72-4106-96ab-ba6432befba9 of framework 
24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 'Task launched with invalid offers: 
Offer 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid'
{noformat}
 

However, one thing I notice in the above log is that there is a 24s gap 
between the master removing the offer (at 15:10:33.386322) and the framework 
trying to accept it (at 15:10:57.181581): normally, the master should have sent 
a {{RESCIND}} to the framework when the offer was removed (see 
[http://mesos.apache.org/documentation/latest/scheduler-http-api/#rescind]).
 Does your framework handle RESCIND? If not, this would make such rejections 
with {{TASK_LOST}} much more frequent than if it did.
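
For illustration, handling it with the v1 scheduler API looks roughly like 
this - just a sketch, assuming the framework keeps its pending offers in a map 
keyed by offer ID:

{code:cpp}
#include <map>
#include <string>

#include <mesos/v1/mesos.hpp>
#include <mesos/v1/scheduler.hpp>

using mesos::v1::Offer;
using mesos::v1::scheduler::Event;

// Offers received from the master but not yet accepted or declined.
std::map<std::string, Offer> pendingOffers;

void received(const Event& event)
{
  switch (event.type()) {
    case Event::OFFERS:
      for (const Offer& offer : event.offers().offers()) {
        pendingOffers[offer.id().value()] = offer;
      }
      break;
    case Event::RESCIND:
      // The master invalidated this offer; forget it now instead of
      // accepting it later and getting TASK_LOST.
      pendingOffers.erase(event.rescind().offer_id().value());
      break;
    default:
      break;
  }
}
{code}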

Also, do you know what triggered the offer to be removed? One common cause is 
an agent getting disconnected, for example - does that happen a lot in your 
cluster?
 And what happened in this specific example? I'm surprised not to see more 
context in the log - did you filter out some lines?

> A large number of TASK_LOST causes the task to be unable to run
> ---
>
> Key: MESOS-10221
> URL: https://issues.apache.org/jira/browse/MESOS-10221
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.9.0, 1.11.0
> Environment: Ubuntu 16.04
>Reporter: clancyhuang
>Priority: Major
>
> Recently, we found that the mesos master frequently generates Task lost 
> exceptions after task submission, and retrying in a short period of time is 
> not feasible, and it is becoming more and more frequent.
>  We selected two abnormal logs
> {code:java}
> I0528 15:09:55.367336   964 master.cpp:9579] Sending offers [ 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13236, 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] to framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:10:25.369561   969 master.cpp:11878] Removing offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237
> I0528 15:10:43.383028   959 http.cpp:1436] HTTP POST for 
> /master/api/v1/scheduler from 10.118.28.66:50484 with 
> User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> I0528 15:10:43.383656   959 master.cpp:5434] Processing DECLINE call for 
> offers: [ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13237 ] for framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework) with 5 
> seconds filter
> I0528 15:10:03.385080   971 master.cpp:9579] Sending offers [ 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ] to framework 
> 24b62b35-26d6-4a13-ba75-d84ce5fed64e-0005 (Test HTTP Framework)
> I0528 15:10:33.386322   972 master.cpp:11878] Removing offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238
> I0528 15:10:57.181581   967 http.cpp:1436] HTTP POST for 
> /master/api/v1/scheduler from 10.118.28.66:50484 with 
> User-Agent='Apache-HttpClient/4.5.12 (Java/1.8.0_272)'
> W0528 15:10:57.183194   967 master.cpp:3959] Ignoring accept of offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 since it is no longer valid
> W0528 15:10:57.183265   967 master.cpp:3964] ACCEPT call used invalid offers 
> '[ 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 ]': Offer 
> 2bf252e0-4d5a-4590-a696-0727c85be3bc-O13238 is no longer valid
> I0528 15:10:57.184392   967 master.cpp:8212] Sending status update TASK_LOST 
> for task data_rename-ebad5d27-df72-4106-96ab-ba6432befba9 

[jira] [Commented] (MESOS-10196) The task program runs successfully but the task status is failed

2021-05-29 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17353705#comment-17353705
 ] 

Charles Natali commented on MESOS-10196:


Thanks [~934341445] for confirming - I'll close this ticket then.

>  The task program runs successfully but the task status is failed
> -
>
> Key: MESOS-10196
> URL: https://issues.apache.org/jira/browse/MESOS-10196
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.9.0, 1.10.0
> Environment: Ubuntu 16.04
> mesos master 1.10.0
> mesos slave 1.9.0
> python 3.7.3
>Reporter: clancyhuang
>Priority: Major
>
> When testing mesos to execute the task by default executor, I found that the 
> task status is failed but in fact the task was executed successfully.I tested 
> two shell scripts, one is very simple
> {code:sh}
> python -V > /root/test.txt
> {code}
> ,The other is a script about image processing.
>  I am sure they are all working properly, but I get an 
> error:REASON_EXECUTOR_TERMINATED.
>  The stderr of the task has no output, and the stdout is correct,the mesos 
> agent has such log output
> {code:bash}
> I1104 11:34:35.337236 35682 slave.cpp:3657] Launching container 
> 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef for executor 'default' of framework 
> d915071b-c275-4321-afd5-134b86ebadf3-0002
> I1104 11:34:35.337371 35685 containerizer.cpp:1396] Starting container 
> 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef
> I1104 11:34:35.337563 35685 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from STARTING to 
> PROVISIONING after 76800ns
> I1104 11:34:35.338893 35685 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from PROVISIONING to 
> PREPARING after 1.321216ms
> I1104 11:34:35.340224 35703 switchboard.cpp:316] Container logger module 
> finished preparing container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef; 
> IOSwitchboard server is not required
> I1104 11:34:35.341944 35707 linux_launcher.cpp:492] Launching container 
> 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef and cloning with namespaces
> I1104 11:34:35.346983 35704 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from PREPARING to ISOLATING 
> after 8.082944ms
> I1104 11:34:35.347719 35704 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from ISOLATING to FETCHING 
> after 730880ns
> I1104 11:34:35.348254 35737 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from FETCHING to RUNNING 
> after 539136ns
> I1104 11:34:58.060906 35680 slave.cpp:7406] Current disk usage 73.86%. Max 
> allowed age: 1.130070981247558days
> I1104 11:35:58.062266 35708 slave.cpp:7406] Current disk usage 73.86%. Max 
> allowed age: 1.129549109651991days
> I1104 11:36:58.062948 35741 slave.cpp:7406] Current disk usage 73.87%. Max 
> allowed age: 1.129005310066273days
> I1104 11:37:58.063513 35703 slave.cpp:7406] Current disk usage 73.88%. Max 
> allowed age: 1.128437717518472days
> I1104 11:38:30.242969 35740 containerizer.cpp:3161] Container 
> 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef has exited
> I1104 11:38:30.243052 35740 containerizer.cpp:2620] Destroying container 
> 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef in RUNNING state
> I1104 11:38:30.243072 35740 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from RUNNING to DESTROYING 
> after 3.9149140821mins
> I1104 11:38:30.243252 35672 linux_launcher.cpp:576] Asked to destroy 
> container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef
> I1104 11:38:30.243350 35672 linux_launcher.cpp:618] Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef'
> I1104 11:38:30.243768 35679 cgroups.cpp:2854] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef
> I1104 11:38:30.243961 35671 cgroups.cpp:1242] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef after 
> 110848ns
> I1104 11:38:30.244160 35683 cgroups.cpp:2872] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef
> I1104 11:38:30.244272 35683 cgroups.cpp:1271] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef after 
> 67840ns
> I1104 11:38:30.244668 35690 linux_launcher.cpp:650] Destroying cgroup 
> '/sys/fs/cgroup/systemd/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef'
> I1104 11:38:30.245975 35726 slave.cpp:6856] Executor 'default' of framework 
> d915071b-c275-4321-afd5-134b86ebadf3-0002 exited with status 0
> I1104 11:38:30.246995 35726 slave.cpp:5737] Handling status update 
> TASK_FAILED (Status 

[jira] [Commented] (MESOS-10216) Replicated log key encoding overflows into negative values

2021-05-14 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344916#comment-17344916
 ] 

Charles Natali commented on MESOS-10216:


OK, I think the code in question is 
https://github.com/apache/mesos/blob/b8bfef6db158646df9fea6968bc75e88c32c3e21/src/log/leveldb.cpp#L101

The code indeed looks like it could suffer from overflow; however, I'm not 
familiar with this part of the code base, so I'll spend some time understanding 
exactly whether it can be a problem in practice.
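
To make the suspected overflow concrete, here is a sketch of the two 
encodings - illustrative only, not the actual {{leveldb.cpp}} code:

{code:cpp}
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Suspected bug: a uint64_t formatted with "%d" (which expects an int) and
// padded to only 10 digits - enough for UINT32_MAX but not UINT64_MAX. The
// cast is added here so the sketch itself is well-defined; the real bug
// passes the uint64_t straight through.
std::string encodeBuggy(uint64_t position)
{
  char buffer[11];
  snprintf(buffer, sizeof(buffer), "%.10d", static_cast<int>(position));
  return buffer;
}

// Possible fix: the matching PRIu64 conversion, padded to the 20 digits
// needed for UINT64_MAX, so that lexicographic key order still matches
// numeric log position order.
std::string encodeFixed(uint64_t position)
{
  char buffer[21];
  snprintf(buffer, sizeof(buffer), "%.20" PRIu64, position);
  return buffer;
}
{code}

With the 10-digit variant, positions past {{INT_MAX}} wrap into negative 
values, which is consistent with the overwritten position-0 {{METADATA}} 
record described below.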

> Replicated log key encoding overflows into negative values
> --
>
> Key: MESOS-10216
> URL: https://issues.apache.org/jira/browse/MESOS-10216
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Affects Versions: 1.11.0
>Reporter: Ilya
>Priority: Major
>
> LevelDB keys used by {{LevelDBStorage}} are {{uint64_t}} log positions 
> encoded as strings and padded with zeroes up to a certain fixed size. The 
> {{encode()}} function is incorrect because it uses the {{%d}} formatter that 
> expects an {{int}}. It also limits the key size to 10 digits which is OK for 
> {{UINT32_MAX}} but isn't enough for {{UINT64_MAX}}.
> Because of this the available key range is reduced, and key overflow can 
> result in replica's {{METADATA}} record (position 0) being overwritten, which 
> in turn may cause data loss.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10218) Mesos slave fails to connect after enabling ssl

2021-05-14 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344912#comment-17344912
 ] 

Charles Natali commented on MESOS-10218:


OK, then it's maybe fine to close, [~apeters]?

> Mesos slave fails to connect after enabling ssl
> ---
>
> Key: MESOS-10218
> URL: https://issues.apache.org/jira/browse/MESOS-10218
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.9.0
>Reporter: prasadkulkarni0711
>Priority: Major
>
> Mesos agent fails to connect to the master after setting the following 
> variables:
> LIBPROCESS_SSL_ENABLED=1
> LIBPROCESS_SSL_KEY_FILE=/etc/mesos/conf/ssl/server.key
> LIBPROCESS_SSL_CERT_FILE=/etc/mesos/conf/ssl/server.pem
> LIBPROCESS_SSL_REQUIRE_CERT=false
> LIBPROCESS_SSL_VERIFY_SERVER_CERT=false
> LIBPROCESS_SSL_REQUIRE_CLIENT_CERT=false
> LIBPROCESS_SSL_HOSTNAME_VALIDATION_SCHEME=openssl
> LIBPROCESS_SSL_VERIFY_CERT=false
> LIBPROCESS_SSL_CA_DIR=/etc/mesos/conf/ssl
> LIBPROCESS_SSL_CA_FILE=/etc/mesos/conf/ssl/ca.pem
> LIBPROCESS_SSL_SUPPORT_DOWNGRADE=false
> LIBPROCESS_SSL_VERIFY_IPADD=false
> #LIBPROCESS_SSL_ENABLE_TLS_V1_2=true
> Error in logs:
> Failed to accept socket: Failed accept: connection error: error:1407609C:SSL 
> routines:SSL23_GET_CLIENT_HELLO:http request
> Connectivity works after setting:
> LIBPROCESS_SSL_SUPPORT_DOWNGRADE=true
> But then the sandbox fails to open in the web UI:
> Potential reasons:
>  * The agent is not accessible
>  * The agent timed out or went offline
> With the following error in the logs:
> Failed to recv on socket 38 to peer 'unknown': Failed recv, connection error: 
> Connection reset by peer



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-9776) Mention removal of *.json endpoints in 1.8.0 CHANGELOG

2021-05-13 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-9776?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344173#comment-17344173
 ] 

Charles Natali commented on MESOS-9776:
---

Since 1.8.0 was released a while ago, this can probably be closed now.

> Mention removal of *.json endpoints in 1.8.0 CHANGELOG
> --
>
> Key: MESOS-9776
> URL: https://issues.apache.org/jira/browse/MESOS-9776
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benno Evers
>Priority: Major
>
> We should mention in the CHANGELOG and update notes that the *.json that were 
> deprecated in Mesos 0.25 were actually removed in Mesos 1.8.0.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10216) Replicated log key encoding overflows into negative values

2021-05-13 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17344104#comment-17344104
 ] 

Charles Natali commented on MESOS-10216:


[~ipronin] Any chance you could point at the offending code?

> Replicated log key encoding overflows into negative values
> --
>
> Key: MESOS-10216
> URL: https://issues.apache.org/jira/browse/MESOS-10216
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Affects Versions: 1.11.0
>Reporter: Ilya
>Priority: Major
>
> LevelDB keys used by {{LevelDBStorage}} are {{uint64_t}} log positions 
> encoded as strings and padded with zeroes up to a certain fixed size. The 
> {{encode()}} function is incorrect because it uses the {{%d}} formatter that 
> expects an {{int}}. It also limits the key size to 10 digits which is OK for 
> {{UINT32_MAX}} but isn't enough for {{UINT64_MAX}}.
> Because of this the available key range is reduced, and key overflow can 
> result in replica's {{METADATA}} record (position 0) being overwritten, which 
> in turn may cause data loss.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10220) ldcache::parse failed to parse newer ld.so.cahce

2021-05-12 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343587#comment-17343587
 ] 

Charles Natali commented on MESOS-10220:


{{ldcache::parse}} is used in the rootfs code 
(https://github.com/apache/mesos/blob/96339efb53f7cdf1126ead7755d2b83b435e3263/src/tests/containerizer/rootfs.cpp#L123)
 and in the GPU isolator 
(https://github.com/apache/mesos/blob/96339efb53f7cdf1126ead7755d2b83b435e3263/src/slave/containerizer/mesos/isolators/gpu/volume.cpp#L368),
 so it would affect starting tasks but AFAICT not starting the master or agent.
In any case it should be easy to fix; I'll look at it.
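
The fix will presumably boil down to recognising the new cache header - a 
sketch of the idea; the exact layout needs to be checked against glibc's 
{{dl-cache.h}}:

{code:cpp}
#include <cstring>
#include <string>

// Old-format caches start with this magic string...
static const char OLD_MAGIC[] = "ld.so-1.7.0";

// ...while caches written by newer glibc (2.31 dropped the old format by
// default) start with this one.
static const char NEW_MAGIC[] = "glibc-ld.so.cache1.1";

bool isOldFormat(const std::string& data)
{
  return data.size() >= sizeof(OLD_MAGIC) - 1 &&
      std::memcmp(data.data(), OLD_MAGIC, sizeof(OLD_MAGIC) - 1) == 0;
}

bool isNewFormat(const std::string& data)
{
  return data.size() >= sizeof(NEW_MAGIC) - 1 &&
      std::memcmp(data.data(), NEW_MAGIC, sizeof(NEW_MAGIC) - 1) == 0;
}
{code}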

> ldcache::parse failed to parse newer ld.so.cahce
> 
>
> Key: MESOS-10220
> URL: https://issues.apache.org/jira/browse/MESOS-10220
> Project: Mesos
>  Issue Type: Bug
>Reporter: Minh H.G.
>Assignee: Charles Natali
>Priority: Minor
>
> In glibc 2.31, the ld.so.cache file no longer supports the old format (the one 
> starting with "ld.so-1.7.0").
> That causes ldcache::parse to fail, so Mesos cannot start.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (MESOS-10220) ldcache::parse failed to parse newer ld.so.cahce

2021-05-12 Thread Charles Natali (Jira)


 [ 
https://issues.apache.org/jira/browse/MESOS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Charles Natali reassigned MESOS-10220:
--

Assignee: Charles Natali

> ldcache::parse failed to parse newer ld.so.cahce
> 
>
> Key: MESOS-10220
> URL: https://issues.apache.org/jira/browse/MESOS-10220
> Project: Mesos
>  Issue Type: Bug
>Reporter: Minh H.G.
>Assignee: Charles Natali
>Priority: Minor
>
> In glibc 2.31, the ld.so.cache file no longer supports the old format (the one 
> starting with "ld.so-1.7.0").
> That causes ldcache::parse to fail, so Mesos cannot start.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-10220) ldcache::parse failed to parse newer ld.so.cahce

2021-05-12 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343570#comment-17343570
 ] 

Charles Natali edited comment on MESOS-10220 at 5/12/21, 8:56 PM:
--

And also, it'd be great if you could attach your {{ld.so.cache}} - it'll be 
easier to test.

Never mind: the cache in the new format can easily be reproduced with 
{{ldconfig -c new}}.

However both the master and agent seem to start fine with it, so it'd be really 
helpful to have a log if they fail to start.


was (Author: cf.natali):
And also, it'd be great if you could attach your {{ld.so.cache}} - it'll be 
easier to test.

> ldcache::parse failed to parse newer ld.so.cahce
> 
>
> Key: MESOS-10220
> URL: https://issues.apache.org/jira/browse/MESOS-10220
> Project: Mesos
>  Issue Type: Bug
>Reporter: Minh H.G.
>Priority: Minor
>
> In glibc 2.31, the ld.so.cache file no longer supports the old format (the one 
> starting with "ld.so-1.7.0").
> That causes ldcache::parse to fail, so Mesos cannot start.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10220) ldcache::parse failed to parse newer ld.so.cahce

2021-05-12 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343570#comment-17343570
 ] 

Charles Natali commented on MESOS-10220:


And also, it'd be great if you could attach your {{ld.so.cache}} - it'll be 
easier to test.

> ldcache::parse failed to parse newer ld.so.cahce
> 
>
> Key: MESOS-10220
> URL: https://issues.apache.org/jira/browse/MESOS-10220
> Project: Mesos
>  Issue Type: Bug
>Reporter: Minh H.G.
>Priority: Minor
>
> In glibc 2.31, the ld.so.cache file no longer supports the old format (the one 
> starting with "ld.so-1.7.0").
> That causes ldcache::parse to fail, so Mesos cannot start.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10220) ldcache::parse failed to parse newer ld.so.cahce

2021-05-12 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17343555#comment-17343555
 ] 

Charles Natali commented on MESOS-10220:


Hey [~hgminh],

Thanks for the report - would it be possible to attach a log of the agent or 
master when they fail to start?

> ldcache::parse failed to parse newer ld.so.cahce
> 
>
> Key: MESOS-10220
> URL: https://issues.apache.org/jira/browse/MESOS-10220
> Project: Mesos
>  Issue Type: Bug
>Reporter: Minh H.G.
>Priority: Minor
>
> In glibc 2.31, the ld.so.cache file no longer supports the old format (the one 
> starting with "ld.so-1.7.0").
> That causes ldcache::parse to fail, so Mesos cannot start.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10105) Make tests of builds with -fsanitize=address/memory/undefined/thread pass.

2021-04-15 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17322254#comment-17322254
 ] 

Charles Natali commented on MESOS-10105:


Regarding {{-fsanitize=address}}, leak detection can be disabled with 
{{ASAN_OPTIONS=detect_leaks=0}}.
However, a couple of tests still fail:

{noformat}
[  FAILED  ] HTTPCommandExecutorTest.TerminateWithACK
[  FAILED  ] PosixRLimitsIsolatorTest.UnsetLimits
[  FAILED  ] MesosContainerizer/DefaultExecutorTest.KillTask/0, where GetParam() = "mesos"
[  FAILED  ] MesosContainerizer/DefaultExecutorTest.CommitSuicideOnTaskFailure/0, where GetParam() = "mesos"
[  FAILED  ] MesosContainerizer/DefaultExecutorTest.CommitSuicideOnKillTask/0, where GetParam() = "mesos"
[  FAILED  ] MesosContainerizer/DefaultExecutorTest.MaxCompletionTime/0, where GetParam() = "mesos"
{noformat}

All of them except {{PosixRLimitsIsolatorTest.UnsetLimits}} fail because they 
don't propagate the {{ASAN_OPTIONS}} environment variable.
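
The likely fix is to copy the variable through wherever the tests build a 
child environment from scratch - a sketch using stout's {{os::getenv}} (the 
exact call sites still need to be tracked down):

{code:cpp}
#include <map>
#include <string>

#include <stout/option.hpp>
#include <stout/os.hpp>

// When a test launches an executor with a hand-built environment, the
// sanitizer's runtime options are lost unless explicitly forwarded.
void forwardAsanOptions(std::map<std::string, std::string>* environment)
{
  const Option<std::string> asan = os::getenv("ASAN_OPTIONS");
  if (asan.isSome()) {
    (*environment)["ASAN_OPTIONS"] = asan.get();
  }
}
{code}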

> Make tests of builds with  -fsanitize=address/memory/undefined/thread pass.
> ---
>
> Key: MESOS-10105
> URL: https://issues.apache.org/jira/browse/MESOS-10105
> Project: Mesos
>  Issue Type: Wish
>Reporter: Andrei Sekretenko
>Priority: Critical
>
> As exemplified by various C++ projects and also by targeting specific issues 
> in Mesos (for example, MESOS-10102), running code built with clang sanitizers 
> helps with uncovering undefined behavior and data races.
> Sanitizer adoption usually happens as a sequence of steps which unblock each 
> other:
> 1) making local tests pass under sanitizer at least once
> 2) making CI regularly run sanitizer builds (so that new sanitizable bugs are 
> not introduced and more bugs not triggered deterministically are uncovered)
> 3) running high-level integration tests, betas, etc. with sanitizer builds
> --
> (3) is definitely out of scope of this wish, and it is not clear if (2) will 
> fit into ASF CI, but (1) is definitely doable, and on its own can lead to 
> figuring out causes of mysterious rare bugs (which might turn out to be not 
> so rare under certain conditions).
> --
> State of Mesos w.r.t sanitizers:
>  - as of Mar 2020, Mesos tests built with -fsanitize=address crash due to 
> several locations that leak one object per thread lifetime
>  - as of Nov 2019, libprocess tests were crashing thread sanitizer; IIRC, the 
> issues in libprocess on Linux/amd64 are also "technical", but probably could 
> result in a very real problems on a different platform



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10196) The task program runs successfully but the task status is failed

2021-04-14 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321176#comment-17321176
 ] 

Charles Natali commented on MESOS-10196:


Hey [~934341445], sorry for the delay.

I know it's been a while, but in case it's still an issue, I think the next 
step would be to run the following command to see exactly what's going on - my 
guess is that the agent is maybe not starting the right executor or something 
like that:
{code}
strace -ttTf -p <agent pid> -o agent.strace
{code}

Then attach {{agent.strace}} together with the agent logs. It could also be 
useful to start the agent with the {{GLOG_v=9}} environment variable to get 
detailed logs.

>  The task program runs successfully but the task status is failed
> -
>
> Key: MESOS-10196
> URL: https://issues.apache.org/jira/browse/MESOS-10196
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.9.0, 1.10.0
> Environment: Ubuntu 16.04
> mesos master 1.10.0
> mesos slave 1.9.0
> python 3.7.3
>Reporter: clancyhuang
>Priority: Major
>
> When testing mesos to execute the task by default executor, I found that the 
> task status is failed but in fact the task was executed successfully.I tested 
> two shell scripts, one is very simple
> {code:sh}
> python -V > /root/test.txt
> {code}
> ,The other is a script about image processing.
>  I am sure they are all working properly, but I get an 
> error:REASON_EXECUTOR_TERMINATED.
>  The stderr of the task has no output, and the stdout is correct,the mesos 
> agent has such log output
> {code:bash}
> I1104 11:34:35.337236 35682 slave.cpp:3657] Launching container 
> 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef for executor 'default' of framework 
> d915071b-c275-4321-afd5-134b86ebadf3-0002
> I1104 11:34:35.337371 35685 containerizer.cpp:1396] Starting container 
> 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef
> I1104 11:34:35.337563 35685 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from STARTING to 
> PROVISIONING after 76800ns
> I1104 11:34:35.338893 35685 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from PROVISIONING to 
> PREPARING after 1.321216ms
> I1104 11:34:35.340224 35703 switchboard.cpp:316] Container logger module 
> finished preparing container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef; 
> IOSwitchboard server is not required
> I1104 11:34:35.341944 35707 linux_launcher.cpp:492] Launching container 
> 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef and cloning with namespaces
> I1104 11:34:35.346983 35704 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from PREPARING to ISOLATING 
> after 8.082944ms
> I1104 11:34:35.347719 35704 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from ISOLATING to FETCHING 
> after 730880ns
> I1104 11:34:35.348254 35737 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from FETCHING to RUNNING 
> after 539136ns
> I1104 11:34:58.060906 35680 slave.cpp:7406] Current disk usage 73.86%. Max 
> allowed age: 1.130070981247558days
> I1104 11:35:58.062266 35708 slave.cpp:7406] Current disk usage 73.86%. Max 
> allowed age: 1.129549109651991days
> I1104 11:36:58.062948 35741 slave.cpp:7406] Current disk usage 73.87%. Max 
> allowed age: 1.129005310066273days
> I1104 11:37:58.063513 35703 slave.cpp:7406] Current disk usage 73.88%. Max 
> allowed age: 1.128437717518472days
> I1104 11:38:30.242969 35740 containerizer.cpp:3161] Container 
> 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef has exited
> I1104 11:38:30.243052 35740 containerizer.cpp:2620] Destroying container 
> 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef in RUNNING state
> I1104 11:38:30.243072 35740 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from RUNNING to DESTROYING 
> after 3.9149140821mins
> I1104 11:38:30.243252 35672 linux_launcher.cpp:576] Asked to destroy 
> container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef
> I1104 11:38:30.243350 35672 linux_launcher.cpp:618] Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef'
> I1104 11:38:30.243768 35679 cgroups.cpp:2854] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef
> I1104 11:38:30.243961 35671 cgroups.cpp:1242] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef after 
> 110848ns
> I1104 11:38:30.244160 35683 cgroups.cpp:2872] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef
> I1104 11:38:30.244272 35683 cgroups.cpp:1271] Successfully thawed cgroup 
> 

[jira] [Commented] (MESOS-10131) Agent frequently dies with error "Cycle found in mount table hierarchy"

2021-04-14 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321152#comment-17321152
 ] 

Charles Natali commented on MESOS-10131:


I think this could possibly happen without a loop in {{/proc/PID/mountinfo}} 
because reading from {{/proc/PID/mountinfo}} isn't atomic - definitely not if 
it can't be read in a single {{read}} syscall, which is very likely the case 
here since it's larger than 30K.

That could explain why it happens randomly, especially if there are many 
short-lived tasks being started.
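
To illustrate - a userspace reader can only approximate a consistent snapshot, 
e.g. by re-reading until two consecutive passes agree. A minimal sketch of that 
mitigation (this is not what Mesos does today):
{code:cpp}
// Sketch only: a /proc/<pid>/mountinfo larger than a single read() can
// change between syscalls, so one pass may mix two different snapshots.
// A crude mitigation is to re-read until two consecutive passes match.
#include <fstream>
#include <sstream>
#include <string>
#include <utility>

std::string readStable(const std::string& path, int maxAttempts = 5)
{
  std::string previous;
  for (int i = 0; i < maxAttempts; ++i) {
    std::ifstream file(path);
    std::ostringstream buffer;
    buffer << file.rdbuf();
    std::string current = buffer.str();
    if (!current.empty() && current == previous) {
      return current;  // Two identical snapshots in a row.
    }
    previous = std::move(current);
  }
  return previous;  // Best effort after maxAttempts passes.
}
{code}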

 

Since it hasn't re-occurred and the potential fix would be far from trivial, 
it's probably time to close.

 

> Agent frequently dies with error "Cycle found in mount table hierarchy"
> ---
>
> Key: MESOS-10131
> URL: https://issues.apache.org/jira/browse/MESOS-10131
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, framework
>Affects Versions: 1.9.0
>Reporter: Thomas Plummer
>Assignee: Andrei Budnik
>Priority: Major
> Attachments: log.txt
>
>
> Our mesos agent frequently dies with the following error in the slave logs:
>  
> {code:java}
> F0509 22:10:33.036993 17723 fs.cpp:217] Check failed: 
> !visitedParents.contains(parentId) Cycle found in mount table hierarchy at 
> entry '1954': 
> 18 41 0:18 / /sys rw,nosuid,nodev,noexec,relatime shared:6 - sysfs sysfs 
> rw,seclabel
> 19 41 0:3 / /proc rw,nosuid,nodev,noexec,relatime shared:5 - proc proc rw
> 20 41 0:5 / /dev rw,nosuid shared:2 - devtmpfs devtmpfs 
> rw,seclabel,size=65852208k,nr_inodes=16463052,mode=755
> 21 18 0:17 / /sys/kernel/security rw,nosuid,nodev,noexec,relatime shared:7 - 
> securityfs securityfs rw
> 22 20 0:19 / /dev/shm rw,nosuid,nodev,noexec shared:3 - tmpfs tmpfs 
> rw,seclabel
> 23 20 0:12 / /dev/pts rw,nosuid,noexec,relatime shared:4 - devpts devpts 
> rw,seclabel,gid=5,mode=620,ptmxmode=000
> 24 41 0:20 / /run rw,nosuid,nodev shared:24 - tmpfs tmpfs rw,seclabel,mode=755
> 25 18 0:21 / /sys/fs/cgroup ro,nosuid,nodev,noexec shared:8 - tmpfs tmpfs 
> ro,seclabel,mode=755
> 26 25 0:22 / /sys/fs/cgroup/systemd rw,nosuid,nodev,noexec,relatime shared:9 
> - cgroup cgroup 
> rw,seclabel,xattr,release_agent=/usr/lib/systemd/systemd-cgroups-agent,name=systemd
> 27 18 0:23 / /sys/fs/pstore rw,nosuid,nodev,noexec,relatime shared:20 - 
> pstore pstore rw
> 28 18 0:24 / /sys/firmware/efi/efivars rw,nosuid,nodev,noexec,relatime 
> shared:21 - efivarfs efivarfs rw
> 29 25 0:25 / /sys/fs/cgroup/perf_event rw,nosuid,nodev,noexec,relatime 
> shared:10 - cgroup cgroup rw,seclabel,perf_event
> 30 25 0:26 / /sys/fs/cgroup/net_cls,net_prio rw,nosuid,nodev,noexec,relatime 
> shared:11 - cgroup cgroup rw,seclabel,net_prio,net_cls
> 31 25 0:27 / /sys/fs/cgroup/cpuset rw,nosuid,nodev,noexec,relatime shared:12 
> - cgroup cgroup rw,seclabel,cpuset
> 32 25 0:28 / /sys/fs/cgroup/blkio rw,nosuid,nodev,noexec,relatime shared:13 - 
> cgroup cgroup rw,seclabel,blkio
> 33 25 0:29 / /sys/fs/cgroup/freezer rw,nosuid,nodev,noexec,relatime shared:14 
> - cgroup cgroup rw,seclabel,freezer
> 34 25 0:30 / /sys/fs/cgroup/hugetlb rw,nosuid,nodev,noexec,relatime shared:15 
> - cgroup cgroup rw,seclabel,hugetlb
> 35 25 0:31 / /sys/fs/cgroup/devices rw,nosuid,nodev,noexec,relatime shared:16 
> - cgroup cgroup rw,seclabel,devices
> 36 25 0:32 / /sys/fs/cgroup/cpu,cpuacct rw,nosuid,nodev,noexec,relatime 
> shared:17 - cgroup cgroup rw,seclabel,cpuacct,cpu
> 37 25 0:33 / /sys/fs/cgroup/memory rw,nosuid,nodev,noexec,relatime shared:18 
> - cgroup cgroup rw,seclabel,memory
> 38 25 0:34 / /sys/fs/cgroup/pids rw,nosuid,nodev,noexec,relatime shared:19 - 
> cgroup cgroup rw,seclabel,pids
> 39 18 0:35 / /sys/kernel/config rw,relatime shared:22 - configfs configfs rw
> 41 0 253:0 / / rw,relatime shared:1 - xfs /dev/mapper/vg_system-root 
> rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
> 42 18 0:16 / /sys/fs/selinux rw,relatime shared:23 - selinuxfs selinuxfs rw
> 43 19 0:37 / /proc/sys/fs/binfmt_misc rw,relatime shared:25 - autofs 
> systemd-1 
> rw,fd=32,pgrp=1,timeout=0,minproto=5,maxproto=5,direct,pipe_ino=11414
> 44 18 0:6 / /sys/kernel/debug rw,relatime shared:26 - debugfs debugfs rw
> 45 20 0:15 / /dev/mqueue rw,relatime shared:27 - mqueue mqueue rw,seclabel
> 46 20 0:38 / /dev/hugepages rw,relatime shared:28 - hugetlbfs hugetlbfs 
> rw,seclabel
> 47 41 8:2 / /boot rw,relatime shared:29 - xfs /dev/sda2 
> rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
> 48 47 8:1 / /boot/efi rw,relatime shared:30 - vfat /dev/sda1 
> rw,fmask=0077,dmask=0077,codepage=437,iocharset=ascii,shortname=winnt,errors=remount-ro
> 49 41 253:2 / /var rw,relatime shared:31 - xfs /dev/mapper/vg_system-var 
> rw,seclabel,attr2,inode64,logbsize=256k,sunit=512,swidth=512,noquota
> 50 

[jira] [Commented] (MESOS-10216) Replicated log key encoding overflows into negative values

2021-04-14 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17321094#comment-17321094
 ] 

Charles Natali commented on MESOS-10216:


Yes, that'd be really interesting - from memory there's a once-in-a-blue-moon 
bug involving leveldb corruption which could potentially be explained by this.
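
For reference, the overflow described in the issue below comes down to 
formatting a {{uint64_t}} with {{%d}} and 10-digit padding. A hedged sketch of 
a correct fixed-width encoding (not the actual Mesos code):
{code:cpp}
// Sketch only: encode a uint64_t position as a zero-padded decimal
// string so that lexicographic order matches numeric order.  UINT64_MAX
// has 20 digits, so pad to 20 and use the unsigned 64-bit formatter.
#include <cinttypes>
#include <cstdio>
#include <string>

std::string encode(uint64_t position)
{
  char key[21];  // 20 digits plus NUL.
  snprintf(key, sizeof(key), "%020" PRIu64, position);
  return std::string(key);
}
{code}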

> Replicated log key encoding overflows into negative values
> --
>
> Key: MESOS-10216
> URL: https://issues.apache.org/jira/browse/MESOS-10216
> Project: Mesos
>  Issue Type: Bug
>  Components: replicated log
>Affects Versions: 1.11.0
>Reporter: Ilya
>Priority: Major
>
> LevelDB keys used by {{LevelDBStorage}} are {{uint64_t}} log positions 
> encoded as strings and padded with zeroes up to a certain fixed size. The 
> {{encode()}} function is incorrect because it uses the {{%d}} formatter that 
> expects an {{int}}. It also limits the key size to 10 digits which is OK for 
> {{UINT32_MAX}} but isn't enough for {{UINT64_MAX}}.
> Because of this the available key range is reduced, and key overflow can 
> result in replica's {{METADATA}} record (position 0) being overwritten, which 
> in turn may cause data loss.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10210) Bundled grpc doesn't compile with glibc 2.30+

2021-01-25 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271740#comment-17271740
 ] 

Charles Natali commented on MESOS-10210:


Merged by [~bmahler], so this can be closed.
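
For context, the clash described below is that glibc >= 2.30 declares its own 
{{gettid()}} in {{unistd.h}}, so grpc's project-local helper with the same name 
no longer builds. The cherry-picked upstream commits rename the helper, along 
the lines of this sketch (not the exact grpc patch):
{code:cpp}
// Sketch only: a project-local gettid() clashes with the declaration
// shipped by glibc >= 2.30; renaming the helper avoids the conflict.
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

static pid_t sys_gettid(void)  // was gettid(), which glibc now declares
{
  return static_cast<pid_t>(syscall(SYS_gettid));
}
{code}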

> Bundled grpc doesn't compile with glibc 2.30+
> -
>
> Key: MESOS-10210
> URL: https://issues.apache.org/jira/browse/MESOS-10210
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.11.0
>Reporter: Omer Ozarslan
>Priority: Minor
>
> Bundled grpc fails to link with glibc 2.31, since starting with 2.30 glibc 
> declares its own gettid function with the same signature. Cherry-picking two 
> commits from the two upstream PRs below fixes the issue:
>  * [https://github.com/grpc/grpc/pull/20048]
>  * [https://github.com/grpc/grpc/pull/18950]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10146) Removing task from slave when framework is disconnected causes master to crash

2021-01-25 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10146?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17271739#comment-17271739
 ] 

Charles Natali commented on MESOS-10146:


Looking at the 1.9.0 code I think I found what caused this; however, looking at 
master I believe it's been fixed by this commit: 
[https://github.com/apache/mesos/commit/6be17200b8084ad3524e7d450c411765b3214c0f]

for this issue: 
[https://issues.apache.org/jira/projects/MESOS/issues/MESOS-9609|https://issues.apache.org/jira/projects/MESOS/issues/MESOS-9609?filter=allissues]

 

So I think this can be closed as a duplicate of MESOS-9609.
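
To illustrate the failure mode - a hedged sketch with a stand-in Framework 
type, not the actual master.cpp code: a glog {{CHECK}} aborts the whole master 
when the framework has already been removed, whereas an explicit guard would 
just skip the cleanup.
{code:cpp}
#include <glog/logging.h>

struct Framework;  // stand-in for the real master-side type

void removeTasksChecked(Framework* framework)
{
  CHECK(framework != nullptr);  // aborts the master if it fires
  // ... remove the agent's tasks from the framework ...
}

void removeTasksTolerant(Framework* framework)
{
  if (framework == nullptr) {
    LOG(WARNING) << "Framework already removed; skipping task cleanup";
    return;
  }
  // ... remove the agent's tasks from the framework ...
}
{code}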

> Removing task from slave when framework is disconnected causes master to crash
> --
>
> Key: MESOS-10146
> URL: https://issues.apache.org/jira/browse/MESOS-10146
> Project: Mesos
>  Issue Type: Bug
>  Components: c++ api, framework
>Affects Versions: 1.9.0
> Environment: Mesos master with three master nodes
>Reporter: Naveen
>Priority: Blocker
>
> Hello, 
>     we want to report an issue we observed when removing tasks from a slave. 
> There is a condition that checks for a valid framework before tasks can be 
> removed. There can be several reasons a framework can be disconnected. This 
> check fails and crashes the mesos master node. 
> [https://github.com/apache/mesos/blob/1.9.0/src/master/master.cpp#L11842]
> There is also unguarded access to the internal framework state on line 11853.
> Error logs - 
> {noformat}
> mesos-master[5483]: I0618 14:05:20.859189 5491 master.cpp:9512] Marked agent 
> 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 (10.160.73.79) unreachable: health 
> check timed out
> mesos-master[5483]: F0618 14:05:20.859347 5491 master.cpp:11842] Check 
> failed: framework != nullptr Framework 
> 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067 not found while removing agent 
> 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303 at slave(1)@10.160.73.79:5051 
> (10.160.73.79); agent tasks: { 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-0067: { } 
> }
> mesos-master[5483]: *** Check failure stack trace: ***
> mesos-master[5483]: I0618 14:05:20.859781 5490 hierarchical.cpp:1013] Removed 
> all filters for agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
> mesos-master[5483]: I0618 14:05:20.872217 5490 hierarchical.cpp:890] Removed 
> agent 3c26f984-5adb-48f8-a656-3dfba1f9f0c1-S303
> mesos-master[5483]: I0618 14:05:20.859922 5487 replica.cpp:695] Replica 
> received learned notice for position 42070 from 
> log-network(1)@10.160.73.212:5050
> mesos-master[5483]: @ 0x7f2fdf6a5b1d google::LogMessage::Fail()
> mesos-master[5483]: @ 0x7f2fdf6a7dfd google::LogMessage::SendToLog()
> mesos-master[5483]: @ 0x7f2fdf6a56ab google::LogMessage::Flush()
> mesos-master[5483]: @ 0x7f2fdf6a8859 
> google::LogMessageFatal::~LogMessageFatal()
> mesos-master[5483]: @ 0x7f2fde2677f2 
> mesos::internal::master::Master::__removeSlave()
> mesos-master[5483]: @ 0x7f2fde267ebe 
> mesos::internal::master::Master::_markUnreachable()
> mesos-master[5483]: @ 0x7f2fde268215 
> _ZNO6lambda12CallableOnceIFN7process6FutureIbEEvEE10CallableFnINS_8internal7PartialIZN5mesos8internal6master6Master15markUnreachableERKNS9_9SlaveInfoEbRKSsEUlbE_JbclEv
> mesos-master[5483]: @ 0x7f2fddf30688 
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8internal8DispatchINS1_6FutureIbEEEclINS0_IFSC_vESC_RKNS1_4UPIDEOT_EUlSt10unique_ptrINS1_7PromiseIbEESt14default_deleteISO_EEOSG_S3_E_ISR_SG_St12_PlaceholderILi1EEclEOS3_
> mesos-master[5483]: @ 0x7f2fdf5e3b91 process::ProcessBase::consume()
> mesos-master[5483]: @ 0x7f2fdf608f77 process::ProcessManager::resume()
> mesos-master[5483]: @ 0x7f2fdf60cb36 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> mesos-master[5483]: @ 0x7f2fdf8c34d0 execute_native_thread_routine
> mesos-master[5483]: @ 0x7f2fdba02ea5 start_thread
> mesos-master[5483]: @ 0x7f2fdb20e8dd __clone
> systemd[1]: mesos-master.service: main process exited, code=killed, 
> status=6/ABRT
> systemd[1]: Unit mesos-master.service entered failed state.
> systemd[1]: mesos-master.service failed.
> systemd[1]: mesos-master.service holdoff time over, scheduling restart.
> systemd[1]: Stopped Mesos Master.
> systemd[1]: Started Mesos Master.
> mesos-master[28757]: I0618 14:05:41.461403 28748 logging.cpp:201] INFO level 
> logging started!
> mesos-master[28757]: I0618 14:05:41.461712 28748 main.cpp:243] Build: 
> 2020-05-09 10:42:00 by centos
> mesos-master[28757]: I0618 14:05:41.461721 28748 main.cpp:244] Version: 1.9.0
> mesos-master[28757]: I0618 14:05:41.461726 28748 main.cpp:247] Git tag: 1.9.0
> mesos-master[28757]: I0618 14:05:41.461730 28748 main.cpp:251] Git SHA: 
> 5e79a584e6ec3e9e2f96e8bf418411df9dafac2e{noformat}
>  



--

[jira] [Commented] (MESOS-10196) The task program runs successfully but the task status is failed

2020-11-18 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10196?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17235049#comment-17235049
 ] 

Charles Natali commented on MESOS-10196:


It's surprising there's no log of the executor registering and sending any task 
update - do you have the executor's log?

 

Also, how do you start the tasks - do you use your own framework code?

>  The task program runs successfully but the task status is failed
> -
>
> Key: MESOS-10196
> URL: https://issues.apache.org/jira/browse/MESOS-10196
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.9.0, 1.10.0
> Environment: Ubuntu 16.04
> mesos master 1.10.0
> mesos slave 1.9.0
> python 3.7.3
>Reporter: clancyhuang
>Priority: Major
>
> When testing mesos executing a task with the default executor, I found that 
> the task status is failed but in fact the task was executed successfully. I 
> tested two shell scripts; one is very simple:
> {code:sh}
> python -V > /root/test.txt
> {code}
> The other is a script about image processing.
>  I am sure they are all working properly, but I get an 
> error: REASON_EXECUTOR_TERMINATED.
>  The stderr of the task has no output, and the stdout is correct; the mesos 
> agent has the following log output:
> {code:bash}
> I1104 11:34:35.337236 35682 slave.cpp:3657] Launching container 
> 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef for executor 'default' of framework 
> d915071b-c275-4321-afd5-134b86ebadf3-0002
> I1104 11:34:35.337371 35685 containerizer.cpp:1396] Starting container 
> 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef
> I1104 11:34:35.337563 35685 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from STARTING to 
> PROVISIONING after 76800ns
> I1104 11:34:35.338893 35685 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from PROVISIONING to 
> PREPARING after 1.321216ms
> I1104 11:34:35.340224 35703 switchboard.cpp:316] Container logger module 
> finished preparing container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef; 
> IOSwitchboard server is not required
> I1104 11:34:35.341944 35707 linux_launcher.cpp:492] Launching container 
> 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef and cloning with namespaces
> I1104 11:34:35.346983 35704 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from PREPARING to ISOLATING 
> after 8.082944ms
> I1104 11:34:35.347719 35704 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from ISOLATING to FETCHING 
> after 730880ns
> I1104 11:34:35.348254 35737 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from FETCHING to RUNNING 
> after 539136ns
> I1104 11:34:58.060906 35680 slave.cpp:7406] Current disk usage 73.86%. Max 
> allowed age: 1.130070981247558days
> I1104 11:35:58.062266 35708 slave.cpp:7406] Current disk usage 73.86%. Max 
> allowed age: 1.129549109651991days
> I1104 11:36:58.062948 35741 slave.cpp:7406] Current disk usage 73.87%. Max 
> allowed age: 1.129005310066273days
> I1104 11:37:58.063513 35703 slave.cpp:7406] Current disk usage 73.88%. Max 
> allowed age: 1.128437717518472days
> I1104 11:38:30.242969 35740 containerizer.cpp:3161] Container 
> 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef has exited
> I1104 11:38:30.243052 35740 containerizer.cpp:2620] Destroying container 
> 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef in RUNNING state
> I1104 11:38:30.243072 35740 containerizer.cpp:3323] Transitioning the state 
> of container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef from RUNNING to DESTROYING 
> after 3.9149140821mins
> I1104 11:38:30.243252 35672 linux_launcher.cpp:576] Asked to destroy 
> container 65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef
> I1104 11:38:30.243350 35672 linux_launcher.cpp:618] Destroying cgroup 
> '/sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef'
> I1104 11:38:30.243768 35679 cgroups.cpp:2854] Freezing cgroup 
> /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef
> I1104 11:38:30.243961 35671 cgroups.cpp:1242] Successfully froze cgroup 
> /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef after 
> 110848ns
> I1104 11:38:30.244160 35683 cgroups.cpp:2872] Thawing cgroup 
> /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef
> I1104 11:38:30.244272 35683 cgroups.cpp:1271] Successfully thawed cgroup 
> /sys/fs/cgroup/freezer/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef after 
> 67840ns
> I1104 11:38:30.244668 35690 linux_launcher.cpp:650] Destroying cgroup 
> '/sys/fs/cgroup/systemd/mesos/65c98d6f-fcf8-4be4-89a8-7fe53b5c30ef'
> I1104 11:38:30.245975 35726 slave.cpp:6856] Executor 'default' of framework 
> 

[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.

2020-05-06 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17101162#comment-17101162
 ] 

Charles Natali commented on MESOS-8038:
---

The more I think about it, the more I think that the current behavior of 
optimistically releasing the resources is very sub-optimal.
We've had cgroup destruction fail for various reasons in our cluster:
 * kernel bugs - see https://issues.apache.org/jira/browse/MESOS-10107
 * tasks stuck in uninterruptible sleep, e.g. NFS I/O

When this happens, it triggers at least the following problems:
 * this issue with GPUs, which causes all subsequent tasks scheduled on the 
host that try to use the GPU to fail, effectively turning it into a black hole
 * another problem where some tasks stuck in uninterruptible sleep were still 
consuming memory, so the agent overcommitted memory, causing tasks to run OOM 
further down the line

 

"Leaking" CPU is mostly fine because it's a compressible resource and stuck 
tasks generally don't use it, but it's pretty bad for memory and GPU, causing 
errors which are hard to diagnose and automatically recover from.


> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, 
> mesos-slave.INFO.log, mesos_agent.log, start_short_tasks_gpu.py
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by executor getting killed and the tasks getting lost. This happens 
> even before the job starts. A little search in the code base points me to 
> something related to GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.

2020-04-23 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090969#comment-17090969
 ] 

Charles Natali commented on MESOS-8038:
---

See log attached ([^mesos_agent.log]).

See my interpretation below, keeping in mind that I'm not familiar with the 
code so I might be completely wrong :).

Before the error occurs, we can see the following warning in the agent log:

{noformat}
W0423 22:46:19.277667 20524 containerizer.cpp:2428] Ignoring update for 
currently being destroyed container 6f446173-2bba-4cc4-bc15-c956bc159d4e
{noformat}
 

Looking at the logs, we can see that compared to a successful run, the 
containerizer's update method is called while the container is being destroyed.

Example of a successful task - the slave receives the task status update, 
forwards it, and then sends back the acknowledgement, which causes the executor 
to exit and the container to be destroyed, i.e. after the task status 
update has been processed:
 
{noformat}
I0423 22:43:55.771867 20519 slave.cpp:5950] Handling status update 
TASK_FINISHED (Status UUID: e135ba0c-6dde-4cce-a1cd-42ea7ba86df5) for task 
task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework 
0142aec2-d0c1-4011-8340-d81107d40fce- from executor(1)@127.0.1.1:35471
I0423 22:43:55.787933 20518 memory.cpp:287] Updated 
'memory.soft_limit_in_bytes' to 32MB for container 
5a7984a6-cecb-40ea-843c-8ed28cd92330
I0423 22:43:55.788524 20523 cpu.cpp:94] Updated 'cpu.shares' to 102 (cpus 0.1) 
for container 5a7984a6-cecb-40ea-843c-8ed28cd92330
I0423 22:43:55.794132 20522 task_status_update_manager.cpp:328] Received task 
status update TASK_FINISHED (Status UUID: e135ba0c-6dde-4cce-a1cd-42ea7ba86df5) 
for task task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework 
0142aec2-d0c1-4011-8340-d81107d40fce-
I0423 22:43:55.794495 20522 task_status_update_manager.cpp:383] Forwarding task 
status update TASK_FINISHED (Status UUID: e135ba0c-6dde-4cce-a1cd-42ea7ba86df5) 
for task task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework 
0142aec2-d0c1-4011-8340-d81107d40fce- to the agent
I0423 22:43:55.795053 20522 slave.cpp:6496] Forwarding the update TASK_FINISHED 
(Status UUID: e135ba0c-6dde-4cce-a1cd-42ea7ba86df5) for task 
task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework 
0142aec2-d0c1-4011-8340-d81107d40fce- to master@127.0.0.1:5050
I0423 22:43:55.812129 20522 slave.cpp:6380] Task status update manager 
successfully handled status update TASK_FINISHED (Status UUID: 
e135ba0c-6dde-4cce-a1cd-42ea7ba86df5) for task 
task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework 
0142aec2-d0c1-4011-8340-d81107d40fce-
I0423 22:43:55.813238 20522 slave.cpp:6407] Sending acknowledgement for status 
update TASK_FINISHED (Status UUID: e135ba0c-6dde-4cce-a1cd-42ea7ba86df5) for 
task task-0032d63f-91be-4522-bd03-ae0656e79d40 of framework 
0142aec2-d0c1-4011-8340-d81107d40fce- to executor(1)@127.0.1.1:35471
[...]
I0423 22:43:57.005844 20521 slave.cpp:6676] Got exited event for 
executor(1)@127.0.1.1:35471
I0423 22:43:57.205157 20522 containerizer.cpp:3159] Container 
5a7984a6-cecb-40ea-843c-8ed28cd92330 has exited
I0423 22:43:57.205278 20522 containerizer.cpp:2623] Destroying container 
5a7984a6-cecb-40ea-843c-8ed28cd92330 in RUNNING state
I0423 22:43:57.205379 20522 containerizer.cpp:3321] Transitioning the state of 
container 5a7984a6-cecb-40ea-843c-8ed28cd92330 from RUNNING to DESTROYING after 
4.612041984secs
I0423 22:43:57.206100 20523 linux_launcher.cpp:564] Asked to destroy container 
5a7984a6-cecb-40ea-843c-8ed28cd92330
{noformat}
 

Now let's look at what happens when the task right before the one which fails 
with "Requested 1 gpus but only 0 available" finishes:
 
{noformat}
I0423 22:46:16.506460 20519 slave.cpp:5950] Handling status update 
TASK_FINISHED (Status UUID: 4b7a01c5-15af-47a3-b06b-5ed8f7d65405) for task 
task-650af3bd-3f5b-4e17-9d34-4642480b4da0 of framework 
0142aec2-d0c1-4011-8340-d81107d40fce- from executor(1)@127.0.1.1:36541
I0423 22:46:17.560580 20521 slave.cpp:6676] Got exited event for 
executor(1)@127.0.1.1:36541
I0423 22:46:18.701063 20523 linux_launcher.cpp:638] Destroying cgroup 
'/sys/fs/cgroup/systemd/mesos/8a4e52e5-eab6-43e7-8bd1-9b9248614e69'
I0423 22:46:19.236407 20525 slave.cpp:7076] Executor 
'task-376cdda6-760b-4d3b-ad7f-2d86916695a3' of framework 
0142aec2-d0c1-4011-8340-d81107d40fce- exited with status 0
I0423 22:46:19.237376 20525 slave.cpp:7187] Cleaning up executor 
'task-376cdda6-760b-4d3b-ad7f-2d86916695a3' of framework 
0142aec2-d0c1-4011-8340-d81107d40fce- at executor(1)@127.0.1.1:41227
I0423 22:46:19.241185 20522 gc.cpp:95] Scheduling 
'/tmp/mesos_agent/work/slaves/0142aec2-d0c1-4011-8340-d81107d40fce-S0/frameworks/0142aec2-d0c1-4011-8340-d81107d40fce-/executors/task-376cdda6-760b-4d3b-ad7f-2d86916695a3/runs/0c718138-d6f3-42d4-9de7-4dac7d518dc5'
 for
 gc 9.9959984512mins in the future
I0423 

[jira] [Comment Edited] (MESOS-8038) Launching GPU task sporadically fails.

2020-04-21 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089005#comment-17089005
 ] 

Charles Natali edited comment on MESOS-8038 at 4/21/20, 7:46 PM:
-

[~bmahler]

I have a way to reproduce it systematically, albeit very contrived: using 
syscall fault injection.

 

Basically I just continuously start tasks that allocate 1 GPU and just do 
"exit 0" (see attached python framework).

 

Then I run the following - injecting a few seconds' delay into all rmdir 
syscalls made by the agent:

 
{noformat}
# strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o 
/dev/null
{noformat}
 

After less than a minute, tasks start failing with this error:
{noformat}
Failed to launch container: Requested 1 gpus but only 0 available{noformat}
 

I'll try to see if I can find a simpler reproducer, but this seems to fail 
systematically for me.

 


was (Author: cf.natali):
[~bmahler]

I have a way to reproduce it systematically, albeit very contrived: using 
syscall fault injection.

 

Basically I just continuously start tasks allocating 1 GPU and just do "exit 0" 
(see attach python framework).

 

Then, I run the following  - inject a few seconds delay in all rmdir syscalls 
made by the agent:

 
{noformat}
# strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o 
/dev/null
{noformat}
 

After a few minutes, tasks start failing with this error:
{noformat}
Failed to launch container: Requested 1 gpus but only 0 available{noformat}
 

I'll try to see if I can find a simpler reproducer, but this to fail 
systematically for me.

 

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, 
> mesos-slave.INFO.log, start_short_tasks_gpu.py
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by executor getting killed and the tasks getting lost. This happens 
> even before the job starts. A little search in the code base points me to 
> something related to GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (MESOS-8038) Launching GPU task sporadically fails.

2020-04-21 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089005#comment-17089005
 ] 

Charles Natali edited comment on MESOS-8038 at 4/21/20, 7:32 PM:
-

[~bmahler]

I have a way to reproduce it systematically, albeit very contrived: using 
syscall fault injection.

 

Basically I just continuously start tasks allocating 1 GPU and just do "exit 0" 
(see attach python framework).

 

Then, I run the following  - inject a few seconds delay in all rmdir syscalls 
made by the agent:

 
{noformat}
# strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o 
/dev/null
{noformat}
 

After a few minutes, tasks start failing with this error:
{noformat}
Failed to launch container: Requested 1 gpus but only 0 available{noformat}
 

I'll try to see if I can find a simpler reproducer, but this to fail 
systematically for me.

 


was (Author: cf.natali):
[~bmahler]

I have a way to reproduce it systematically, albeit very contrived: using 
syscall fault injection.

 

Basically I just continuously start tasks while allocate 1 GPU and just do 
"exit 0" (see attach python framework).

 

Then, I run the following  - inject a few seconds delay in all rmdir syscalls 
made by the agent:

 
{noformat}
# strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o 
/dev/null
{noformat}
 

After a few minutes, tasks start failing with this error:

Failed to launch container: Requested 1 gpus but only 0 available

 

I'll try to see if I can find a simpler reproducer, but this to fail 
systematically for me.

 

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, 
> mesos-slave.INFO.log, start_short_tasks_gpu.py
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by executor getting killed and the tasks getting lost. This happens 
> even before the job starts. A little search in the code base points me to 
> something related to GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-8038) Launching GPU task sporadically fails.

2020-04-21 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-8038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17089005#comment-17089005
 ] 

Charles Natali commented on MESOS-8038:
---

[~bmahler]

I have a way to reproduce it systematically, albeit very contrived: using 
syscall fault injection.

 

Basically I just continuously start tasks while allocate 1 GPU and just do 
"exit 0" (see attach python framework).

 

Then, I run the following  - inject a few seconds delay in all rmdir syscalls 
made by the agent:

 
{noformat}
# strace -p $(pgrep -f mesos-agent) -f -e inject=rmdir:delay_enter=300 -o 
/dev/null
{noformat}
 

After a few minutes, tasks start failing with this error:

Failed to launch container: Requested 1 gpus but only 0 available

 

I'll try to see if I can find a simpler reproducer, but this to fail 
systematically for me.

 

> Launching GPU task sporadically fails.
> --
>
> Key: MESOS-8038
> URL: https://issues.apache.org/jira/browse/MESOS-8038
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, gpu
>Affects Versions: 1.4.0
>Reporter: Sai Teja Ranuva
>Assignee: Zhitao Li
>Priority: Critical
> Attachments: mesos-master.log, mesos-slave-with-issue-uber.txt, 
> mesos-slave.INFO.log, start_short_tasks_gpu.py
>
>
> I was running a job which uses GPUs. It runs fine most of the time. 
> But occasionally I see the following message in the mesos log.
> "Collect failed: Requested 1 but only 0 available"
> Followed by executor getting killed and the tasks getting lost. This happens 
> even before the job starts. A little search in the code base points me to 
> something related to GPU resource being the probable cause.
> There is no deterministic way that this can be reproduced. It happens 
> occasionally.
> I have attached the slave log for the issue.
> Using 1.4.0 Mesos Master and 1.4.0 Mesos Slave.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10119) failure to destroy container can cause the agent to "leak" a GPU

2020-04-21 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10119?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17088973#comment-17088973
 ] 

Charles Natali commented on MESOS-10119:


So for the good news: I couldn't reproduce it - it turned out to be a bug in 
one of our legacy systems which caused it to remove the agent's cgroups...

 

However, I did observe this particular failure as a consequence of the now-fixed 
https://issues.apache.org/jira/browse/MESOS-10107

 

> Marking as a duplicate of MESOS-8038.

 

Ah, let's close this one then.

 

> failure to destroy container can cause the agent to "leak" a GPU
> 
>
> Key: MESOS-10119
> URL: https://issues.apache.org/jira/browse/MESOS-10119
> Project: Mesos
>  Issue Type: Task
>  Components: agent, containerization
>Reporter: Charles Natali
>Priority: Major
>
> At work we hit the following problem:
>  # cgroup for a task using the GPU isolation failed to be destroyed on OOM
>  # the agent continued advertising the GPU as available
>  # all subsequent attempts to start tasks using a GPU fail with "Requested 1 
> gpus but only 0 available"
> Problem 1 looks like https://issues.apache.org/jira/browse/MESOS-9950 so can 
> be tackled separately; however, the fact that the agent basically leaks the 
> GPU is pretty bad, because it basically turns into /dev/null, failing all 
> subsequent tasks requesting a GPU.
>  
> See the logs:
>  
>  
> {noformat}
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874277 2138 
> memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874305 2138 
> memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874315 2138 
> memory.cpp:686] Failed to read 'memory.stat': No such file or directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874701 2136 
> memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874734 2136 
> memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or 
> directory
> Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874747 2136 
> memory.cpp:686] Failed to read 'memory.stat': No such file or directory
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.062358 2152 
> slave.cpp:6994] Termination of executor 
> 'task_0:067b0963-134f-a917-4503-89b6a2a630ac' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to clean up an 
> isolator when destroying container: Failed to destroy cgroups: Failed to get 
> nested cgroups: Failed to determine canonical path of 
> '/sys/fs/cgroup/memory/mesos/8ef00748-b640-4620-97dc-f719e9775e88': No such 
> file or directory
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063295 2150 
> containerizer.cpp:2567] Skipping status for container 
> 8ef00748-b640-4620-97dc-f719e9775e88 because: Container does not exist
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063429 2137 
> containerizer.cpp:2428] Ignoring update for currently being destroyed 
> container 8ef00748-b640-4620-97dc-f719e9775e88
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.079169 2150 
> slave.cpp:6994] Termination of executor 
> 'task_1:a00165a1-123a-db09-6b1a-b6c4054b0acd' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to kill all 
> processes in the container: Failed to remove cgroup 
> 'mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Failed to remove cgroup 
> '/sys/fs/cgroup/freezer/mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Device 
> or resource busy
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079537 2140 
> containerizer.cpp:2567] Skipping status for container 
> 5c1418f0-1d4d-47cd-a188-0f4b87e394f2 because: Container does not exist
> Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079670 2136 
> containerizer.cpp:2428] Ignoring update for currently being destroyed 
> container 5c1418f0-1d4d-47cd-a188-0f4b87e394f2
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.956969 2136 
> slave.cpp:6889] Container '87253521-8d39-47ea-b4d1-febe527d230c' for executor 
> 'task_2:8b129d24-70d2-2cab-b2df-c73911954ec3' of framework 
> c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed to start: Requested 1 gpus 
> but only 0 available
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.957670 2149 
> memory.cpp:637] Listening on OOM events failed for container 
> 87253521-8d39-47ea-b4d1-febe527d230c: Event listener is terminating
> Apr 17 17:00:07 engpuc006 mesos-slave[2068]: W0417 17:00:07.966552 2150 
> containerizer.cpp:2421] Ignoring 

[jira] [Created] (MESOS-10119) failure to destroy container can cause the agent to "leak" a GPU

2020-04-18 Thread Charles Natali (Jira)
Charles Natali created MESOS-10119:
--

 Summary: failure to destroy container can cause the agent to 
"leak" a GPU
 Key: MESOS-10119
 URL: https://issues.apache.org/jira/browse/MESOS-10119
 Project: Mesos
  Issue Type: Task
  Components: agent, containerization
Reporter: Charles Natali


At work we hit the following problem:
 # cgroup for a task using the GPU isolation failed to be destroyed on OOM
 # the agent continued advertising the GPU as available
 # all subsequent attempts to start tasks using a GPU fail with "Requested 1 
gpus but only 0 available"

Problem 1 looks like https://issues.apache.org/jira/browse/MESOS-9950 so can 
be tackled separately; however, the fact that the agent basically leaks the GPU 
is pretty bad, because it basically turns into /dev/null, failing all 
subsequent tasks requesting a GPU.

See the logs:

{noformat}
Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874277 2138 
memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or 
directory
Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874305 2138 
memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or 
directory
Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874315 2138 
memory.cpp:686] Failed to read 'memory.stat': No such file or directory
Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874701 2136 
memory.cpp:665] Failed to read 'memory.limit_in_bytes': No such file or 
directory
Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874734 2136 
memory.cpp:674] Failed to read 'memory.max_usage_in_bytes': No such file or 
directory
Apr 17 17:00:03 engpuc006 mesos-slave[2068]: E0417 17:00:03.874747 2136 
memory.cpp:686] Failed to read 'memory.stat': No such file or directory
Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.062358 2152 
slave.cpp:6994] Termination of executor 
'task_0:067b0963-134f-a917-4503-89b6a2a630ac' of framework 
c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to clean up an 
isolator when destroying container: Failed to destroy cgroups: Failed to get 
nested cgroups: Failed to determine canonical path of 
'/sys/fs/cgroup/memory/mesos/8ef00748-b640-4620-97dc-f719e9775e88': No such 
file or directory
Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063295 2150 
containerizer.cpp:2567] Skipping status for container 
8ef00748-b640-4620-97dc-f719e9775e88 because: Container does not exist
Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.063429 2137 
containerizer.cpp:2428] Ignoring update for currently being destroyed container 
8ef00748-b640-4620-97dc-f719e9775e88
Apr 17 17:00:05 engpuc006 mesos-slave[2068]: E0417 17:00:05.079169 2150 
slave.cpp:6994] Termination of executor 
'task_1:a00165a1-123a-db09-6b1a-b6c4054b0acd' of framework 
c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed: Failed to kill all processes 
in the container: Failed to remove cgroup 
'mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Failed to remove cgroup 
'/sys/fs/cgroup/freezer/mesos/5c1418f0-1d4d-47cd-a188-0f4b87e394f2': Device or 
resource busy
Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079537 2140 
containerizer.cpp:2567] Skipping status for container 
5c1418f0-1d4d-47cd-a188-0f4b87e394f2 because: Container does not exist
Apr 17 17:00:05 engpuc006 mesos-slave[2068]: W0417 17:00:05.079670 2136 
containerizer.cpp:2428] Ignoring update for currently being destroyed container 
5c1418f0-1d4d-47cd-a188-0f4b87e394f2
Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.956969 2136 
slave.cpp:6889] Container '87253521-8d39-47ea-b4d1-febe527d230c' for executor 
'task_2:8b129d24-70d2-2cab-b2df-c73911954ec3' of framework 
c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed to start: Requested 1 gpus but 
only 0 available
Apr 17 17:00:07 engpuc006 mesos-slave[2068]: E0417 17:00:07.957670 2149 
memory.cpp:637] Listening on OOM events failed for container 
87253521-8d39-47ea-b4d1-febe527d230c: Event listener is terminating
Apr 17 17:00:07 engpuc006 mesos-slave[2068]: W0417 17:00:07.966552 2150 
containerizer.cpp:2421] Ignoring update for unknown container 
87253521-8d39-47ea-b4d1-febe527d230c
Apr 17 17:00:08 engpuc006 mesos-slave[2068]: W0417 17:00:08.109067 2154 
process.cpp:1480] Failed to link to '172.16.22.201:34059', connect: Failed 
connect: connection closed
Apr 17 17:00:10 engpuc006 mesos-slave[2068]: E0417 17:00:10.310817 2141 
slave.cpp:6889] Container '257b45f1-8582-4cb5-8138-454e9697bfe4' for executor 
'task_3:6bdd99ca-7a2b-f19c-bbb3-d9478fe8f81e' of framework 
c0c4ce82-5cff-4116-aacb-c3fd6a93d61b- failed to start: Requested 1 gpus but 
only 0 available
Apr 17 17:00:10 engpuc006 mesos-slave[2068]: E0417 17:00:10.311614 2141 
memory.cpp:637] Listening on OOM events failed for container 
257b45f1-8582-4cb5-8138-454e9697bfe4: Event listener is terminating

[jira] [Commented] (MESOS-10110) Libprocess ignores most protobuf (de)serialisation failure cases.

2020-04-07 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17077633#comment-17077633
 ] 

Charles Natali commented on MESOS-10110:


Hey.

 

It's probably my fault - I just created another account - this one, "cf.natali".

That's because my previous account "charle" obviously contained a typo and also 
didn't match the username I used for https://reviews.apache.org/, and 
apparently Jira doesn't support changing usernames.

 

Hope I didn't make too much of a mess!

> Libprocess ignores most protobuf (de)serialisation failure cases.
> -
>
> Key: MESOS-10110
> URL: https://issues.apache.org/jira/browse/MESOS-10110
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Charles
>Priority: Major
>
> Previously the code didn't check the return value of
>  {{Message::SerializeToString}} at all, which can fail for various reasons,
>  e.g. out-of-memory, message too large, or invalid UTF-8 string.
>  Also, the way deserialisation was checked for errors using
>  {{Message::IsInitialized}} doesn't detect errors such as the above;
>  we need to check the {{Message::ParseFromString}} return value.
> We noticed this at work because our custom executor had a bug causing it to 
> send an invalid/non-UTF8 {{mesos.TaskID}}, which was successfully serialised 
> by the executor (driver) and deserialised by the framework, causing it to 
> blow up at a later point far from the original source of the problem.
> More generally we want to catch such invalid messages - which can happen for 
> a variety of reasons - as early as possible.
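
For illustration, the checks described above amount to something like the 
following - a minimal sketch assuming a generated protobuf type 
{{mesos::TaskID}}, not the actual libprocess code:
{code:cpp}
// Sketch only: both directions can fail and both return values must be
// checked - SerializeToString() e.g. on out-of-memory or an invalid
// UTF-8 string field, ParseFromString() on corrupt or truncated bytes.
#include <string>

#include <mesos/mesos.pb.h>  // assumed location of the generated mesos::TaskID

bool roundTrip(const mesos::TaskID& in, mesos::TaskID* out)
{
  std::string data;
  if (!in.SerializeToString(&data)) {
    return false;  // serialisation failed
  }
  if (!out->ParseFromString(data)) {
    return false;  // deserialisation failed
  }
  return true;
}
{code}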



--
This message was sent by Atlassian Jira
(v8.3.4#803005)