[jira] [Commented] (MESOS-10226) test suite hangs on ARM64

2021-08-02 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17391736#comment-17391736
 ] 

Charles Natali commented on MESOS-10226:


Hm, it's annoying - the gdb backtrace you posted shows that the regtest gets 
stuck in this test, but for some reason running this test on its own isn't 
enough to reproduce the hang.
It's going to be very difficult to debug without being able to run the tests 
myself.

> test suite hangs on ARM64
> -
>
> Key: MESOS-10226
> URL: https://issues.apache.org/jira/browse/MESOS-10226
> Project: Mesos
>  Issue Type: Bug
>Reporter: Charles Natali
>Assignee: Charles Natali
>Priority: Major
> Attachments: gdb-thread-apply-bt-all-29.07.2021-2.txt, 
> gdb-thread-apply-bt-all-29.07.2021.txt
>
>
> Reported by [~mgrigorov].
>  
> {noformat}
> [ RUN      ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace
> sh: 1: hadoop: not found
> Marked '/' as rslave
> I0726 11:59:17.812630    32 exec.cpp:164] Version: 1.12.0
> I0726 11:59:17.827512    31 exec.cpp:237] Executor registered on agent 
> 9076f44b-846d-4f00-a2dc-11f694cc1900-S0
> I0726 11:59:17.830999    36 executor.cpp:190] Received SUBSCRIBED event
> I0726 11:59:17.832351    36 executor.cpp:194] Subscribed executor on 
> martin-arm64
> I0726 11:59:17.832775    36 executor.cpp:190] Received LAUNCH event
> I0726 11:59:17.834415    36 executor.cpp:722] Starting task 
> d1bbb266-bee7-4c9d-929f-16aa41f4e9cf
> I0726 11:59:17.839910    36 executor.cpp:740] Forked command at 38
> Preparing rootfs at 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791'
> Changing root to 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791
> Failed to execute 'sh': Exec format error
> I0726 11:59:18.113488    33 executor.cpp:1041] Command exited with status 1 
> (pid: 38)
> ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:: 
> Failure
> Mock function called more times than expected - returning directly.
>     Function call: statusUpdate(0xc28527f0, @0xa2cf3a60 136-byte 
> object <08-05 6C-B6 FF-FF 00-00 00-00 00-00 00-00 00-00 BE-A8 00-00 00-00 
> 00-00 A8-F6 C0-B6 FF-FF 00-00 D0-04 05-94 FF-FF 00-00 A0-E6 04-94 FF-FF 00-00 
> A0-F1 05-94 FF-FF 00-00 60-78 04-94 FF-FF 00-00 ... 00-00 00-00 00-00 00-00 
> 20-BD 01-78 FF-FF 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 20-5D 87-61 A5-3F D8-41 00-00 00-00 02-00 00-00 00-00 00-00 
> 03-00 00-00>)
>          Expected: to be called twice
>            Actual: called 3 times - over-saturated and active
> I0726 11:59:19.117401    37 process.cpp:935] Stopped the socket accept 
> loop{noformat}
>  
> I asked him to provide a gdb traceback and we can see the following:
>  
> {noformat}
> Thread 1 (Thread 0xa3bc2c60 (LWP 173475)):
> #0 0xa518db20 in __libc_open64 (file=0xaaab00f342e0 "/tmp/7VXP3w/pipe", oflag=<optimized out>) at ../sysdeps/unix/sysv/linux/open64.c:48
> #1 0xa513adb0 in __GI__IO_file_open (fp=fp@entry=0xaaab00e439a0, filename=<optimized out>, posix_mode=<optimized out>, prot=prot@entry=438, read_write=8, is32not64=<optimized out>) at fileops.c:189
> #2 0xa513b0b0 in _IO_new_file_fopen (fp=fp@entry=0xaaab00e439a0, filename=filename@entry=0xaaab00f342e0 "/tmp/7VXP3w/pipe", mode=<optimized out>, mode@entry=0xd762f3c8 "r", is32not64=is32not64@entry=1) at fileops.c:281
> #3 0xa512e0dc in __fopen_internal (filename=0xaaab00f342e0 "/tmp/7VXP3w/pipe", mode=0xd762f3c8 "r", is32=1) at iofopen.c:75
> #4 0xd54f5350 in os::read (path="/tmp/7VXP3w/pipe") at ../../3rdparty/stout/include/stout/os/read.hpp:136
> #5 0xd74f1c1c in mesos::internal::tests::NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_Test::TestBody (this=0xaaab00f88f50) at ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:1126
> {noformat}
>  
>  
> Basically the test uses a named pipe to synchronize with the task being 
> started, and if the task fails to start - in this case because we're trying 
> to launch an x86 container on an arm64 host - the test will just hang reading 
> from the pipe.
> I sent Martin a tentative fix for him to test, and I'll open an MR if it is 
> successful.
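
The hang described above comes from opening a FIFO for reading, which blocks until some process opens it for writing. Below is a minimal sketch of one way a test could bound that wait, assuming Linux FIFO semantics. `makeFifo` and `readFifoWithTimeout` are hypothetical names used for illustration only; they are not part of Mesos or stout, and this is not necessarily the fix that was sent.

```cpp
#include <cassert>
#include <cerrno>
#include <chrono>
#include <string>

#include <fcntl.h>
#include <poll.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical helper: creates the FIFO that a test and its task would use
// to synchronize.
bool makeFifo(const std::string& path)
{
  return ::mkfifo(path.c_str(), 0600) == 0;
}

// Hypothetical helper: reads everything a writer sends through the FIFO at
// `path`. Returns false if no writer has written and closed the FIFO before
// `timeout` elapses (or on error), instead of blocking forever.
bool readFifoWithTimeout(
    const std::string& path,
    std::chrono::milliseconds timeout,
    std::string* out)
{
  assert(timeout.count() >= 0);

  // O_NONBLOCK lets open() return even if no process has opened the FIFO
  // for writing; a plain blocking open() would wait here indefinitely.
  const int fd = ::open(path.c_str(), O_RDONLY | O_NONBLOCK);
  if (fd < 0) {
    return false;
  }

  std::string result;
  const auto deadline = std::chrono::steady_clock::now() + timeout;

  while (true) {
    const auto remaining =
      std::chrono::duration_cast<std::chrono::milliseconds>(
          deadline - std::chrono::steady_clock::now());

    if (remaining.count() <= 0) {
      ::close(fd);
      return false; // Timed out: the writer never finished.
    }

    pollfd pfd{fd, POLLIN, 0};
    const int ready = ::poll(&pfd, 1, static_cast<int>(remaining.count()));
    if (ready < 0 && errno != EINTR) {
      ::close(fd);
      return false;
    }
    if (ready <= 0) {
      continue; // Interrupted or timed out; the deadline is re-checked above.
    }

    char buffer[256];
    const ssize_t n = ::read(fd, buffer, sizeof(buffer));
    if (n > 0) {
      result.append(buffer, static_cast<size_t>(n));
    } else if (n == 0) {
      // End of file: the writer closed its end.
      ::close(fd);
      *out = result;
      return true;
    } else if (errno != EAGAIN && errno != EINTR) {
      ::close(fd);
      return false;
    }
  }
}
```

With something like this, a task that dies before writing to the pipe would turn into a test failure after the timeout rather than a hang.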



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10226) test suite hangs on ARM64

2021-07-29 Thread Martin Tzvetanov Grigorov (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390279#comment-17390279
 ] 

Martin Tzvetanov Grigorov commented on MESOS-10226:
---

`sudo ./bin/mesos-tests.sh 
--gtest_filter=*ProvisionerDockerTest.*ROOT_INTERNET_CURL_SimpleCommand* 
--verbose` didn't hang but failed with:

 
{code:java}
...
7-b49a-765ac4cd1729/backends/overlay/rootfses/ba3ccd6c-bacf-4d88-a4fc-5104ca45d19e'
 for container a6a3e1b5-4322-4b07-b49a-765ac4cd1729
I0730 05:32:29.134213 2249744 master.cpp:1149] Master terminating
I0730 05:32:29.134589 2249739 hierarchical.cpp:1232] Removed all filters for 
agent 4c43d934-41d8-4159-9b03-2dfdeee3f386-S0
I0730 05:32:29.134629 2249739 hierarchical.cpp:1108] Removed agent 
4c43d934-41d8-4159-9b03-2dfdeee3f386-S0
[  FAILED  ] ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/4, where GetParam() = "quay.io/coreos/alpine-sh" (3751 ms)
[----------] 5 tests from ContainerImage/ProvisionerDockerTest (38953 ms total)

[----------] Global test environment tear-down
[==========] 5 tests from 1 test case ran. (38966 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 5 tests, listed below:
[  FAILED  ] ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/0, where GetParam() = "alpine"
[  FAILED  ] ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/1, where GetParam() = "library/alpine"
[  FAILED  ] ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/2, where GetParam() = "gcr.io/google-containers/busybox:1.24"
[  FAILED  ] ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/3, where GetParam() = "gcr.io/google-containers/busybox:1.27"
[  FAILED  ] ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/4, where GetParam() = "quay.io/coreos/alpine-sh"

 5 FAILED TESTS
I0730 05:32:29.176168 2249746 process.cpp:935] Stopped the socket accept loop
 {code}
 


[jira] [Commented] (MESOS-10226) test suite hangs on ARM64

2021-07-29 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390152#comment-17390152
 ] 

Charles Natali commented on MESOS-10226:


Hm, I can't reproduce it.

I updated the test to run the arm64 alpine image so that it fails the same 
way it should be failing for you, and it doesn't hang, it just fails:

```

# ./bin/mesos-tests.sh 
--gtest_filter=*ProvisionerDockerTest.*ROOT_INTERNET_CURL_SimpleCommand*

[ RUN ] ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/0
sh: 1: hadoop: not found
Marked '/' as rslave
I0729 21:40:16.121507 434157 exec.cpp:164] Version: 1.12.0
I0729 21:40:16.136072 434156 exec.cpp:237] Executor registered on agent 
48863f87-f283-42ab-bd93-f301fdfbd73b-S0
I0729 21:40:16.139089 434154 executor.cpp:190] Received SUBSCRIBED event
I0729 21:40:16.139974 434154 executor.cpp:194] Subscribed executor on thinkpad
I0729 21:40:16.140264 434154 executor.cpp:190] Received LAUNCH event
I0729 21:40:16.141703 434154 executor.cpp:722] Starting task 
1461a266-1ead-4bdf-9165-9c0f6c5938b8
I0729 21:40:16.147071 434154 executor.cpp:740] Forked command at 434163
Preparing rootfs at 
'/tmp/ContainerImage_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_0_GxQGxF/provisioner/containers/77c499a5-6d34-46aa-86a4-e993d53aa56a/backends/overlay/rootfses/629e6501-86d4-447e-bf17-412cd1cb6634'
Changing root to 
/tmp/ContainerImage_ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_0_GxQGxF/provisioner/containers/77c499a5-6d34-46aa-86a4-e993d53aa56a/backends/overlay/rootfses/629e6501-86d4-447e-bf17-412cd1cb6634
Failed to execute '/bin/ls': Exec format error
I0729 21:40:16.321754 434155 executor.cpp:1041] Command exited with status 1 
(pid: 434163)
../../src/tests/containerizer/provisioner_docker_tests.cpp:785: Failure
 Expected: TASK_FINISHED
To be equal to: statusFinished->state()
 Which is: TASK_FAILED
I0729 21:40:16.333557 434157 exec.cpp:478] Executor asked to shutdown
I0729 21:40:16.334996 434158 executor.cpp:190] Received SHUTDOWN event
I0729 21:40:16.335037 434158 executor.cpp:843] Shutting down
[ FAILED ] 
ContainerImage/ProvisionerDockerTest.ROOT_INTERNET_CURL_SimpleCommand/0, where 
GetParam() = "arm64v8/alpine" (5851 ms)

```

 

Could you try running

```

./bin/mesos-tests.sh 
--gtest_filter=*ProvisionerDockerTest.*ROOT_INTERNET_CURL_SimpleCommand* 
--verbose

```

 

And see if it hangs, and post the result?

 

Worst case we could just ignore the hang and update the test to use the arm64 
image so it passes, but I'd like to understand why it hangs.
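
As a rough illustration of the "use the arm64 image" idea, a test could pick its image name from the host architecture. This is only a sketch: `hostMachine` and `testImageFor` are hypothetical helpers, not existing Mesos test utilities, and the mapping simply reuses the `arm64v8/alpine` and `alpine` names that appear in the test output above.

```cpp
#include <cassert>
#include <string>

#include <sys/utsname.h>

// Returns the machine name reported by uname(2), for example "x86_64" or
// "aarch64", or an empty string if uname() fails.
std::string hostMachine()
{
  struct utsname info;
  if (::uname(&info) != 0) {
    return "";
  }
  return info.machine;
}

// Hypothetical mapping from a machine name to a Docker image that can
// actually be executed on it. Callers are expected to pass a non-empty name.
std::string testImageFor(const std::string& machine)
{
  assert(!machine.empty());

  if (machine == "aarch64" || machine == "arm64") {
    return "arm64v8/alpine";
  }
  return "alpine";
}
```

A parameterized test could then use `testImageFor(hostMachine())` instead of a hard-coded image name, which would avoid the `Exec format error` but, as noted, would not explain the hang.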


[jira] [Commented] (MESOS-10226) test suite hangs on ARM64

2021-07-29 Thread Martin Tzvetanov Grigorov (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390106#comment-17390106
 ] 

Martin Tzvetanov Grigorov commented on MESOS-10226:
---

Hi [~cf.natali] !

It still hangs; it has been stuck for 6 hours now.

This is the new thread dump - [^gdb-thread-apply-bt-all-29.07.2021-2.txt]



[jira] [Commented] (MESOS-10226) test suite hangs on ARM64

2021-07-29 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390058#comment-17390058
 ] 

Charles Natali commented on MESOS-10226:


[~mgrigorov] Looking at the code corresponding to the backtrace, I don't think 
it should hang forever, but only for up to 10 minutes:

 
{noformat}
#13 0xb7ca1418 in AwaitAssertReady 
(expr=0xba1c1d58 "statusStarting", actual=..., duration=...) at 
../../3rdparty/libprocess/include/process/gtest.hpp:126
#14 0xb97c588c in 
mesos::internal::tests::ProvisionerDockerTest_ROOT_INTERNET_CURL_SimpleCommand_Test::TestBody
 (this=0xcd4207a0) at 
../../src/tests/containerizer/provisioner_docker_tests.cpp:782
{noformat}
 

 
{noformat}
AWAIT_READY_FOR(statusStarting, Minutes(10));
{noformat}
 

Are you sure it was stuck indefinitely and not just taking a long time?
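
To make the distinction between "stuck indefinitely" and "waiting up to a deadline" concrete, here is a small sketch of a bounded wait. It uses `std::future` from the standard library as a stand-in, not libprocess's `process::Future`, and `awaitReadyFor` is a hypothetical name, so it only approximates what `AWAIT_READY_FOR` does.

```cpp
#include <cassert>
#include <chrono>
#include <future>

// Hypothetical stand-in for a bounded await: returns true if `future`
// becomes ready within `timeout`, and false once the timeout elapses.
template <typename T>
bool awaitReadyFor(
    const std::future<T>& future,
    std::chrono::milliseconds timeout)
{
  assert(timeout.count() >= 0);

  // wait_for() blocks for at most `timeout`, unlike wait() or get(), which
  // would block until a value is set.
  return future.wait_for(timeout) == std::future_status::ready;
}
```

A wait of this kind can make a test look hung for the full duration (10 minutes in the line above), but it should eventually return and let the test fail.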


[jira] [Commented] (MESOS-10226) test suite hangs on ARM64

2021-07-29 Thread Charles Natali (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390055#comment-17390055
 ] 

Charles Natali commented on MESOS-10226:


Thanks, I'll have a look - I hope there won't be too many hanging tests...



[jira] [Commented] (MESOS-10226) test suite hangs on ARM64

2021-07-29 Thread Martin Tzvetanov Grigorov (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17389882#comment-17389882
 ] 

Martin Tzvetanov Grigorov commented on MESOS-10226:
---

Attached [^gdb-thread-apply-bt-all-29.07.2021.txt]

> test suite hangs on ARM64
> -
>
> Key: MESOS-10226
> URL: https://issues.apache.org/jira/browse/MESOS-10226
> Project: Mesos
>  Issue Type: Bug
>Reporter: Charles Natali
>Assignee: Charles Natali
>Priority: Major
> Attachments: gdb-thread-apply-bt-all-29.07.2021.txt
>
>
> Reported by [~mgrigorov].
>  
> {noformat}
> [ RUN      ] 
> NestedMesosContainerizerTest.ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace
> sh: 1: hadoop: not found
> Marked '/' as rslave
> I0726 11:59:17.812630    32 exec.cpp:164] Version: 1.12.0
> I0726 11:59:17.827512    31 exec.cpp:237] Executor registered on agent 
> 9076f44b-846d-4f00-a2dc-11f694cc1900-S0
> I0726 11:59:17.830999    36 executor.cpp:190] Received SUBSCRIBED event
> I0726 11:59:17.832351    36 executor.cpp:194] Subscribed executor on 
> martin-arm64
> I0726 11:59:17.832775    36 executor.cpp:190] Received LAUNCH event
> I0726 11:59:17.834415    36 executor.cpp:722] Starting task 
> d1bbb266-bee7-4c9d-929f-16aa41f4e9cf
> I0726 11:59:17.839910    36 executor.cpp:740] Forked command at 38
> Preparing rootfs at 
> '/tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791'
> Changing root to 
> /tmp/NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_1bL0mz/provisioner/containers/e8553a7c-145d-47a4-afd6-3a6cf326cd48/backends/overlay/rootfses/6a62b0ce-df7b-4bab-bf7c-633d9f860791
> Failed to execute 'sh': Exec format error
> I0726 11:59:18.113488    33 executor.cpp:1041] Command exited with status 1 
> (pid: 38)
> ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:: 
> Failure
> Mock function called more times than expected - returning directly.
>     Function call: statusUpdate(0xc28527f0, @0xa2cf3a60 136-byte 
> object <08-05 6C-B6 FF-FF 00-00 00-00 00-00 00-00 00-00 BE-A8 00-00 00-00 
> 00-00 A8-F6 C0-B6 FF-FF 00-00 D0-04 05-94 FF-FF 00-00 A0-E6 04-94 FF-FF 00-00 
> A0-F1 05-94 FF-FF 00-00 60-78 04-94 FF-FF 00-00 ... 00-00 00-00 00-00 00-00 
> 20-BD 01-78 FF-FF 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 20-5D 87-61 A5-3F D8-41 00-00 00-00 02-00 00-00 00-00 00-00 
> 03-00 00-00>)
>          Expected: to be called twice
>            Actual: called 3 times - over-saturated and active
> I0726 11:59:19.117401    37 process.cpp:935] Stopped the socket accept 
> loop{noformat}
>  
> I asked him to provide a gdb traceback and we can see the following:
>  
> {noformat}
> Thread 1 (Thread 0xa3bc2c60 (LWP 173475)):
> #0 0xa518db20 in __libc_open64 (file=0xaaab00f342e0 "/tmp/7VXP3w/pipe", oflag=<optimized out>) at ../sysdeps/unix/sysv/linux/open64.c:48
> #1 0xa513adb0 in __GI__IO_file_open (fp=fp@entry=0xaaab00e439a0, filename=<optimized out>, posix_mode=<optimized out>, prot=prot@entry=438, read_write=8, is32not64=<optimized out>) at fileops.c:189
> #2 0xa513b0b0 in _IO_new_file_fopen (fp=fp@entry=0xaaab00e439a0, filename=filename@entry=0xaaab00f342e0 "/tmp/7VXP3w/pipe", mode=<optimized out>, mode@entry=0xd762f3c8 "r", is32not64=is32not64@entry=1) at fileops.c:281
> #3 0xa512e0dc in __fopen_internal (filename=0xaaab00f342e0 "/tmp/7VXP3w/pipe", mode=0xd762f3c8 "r", is32=1) at iofopen.c:75
> #4 0xd54f5350 in os::read (path="/tmp/7VXP3w/pipe") at ../../3rdparty/stout/include/stout/os/read.hpp:136
> #5 0xd74f1c1c in mesos::internal::tests::NestedMesosContainerizerTest_ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace_Test::TestBody (this=0xaaab00f88f50) at ../../src/tests/containerizer/nested_mesos_containerizer_tests.cpp:1126
> {noformat}
>  
>  
> Basically the test uses a named pipe to synchronize with the task being 
> started, and if the task fails to start - in this case because we're trying 
> to launch an x86 container on an arm64 host - the test will just hang reading 
> from the pipe.
> I sent Martin a tentative fix for him to test, and I'll open an MR if 
> successful.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (MESOS-10226) test suite hangs on ARM64

2021-07-29 Thread Martin Tzvetanov Grigorov (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17389867#comment-17389867
 ] 

Martin Tzvetanov Grigorov commented on MESOS-10226:
---

The test properly fails now:
 
[--] Global test environment tear-down
[==] 34 tests from 2 test cases ran. (66593 ms total)
[  PASSED  ] 33 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] NestedMesosContainerizerTest.ROOT_CGROUPS_INTERNET_CURL_LaunchNestedDebugCheckMntNamespace
 
But `sudo make check` still hangs, probably on a different test this time.
I am trying to get the backtraces with gdb, but gdb also hangs.
I'll send you the new info once I have it!
 






[jira] [Commented] (MESOS-10226) test suite hangs on ARM64

2021-07-29 Thread Martin Tzvetanov Grigorov (Jira)


[ 
https://issues.apache.org/jira/browse/MESOS-10226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17389717#comment-17389717
 ] 

Martin Tzvetanov Grigorov commented on MESOS-10226:
---

Hi Charles,

I could reproduce the issue easily with `sudo ./bin/mesos-tests.sh --gtest_filter=*NestedMesosContainerizerTest*`.

Now I am rebuilding Mesos with your patch!

I'll update this ticket in half an hour or so!



