[jira] [Commented] (MESOS-7209) Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 255 on windows

2017-04-17 Thread Karen Huang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972150#comment-15972150
 ] 

Karen Huang commented on MESOS-7209:


I've tried to build Mesos with the latest revision. The issue went away. Thank you!

> Mesos failed to build due to error MSB6006: "cmd.exe" exited with code 255 on 
> windows
> -
>
> Key: MESOS-7209
> URL: https://issues.apache.org/jira/browse/MESOS-7209
> Project: Mesos
>  Issue Type: Bug
> Environment: Windows 10 (64bit) + VS2015 Update 3
>Reporter: Karen Huang
>
> I tried to build Mesos with the Debug|x64 configuration on Windows. The build 
> failed due to error MSB6006: "cmd.exe" exited with code 255 
> [F:\mesos\build_x64\ensure_tool_arch.vcxproj]. This error is reported when 
> building the ensure_tool_arch.vcxproj project.
> Here are the repro steps:
> 1. git clone -c core.autocrlf=true https://github.com/apache/mesos 
> F:\mesos\src
> 2. Open a VS amd64 command prompt as admin and browse to F:\mesos\src
> 3. set PreferredToolArchitecture=x64
> 4. bootstrap.bat
> 5. mkdir build_x64 && pushd build_x64
> 6. cmake ..\src -G "Visual Studio 14 2015 Win64" -DENABLE_LIBEVENT=1 
> -DHAS_AUTHENTICATION=0 -DPATCHEXE_PATH="C:\gnuwin32\bin"
> 7. msbuild Mesos.sln /p:Configuration=Debug /p:Platform=x64 /m /t:Rebuild
> Error message:
>  CustomBuild:
>  Building Custom Rule F:/mesos/src/CMakeLists.txt
>  CMake does not need to re-run because 
> F:\mesos\build_x64\CMakeFiles\generate.stamp is up-to-date.
>  ( was unexpected at this time.
> 43>C:\Program Files 
> (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Microsoft.CppCommon.targets(171,5): 
> error MSB6006: "cmd.exe" exited with code 255. 
> [F:\mesos\build_x64\ensure_tool_arch.vcxproj]
> If you build the ensure_tool_arch.vcxproj project separately in the VS IDE, 
> the error info is as below:
> 2>-- Rebuild All started: Project: ensure_tool_arch, Configuration: Debug 
> x64 --
> 2>  Building Custom Rule D:/Mesos/src/CMakeLists.txt
> 2>  CMake does not need to re-run because 
> D:\Mesos\build_x64\CMakeFiles\generate.stamp is up-to-date.
> 2>  ( was unexpected at this time.
> 2>C:\Program Files 
> (x86)\MSBuild\Microsoft.Cpp\v4.0\V140\Microsoft.CppCommon.targets(171,5): 
> error MSB6006: "cmd.exe" exited with code 255.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7350) Failed to pull image from Nexus Registry due to signature missing.

2017-04-17 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-7350:
--
Fix Version/s: 1.3.0
   1.2.1
   1.1.2

> Failed to pull image from Nexus Registry due to signature missing.
> --
>
> Key: MESOS-7350
> URL: https://issues.apache.org/jira/browse/MESOS-7350
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Nikolay Ustinov
>Assignee: Gilbert Song
> Fix For: 1.1.2, 1.2.1, 1.3.0
>
>
> I’m trying to launch a Docker container with the universal containerizer on 
> Mesos 1.2.0, but I’m getting the error “Failed to parse the image manifest: 
> Docker v2 image manifest validation failed: ‘signatures’ field size must be 
> at least one”. If I switch to the Docker containerizer, the app starts 
> normally.
> We are working with a private Docker registry v2 backed by Nexus Repository 
> Manager 3.1.0.
> {code}
> cat /etc/mesos-slave/docker_registry 
> https://docker.company.ru
> cat /etc/mesos-slave/docker_config 
> {
>   "auths": {
>   "docker.company.ru": {
>   "auth": ""
>   }
>   }
> }
> {code}
> Here is the agent's log:
> {code}
> I0405 22:00:49.860234 44856 slave.cpp:4346] Received ping from 
> slave-observer(7)@10.34.1.31:5050
> I0405 22:00:50.327030 44865 slave.cpp:1625] Got assigned task 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' for framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.327785 44865 slave.cpp:1785] Launching task 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' for framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.329324 44865 paths.cpp:547] Trying to chown 
> '/export/intssd/mesos-slave/workdir/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff'
>  to user 'dockdata'
> I0405 22:00:50.329607 44865 slave.cpp:6896] Checkpointing ExecutorInfo to 
> '/export/intssd/mesos-slave/workdir/meta/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/executor.info'
> I0405 22:00:50.330531 44865 slave.cpp:6472] Launching executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190- with resources cpus(*)(allocated: 
> general_marathon_service_role):0.1; mem(*)(allocated: 
> general_marathon_service_role):32 in work directory 
> '/export/intssd/mesos-slave/workdir/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff'
> I0405 22:00:50.331244 44865 slave.cpp:6919] Checkpointing TaskInfo to 
> '/export/intssd/mesos-slave/workdir/meta/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff/tasks/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/task.info'
> I0405 22:00:50.331568 44862 docker.cpp:1106] Skipping non-docker container
> I0405 22:00:50.331822 44865 slave.cpp:2118] Queued task 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' for executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.331966 44865 slave.cpp:884] Successfully attached file 
> '/export/intssd/mesos-slave/workdir/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff'
> I0405 22:00:50.332582 44861 containerizer.cpp:993] Starting container 
> f82f5f69-87a3-4586-b4cc-b91d285dcaff for executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.333286 44862 metadata_manager.cpp:168] Looking for image 
> 'docker.company.ru/company-infra/kafka:0.10.2.0-16'
> I0405 22:00:50.333627 44879 registry_puller.cpp:247] Pulling image 
> 'docker.company.ru/company-infra/kafka:0.10.2.0-16' from 
> 'docker-manifest://docker.company.rucompany-infra/kafka?0.10.2.0-16#https' to 
> '/export/intssd/mesos-slave/docker-store/staging/aV2yko'
> E0405 22:00:50.834630 44872 slave.cpp:4642] Container 
> 'f82f5f69-87a3-4586-b4cc-b91d285dcaff' for executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190- failed to start: Failed to parse 
> the 

[jira] [Commented] (MESOS-6791) Allow to specific the device whitelist entries in cgroup devices subsystem

2017-04-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972022#comment-15972022
 ] 

haosdent commented on MESOS-6791:
-

I have reverted this patch since we need to change something at the API level as well.
{code}
commit 3398c95b0cbdf37a7ad8078fdbdb79e020e305ca
Author: Haosdent Huang 
Date:   Tue Apr 18 10:09:23 2017 +0800

Revert "Allowed whitelist additional devices in cgroups devices subsystem."

This reverts commit ff9ed0c831c347204d065c5f39e5c8bb86f38514.
{code}
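
For readers unfamiliar with the underlying mechanism: the cgroup v1 devices subsystem whitelists devices through entries of the form {{<type> <major>:<minor> <access>}} written to {{devices.allow}}. Below is a minimal standalone sketch of that kernel interface only; the cgroup path and device numbers are made up for illustration and this is not the Mesos isolator code.
{code}
// Illustrative only: writing a cgroup v1 devices whitelist entry.
// The cgroup path below is hypothetical; this is not Mesos code.
#include <fstream>
#include <iostream>
#include <string>

int main()
{
  const std::string cgroup = "/sys/fs/cgroup/devices/mesos/example-container";

  // Entry format: "<type> <major>:<minor> <access>", e.g. allow read, write
  // and mknod for character device 10:200 (commonly /dev/net/tun).
  const std::string entry = "c 10:200 rwm";

  std::ofstream allow(cgroup + "/devices.allow");
  if (!allow) {
    std::cerr << "Failed to open devices.allow" << std::endl;
    return 1;
  }

  allow << entry << std::endl;
  return 0;
}
{code}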

> Allow to specific the device whitelist entries in cgroup devices subsystem
> --
>
> Key: MESOS-6791
> URL: https://issues.apache.org/jira/browse/MESOS-6791
> Project: Mesos
>  Issue Type: Task
>  Components: cgroups
>Reporter: haosdent
>Assignee: haosdent
>  Labels: cgroups
> Fix For: 1.3.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7396) Build errors on a recent Linux (4.10.9)

2017-04-17 Thread François Garillot (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

François Garillot updated MESOS-7396:
-
Environment: 
ArchLinux 

kernel 4.10.9-1-ARCH #1 SMP PREEMPT Sat Apr 8 12:39:59 CEST 2017 x86_64 
GNU/Linux

gcc (GCC) 5.3.0 and gcc (GCC) 6.3.1 20170306 (same results for both)

All this is reported on 1.2.0. I also obtained the aliasing issue on 1.1.0 
(same kernel), but did not pursue further.


  was:
ArchLinux 

kernel 4.10.9-1-ARCH #1 SMP PREEMPT Sat Apr 8 12:39:59 CEST 2017 x86_64 
GNU/Linux

gcc (GCC) 5.3.0 and gcc (GCC) 6.3.1 20170306 (same results for both)

I obtained the aliasing issue on 1.2.0



> Build errors on a recent Linux (4.10.9)
> ---
>
> Key: MESOS-7396
> URL: https://issues.apache.org/jira/browse/MESOS-7396
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
> Environment: ArchLinux 
> kernel 4.10.9-1-ARCH #1 SMP PREEMPT Sat Apr 8 12:39:59 CEST 2017 x86_64 
> GNU/Linux
> gcc (GCC) 5.3.0 and gcc (GCC) 6.3.1 20170306 (same results for both)
> All this is reported on 1.2.0. I also obtained the aliasing issue on 1.1.0 
> (same kernel), but did not pursue further.
>Reporter: François Garillot
>  Labels: build-failure, build-problem
>
> A couple of issues building with the regular PKGBUILD for Archlinux:
> https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=mesos
> The build script (simple and somewhat readable) includes the following 
> notable flags:
> {code}
>  ../configure \
>   --enable-optimize \
>   --prefix=/usr \
>   --sysconfdir=/etc \
>   --libexecdir=/usr/lib \
>   --exec-prefix=/usr \
>   --sbindir=/usr/bin \
>   --with-network-isolator
>  make
> {code}
> The first set of errors is:
> {code}
> In file included from ../../src/checks/health_checker.cpp:56:0:
> ../../src/linux/ns.hpp: In function ‘Try ns::clone(pid_t, int, const 
> std::function&, int)’:
> ../../src/linux/ns.hpp:487:69: error: dereferencing type-punned pointer will 
> break strict-aliasing rules [-Werror=strict-aliasing]
>  pid_t pid = ((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->pid;
>  ^
> ../../src/linux/ns.hpp: In lambda function:
> ../../src/linux/ns.hpp:589:59: error: dereferencing type-punned pointer will 
> break strict-aliasing rules [-Werror=strict-aliasing]
>((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->pid = ::getpid();
>^
> ../../src/linux/ns.hpp:590:59: error: dereferencing type-punned pointer will 
> break strict-aliasing rules [-Werror=strict-aliasing]
>((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->uid = ::getuid();
>^
> ../../src/linux/ns.hpp:591:59: error: dereferencing type-punned pointer will 
> break strict-aliasing rules [-Werror=strict-aliasing]
>((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->gid = ::getgid();
>^
> cc1plus: all warnings being treated as errors
> make[2]: *** [Makefile:6848: 
> checks/libmesos_no_3rdparty_la-health_checker.lo] Error 1
> make[2]: Leaving directory '/home/huitseeker/mesos/src/mesos-1.2.0/build/src'
> make[1]: *** [Makefile:3476: all] Error 2
> make[1]: Leaving directory '/home/huitseeker/mesos/src/mesos-1.2.0/build/src'
> make: *** [Makefile:765: all-recursive] Error 1
> ==> ERROR: A failure occurred in build().
> Aborting...
> {code}
> Full log: https://gist.github.com/7b01ff080d91780ad5e4825dff610517
> This can be fixed by adding:
> {code}
> CPPFLAGS="-fno-strict-aliasing"
> {code}
> before the above call to {{configure}}.
> The following build error is:
> {code}
> ../../src/linux/fs.cpp:273:13: error: In the GNU C Library, "makedev" is 
> defined by <sys/sysmacros.h>. For historical compatibility, it is
>  currently defined by <sys/types.h> as well, but we plan to
>  remove this soon. To use "makedev", include <sys/sysmacros.h>
>  directly. If you did not intend to use a system-defined macro
>  "makedev", you should undefine it after including <sys/types.h>. [-Werror]
>entry.devno = makedev(major.get(), minor.get());
>  ^
> cc1plus: all warnings being treated as errors
> make[2]: *** [Makefile:7716: linux/libmesos_no_3rdparty_la-fs.lo] Error 1
> make[2]: Leaving directory '/home/huitseeker/mesos/src/mesos-1.2.0/build/src'
> make[1]: *** [Makefile:3476: all] Error 2
> make[1]: Leaving directory '/home/huitseeker/mesos/src/mesos-1.2.0/build/src'
> make: *** [Makefile:765: all-recursive] Error 1
> ==> ERROR: A failure occurred in build().
> Aborting...
> {code}
> Full log: 
> https://gist.github.com/be7ba7cd3251ae9ac1b63b09ee2a38cf
> This is fixed by adding 
> {code}
> #include <sys/sysmacros.h>
> {code}
> towards the end of the external includes in {{src/mesos-1.2.0/src/linux/fs.cpp}}
> 

[jira] [Updated] (MESOS-7396) Build errors on a recent Linux (4.10.9)

2017-04-17 Thread François Garillot (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7396?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

François Garillot updated MESOS-7396:
-
Description: 
A couple of issues building with the regular PKGBUILD for Archlinux:
https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=mesos

The build script (simple and somewhat readable) includes the following notable 
flags:
{code}
 ../configure \
  --enable-optimize \
  --prefix=/usr \
  --sysconfdir=/etc \
  --libexecdir=/usr/lib \
  --exec-prefix=/usr \
  --sbindir=/usr/bin \
  --with-network-isolator
 make
{code}

The first set of errors is:
{code}
In file included from ../../src/checks/health_checker.cpp:56:0:
../../src/linux/ns.hpp: In function ‘Try ns::clone(pid_t, int, const 
std::function&, int)’:
../../src/linux/ns.hpp:487:69: error: dereferencing type-punned pointer will 
break strict-aliasing rules [-Werror=strict-aliasing]
 pid_t pid = ((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->pid;
 ^
../../src/linux/ns.hpp: In lambda function:
../../src/linux/ns.hpp:589:59: error: dereferencing type-punned pointer will 
break strict-aliasing rules [-Werror=strict-aliasing]
   ((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->pid = ::getpid();
   ^
../../src/linux/ns.hpp:590:59: error: dereferencing type-punned pointer will 
break strict-aliasing rules [-Werror=strict-aliasing]
   ((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->uid = ::getuid();
   ^
../../src/linux/ns.hpp:591:59: error: dereferencing type-punned pointer will 
break strict-aliasing rules [-Werror=strict-aliasing]
   ((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->gid = ::getgid();
   ^
cc1plus: all warnings being treated as errors
make[2]: *** [Makefile:6848: checks/libmesos_no_3rdparty_la-health_checker.lo] 
Error 1
make[2]: Leaving directory '/home/huitseeker/mesos/src/mesos-1.2.0/build/src'
make[1]: *** [Makefile:3476: all] Error 2
make[1]: Leaving directory '/home/huitseeker/mesos/src/mesos-1.2.0/build/src'
make: *** [Makefile:765: all-recursive] Error 1
==> ERROR: A failure occurred in build().
Aborting...
{code}

Full log: https://gist.github.com/7b01ff080d91780ad5e4825dff610517

This can be worked around by adding:
{code}
CPPFLAGS="-fno-strict-aliasing"
{code}

before the above call to {{configure}}.
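
For context, {{-fno-strict-aliasing}} only tells the compiler to stop assuming that differently-typed pointers never alias; the pattern being flagged is the cast of the raw control-message bytes to {{struct ucred*}}. Here is a minimal standalone sketch of that pattern and of the usual aliasing-safe alternative ({{memcpy}} into a properly typed object); it is illustrative only and not the actual ns.hpp code:
{code}
// Minimal sketch of the aliasing issue, independent of Mesos: reading a
// struct ucred out of an SCM_CREDENTIALS control message.
#include <cstring>      // memcpy
#include <sys/socket.h> // msghdr, cmsghdr, CMSG_*, struct ucred (glibc)
#include <sys/types.h>

pid_t credentialsPid(msghdr& msg)
{
  cmsghdr* cmsg = CMSG_FIRSTHDR(&msg);
  if (cmsg == nullptr) {
    return -1;
  }

  // Pattern flagged by -Werror=strict-aliasing: cast the raw control
  // message bytes to `struct ucred*` and dereference them directly.
  //
  //   pid_t pid = ((struct ucred*) CMSG_DATA(cmsg))->pid;

  // Aliasing-safe alternative: copy the bytes into a real ucred object.
  struct ucred creds;
  std::memcpy(&creds, CMSG_DATA(cmsg), sizeof(creds));
  return creds.pid;
}
{code}
With the memcpy form, the diagnostic goes away without having to pass -fno-strict-aliasing to the whole build.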

The following build error is:

{code}
../../src/linux/fs.cpp:273:13: error: In the GNU C Library, "makedev" is defined
 by <sys/sysmacros.h>. For historical compatibility, it is
 currently defined by <sys/types.h> as well, but we plan to
 remove this soon. To use "makedev", include <sys/sysmacros.h>
 directly. If you did not intend to use a system-defined macro
 "makedev", you should undefine it after including <sys/types.h>. [-Werror]
   entry.devno = makedev(major.get(), minor.get());
 ^
cc1plus: all warnings being treated as errors
make[2]: *** [Makefile:7716: linux/libmesos_no_3rdparty_la-fs.lo] Error 1
make[2]: Leaving directory '/home/huitseeker/mesos/src/mesos-1.2.0/build/src'
make[1]: *** [Makefile:3476: all] Error 2
make[1]: Leaving directory '/home/huitseeker/mesos/src/mesos-1.2.0/build/src'
make: *** [Makefile:765: all-recursive] Error 1
==> ERROR: A failure occurred in build().
Aborting...
{code}

Full log: 
https://gist.github.com/be7ba7cd3251ae9ac1b63b09ee2a38cf

This is fixed by adding 

{code}
#include <sys/sysmacros.h>
{code}

towards the end of the external includes in {{src/mesos-1.2.0/src/linux/fs.cpp}}

Finally, the same error is triggered by the use of {{major}} and {{minor}} in 
{{src/mesos-1.2.0/src/slave/containerizer/mesos/isolators/gpu/isolator.cpp}} 
and is fixed by the same include as well.
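
As a quick illustration of that fix (a standalone sketch, not the actual Mesos patch): with newer glibc, {{makedev}}, {{major}} and {{minor}} come from {{<sys/sysmacros.h>}} and should be included explicitly rather than relying on {{<sys/types.h>}} to provide them:
{code}
// Minimal sketch: include <sys/sysmacros.h> explicitly wherever
// makedev()/major()/minor() are used, instead of relying on <sys/types.h>.
#include <sys/sysmacros.h>
#include <sys/types.h>

#include <iostream>

int main()
{
  // Build a device number from major/minor and split it back apart.
  dev_t devno = makedev(8, 1); // 8:1 is /dev/sda1 on many systems

  std::cout << "major=" << major(devno)
            << " minor=" << minor(devno) << std::endl;

  return 0;
}
{code}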

 (If you want to reproduce under Archlinux, use {{makepkg -e}} after any 
edit of the source, though the Arch build scripts are not necessary.)

  was:
A couple of issues building with the regular PKGBUILD for Archlinux:
https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=mesos

The build script (simple and somewhat readable) includes the following notable 
flags :
{code}
 ../configure \
  --enable-optimize \
  --prefix=/usr \
  --sysconfdir=/etc \
  --libexecdir=/usr/lib \
  --exec-prefix=/usr \
  --sbindir=/usr/bin \
  --with-network-isolator
 make
{code}

The first set of errors is :
{code}
In file included from ../../src/checks/health_checker.cpp:56:0:
../../src/linux/ns.hpp: In function ‘Try ns::clone(pid_t, int, const 
std::function&, int)’:
../../src/linux/ns.hpp:487:69: error: dereferencing type-punned pointer will 
break strict-aliasing rules [-Werror=strict-aliasing]
 pid_t pid = ((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->pid;
 ^
../../src/linux/ns.hpp: In lambda function:
../../src/linux/ns.hpp:589:59: error: dereferencing 

[jira] [Commented] (MESOS-7210) HTTP health check doesn't work when mesos runs with --docker_mesos_image

2017-04-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15972021#comment-15972021
 ] 

haosdent commented on MESOS-7210:
-

Hi [~adam-mesos], thanks a lot, I have backported it to 1.2.x and 1.1.x.

> HTTP health check doesn't work when mesos runs with --docker_mesos_image
> 
>
> Key: MESOS-7210
> URL: https://issues.apache.org/jira/browse/MESOS-7210
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
> Environment: Ubuntu 16.04.02
> Docker version 1.13.1
> mesos 1.1.0, runs from container
> docker containers  spawned by marathon 1.4.1
>Reporter: Wojciech Sielski
>Assignee: Deshi Xiao
>Priority: Critical
> Fix For: 1.1.2, 1.2.1, 1.3.0
>
>
> When running mesos-slave with option "docker_mesos_image" like:
> {code}
> --master=zk://standalone:2181/mesos  --containerizers=docker,mesos  
> --executor_registration_timeout=5mins  --hostname=standalone  --ip=0.0.0.0  
> --docker_stop_timeout=5secs  --gc_delay=1days  
> --docker_socket=/var/run/docker.sock  --no-systemd_enable_support  
> --work_dir=/tmp/mesos  --docker_mesos_image=panteras/paas-in-a-box:0.4.0
> {code}
> from the container that was started with option "pid: host" like:
> {code}
>   net:host
>   privileged: true
>   pid:host
> {code}
> and an example Marathon job that uses MESOS_HTTP checks like:
> {code}
> {
>  "id": "python-example-stable",
>  "cmd": "python3 -m http.server 8080",
>  "mem": 16,
>  "cpus": 0.1,
>  "instances": 2,
>  "container": {
>"type": "DOCKER",
>"docker": {
>  "image": "python:alpine",
>  "network": "BRIDGE",
>  "portMappings": [
> { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" }
>  ]
>}
>  },
>  "env": {
>"SERVICE_NAME" : "python"
>  },
>  "healthChecks": [
>{
>  "path": "/",
>  "portIndex": 0,
>  "protocol": "MESOS_HTTP",
>  "gracePeriodSeconds": 30,
>  "intervalSeconds": 10,
>  "timeoutSeconds": 30,
>  "maxConsecutiveFailures": 3
>}
>  ]
> }
> {code}
> I see the errors like:
> {code}
> F0306 07:41:58.84429335 health_checker.cpp:94] Failed to enter the net 
> namespace of task (pid: '13527'): Pid 13527 does not exist
> *** Check failure stack trace: ***
> @ 0x7f51770b0c1d  google::LogMessage::Fail()
> @ 0x7f51770b29d0  google::LogMessage::SendToLog()
> @ 0x7f51770b0803  google::LogMessage::Flush()
> @ 0x7f51770b33f9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f517647ce46  
> _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7f517647bf2b  mesos::internal::health::cloneWithSetns()
> @ 0x7f517648374b  std::_Function_handler<>::_M_invoke()
> @ 0x7f5177068167  process::internal::cloneChild()
> @ 0x7f5177065c32  process::subprocess()
> @ 0x7f5176481a9d  
> mesos::internal::health::HealthCheckerProcess::_httpHealthCheck()
> @ 0x7f51764831f7  
> mesos::internal::health::HealthCheckerProcess::_healthCheck()
> @ 0x7f517701f38c  process::ProcessBase::visit()
> @ 0x7f517702c8b3  process::ProcessManager::resume()
> @ 0x7f517702fb77  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f51754ddc80  (unknown)
> @ 0x7f5174cf06ba  start_thread
> @ 0x7f5174a2682d  (unknown)
> I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as 
> health check still in grace period
> {code}
> It looks like the docker_mesos_image option means that the newly started 
> Mesos job does not use the "pid: host" option the mother container was 
> started with, but gets its own PID namespace (so it doesn't matter whether 
> the mother container was started with "pid: host" or not, it will never be 
> able to find the PID).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7210) HTTP health check doesn't work when mesos runs with --docker_mesos_image

2017-04-17 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-7210:

Fix Version/s: 1.2.1
   1.1.2

> HTTP health check doesn't work when mesos runs with --docker_mesos_image
> 
>
> Key: MESOS-7210
> URL: https://issues.apache.org/jira/browse/MESOS-7210
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
> Environment: Ubuntu 16.04.02
> Docker version 1.13.1
> mesos 1.1.0, runs from container
> docker containers  spawned by marathon 1.4.1
>Reporter: Wojciech Sielski
>Assignee: Deshi Xiao
>Priority: Critical
> Fix For: 1.1.2, 1.2.1, 1.3.0
>
>
> When running mesos-slave with option "docker_mesos_image" like:
> {code}
> --master=zk://standalone:2181/mesos  --containerizers=docker,mesos  
> --executor_registration_timeout=5mins  --hostname=standalone  --ip=0.0.0.0  
> --docker_stop_timeout=5secs  --gc_delay=1days  
> --docker_socket=/var/run/docker.sock  --no-systemd_enable_support  
> --work_dir=/tmp/mesos  --docker_mesos_image=panteras/paas-in-a-box:0.4.0
> {code}
> from the container that was started with option "pid: host" like:
> {code}
>   net:host
>   privileged: true
>   pid:host
> {code}
> and an example Marathon job that uses MESOS_HTTP checks like:
> {code}
> {
>  "id": "python-example-stable",
>  "cmd": "python3 -m http.server 8080",
>  "mem": 16,
>  "cpus": 0.1,
>  "instances": 2,
>  "container": {
>"type": "DOCKER",
>"docker": {
>  "image": "python:alpine",
>  "network": "BRIDGE",
>  "portMappings": [
> { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" }
>  ]
>}
>  },
>  "env": {
>"SERVICE_NAME" : "python"
>  },
>  "healthChecks": [
>{
>  "path": "/",
>  "portIndex": 0,
>  "protocol": "MESOS_HTTP",
>  "gracePeriodSeconds": 30,
>  "intervalSeconds": 10,
>  "timeoutSeconds": 30,
>  "maxConsecutiveFailures": 3
>}
>  ]
> }
> {code}
> I see the errors like:
> {code}
> F0306 07:41:58.84429335 health_checker.cpp:94] Failed to enter the net 
> namespace of task (pid: '13527'): Pid 13527 does not exist
> *** Check failure stack trace: ***
> @ 0x7f51770b0c1d  google::LogMessage::Fail()
> @ 0x7f51770b29d0  google::LogMessage::SendToLog()
> @ 0x7f51770b0803  google::LogMessage::Flush()
> @ 0x7f51770b33f9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f517647ce46  
> _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7f517647bf2b  mesos::internal::health::cloneWithSetns()
> @ 0x7f517648374b  std::_Function_handler<>::_M_invoke()
> @ 0x7f5177068167  process::internal::cloneChild()
> @ 0x7f5177065c32  process::subprocess()
> @ 0x7f5176481a9d  
> mesos::internal::health::HealthCheckerProcess::_httpHealthCheck()
> @ 0x7f51764831f7  
> mesos::internal::health::HealthCheckerProcess::_healthCheck()
> @ 0x7f517701f38c  process::ProcessBase::visit()
> @ 0x7f517702c8b3  process::ProcessManager::resume()
> @ 0x7f517702fb77  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f51754ddc80  (unknown)
> @ 0x7f5174cf06ba  start_thread
> @ 0x7f5174a2682d  (unknown)
> I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as 
> health check still in grace period
> {code}
> It looks like the docker_mesos_image option means that the newly started 
> Mesos job does not use the "pid: host" option the mother container was 
> started with, but gets its own PID namespace (so it doesn't matter whether 
> the mother container was started with "pid: host" or not, it will never be 
> able to find the PID).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7210) HTTP health check doesn't work when mesos runs with --docker_mesos_image

2017-04-17 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-7210:

Summary: HTTP health check doesn't work when mesos runs with 
--docker_mesos_image  (was: MESOS HTTP checks doesn't work when mesos runs with 
--docker_mesos_image)

> HTTP health check doesn't work when mesos runs with --docker_mesos_image
> 
>
> Key: MESOS-7210
> URL: https://issues.apache.org/jira/browse/MESOS-7210
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
> Environment: Ubuntu 16.04.02
> Docker version 1.13.1
> mesos 1.1.0, runs from container
> docker containers  spawned by marathon 1.4.1
>Reporter: Wojciech Sielski
>Assignee: Deshi Xiao
>Priority: Critical
> Fix For: 1.3.0
>
>
> When running mesos-slave with option "docker_mesos_image" like:
> {code}
> --master=zk://standalone:2181/mesos  --containerizers=docker,mesos  
> --executor_registration_timeout=5mins  --hostname=standalone  --ip=0.0.0.0  
> --docker_stop_timeout=5secs  --gc_delay=1days  
> --docker_socket=/var/run/docker.sock  --no-systemd_enable_support  
> --work_dir=/tmp/mesos  --docker_mesos_image=panteras/paas-in-a-box:0.4.0
> {code}
> from the container that was started with option "pid: host" like:
> {code}
>   net:host
>   privileged: true
>   pid:host
> {code}
> and an example Marathon job that uses MESOS_HTTP checks like:
> {code}
> {
>  "id": "python-example-stable",
>  "cmd": "python3 -m http.server 8080",
>  "mem": 16,
>  "cpus": 0.1,
>  "instances": 2,
>  "container": {
>"type": "DOCKER",
>"docker": {
>  "image": "python:alpine",
>  "network": "BRIDGE",
>  "portMappings": [
> { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" }
>  ]
>}
>  },
>  "env": {
>"SERVICE_NAME" : "python"
>  },
>  "healthChecks": [
>{
>  "path": "/",
>  "portIndex": 0,
>  "protocol": "MESOS_HTTP",
>  "gracePeriodSeconds": 30,
>  "intervalSeconds": 10,
>  "timeoutSeconds": 30,
>  "maxConsecutiveFailures": 3
>}
>  ]
> }
> {code}
> I see the errors like:
> {code}
> F0306 07:41:58.84429335 health_checker.cpp:94] Failed to enter the net 
> namespace of task (pid: '13527'): Pid 13527 does not exist
> *** Check failure stack trace: ***
> @ 0x7f51770b0c1d  google::LogMessage::Fail()
> @ 0x7f51770b29d0  google::LogMessage::SendToLog()
> @ 0x7f51770b0803  google::LogMessage::Flush()
> @ 0x7f51770b33f9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f517647ce46  
> _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7f517647bf2b  mesos::internal::health::cloneWithSetns()
> @ 0x7f517648374b  std::_Function_handler<>::_M_invoke()
> @ 0x7f5177068167  process::internal::cloneChild()
> @ 0x7f5177065c32  process::subprocess()
> @ 0x7f5176481a9d  
> mesos::internal::health::HealthCheckerProcess::_httpHealthCheck()
> @ 0x7f51764831f7  
> mesos::internal::health::HealthCheckerProcess::_healthCheck()
> @ 0x7f517701f38c  process::ProcessBase::visit()
> @ 0x7f517702c8b3  process::ProcessManager::resume()
> @ 0x7f517702fb77  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f51754ddc80  (unknown)
> @ 0x7f5174cf06ba  start_thread
> @ 0x7f5174a2682d  (unknown)
> I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as 
> health check still in grace period
> {code}
> It looks like the docker_mesos_image option means that the newly started 
> Mesos job does not use the "pid: host" option the mother container was 
> started with, but gets its own PID namespace (so it doesn't matter whether 
> the mother container was started with "pid: host" or not, it will never be 
> able to find the PID).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image

2017-04-17 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-7210:

Summary: MESOS HTTP checks doesn't work when mesos runs with 
--docker_mesos_image  (was: MESOS HTTP checks doesn't work when mesos runs with 
--docker_mesos_image ( pid namespace mismatch ))

> MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image
> 
>
> Key: MESOS-7210
> URL: https://issues.apache.org/jira/browse/MESOS-7210
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
> Environment: Ubuntu 16.04.02
> Docker version 1.13.1
> mesos 1.1.0, runs from container
> docker containers  spawned by marathon 1.4.1
>Reporter: Wojciech Sielski
>Assignee: Deshi Xiao
>Priority: Critical
> Fix For: 1.3.0
>
>
> When running mesos-slave with option "docker_mesos_image" like:
> {code}
> --master=zk://standalone:2181/mesos  --containerizers=docker,mesos  
> --executor_registration_timeout=5mins  --hostname=standalone  --ip=0.0.0.0  
> --docker_stop_timeout=5secs  --gc_delay=1days  
> --docker_socket=/var/run/docker.sock  --no-systemd_enable_support  
> --work_dir=/tmp/mesos  --docker_mesos_image=panteras/paas-in-a-box:0.4.0
> {code}
> from the container that was started with option "pid: host" like:
> {code}
>   net:host
>   privileged: true
>   pid:host
> {code}
> and an example Marathon job that uses MESOS_HTTP checks like:
> {code}
> {
>  "id": "python-example-stable",
>  "cmd": "python3 -m http.server 8080",
>  "mem": 16,
>  "cpus": 0.1,
>  "instances": 2,
>  "container": {
>"type": "DOCKER",
>"docker": {
>  "image": "python:alpine",
>  "network": "BRIDGE",
>  "portMappings": [
> { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" }
>  ]
>}
>  },
>  "env": {
>"SERVICE_NAME" : "python"
>  },
>  "healthChecks": [
>{
>  "path": "/",
>  "portIndex": 0,
>  "protocol": "MESOS_HTTP",
>  "gracePeriodSeconds": 30,
>  "intervalSeconds": 10,
>  "timeoutSeconds": 30,
>  "maxConsecutiveFailures": 3
>}
>  ]
> }
> {code}
> I see the errors like:
> {code}
> F0306 07:41:58.84429335 health_checker.cpp:94] Failed to enter the net 
> namespace of task (pid: '13527'): Pid 13527 does not exist
> *** Check failure stack trace: ***
> @ 0x7f51770b0c1d  google::LogMessage::Fail()
> @ 0x7f51770b29d0  google::LogMessage::SendToLog()
> @ 0x7f51770b0803  google::LogMessage::Flush()
> @ 0x7f51770b33f9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f517647ce46  
> _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7f517647bf2b  mesos::internal::health::cloneWithSetns()
> @ 0x7f517648374b  std::_Function_handler<>::_M_invoke()
> @ 0x7f5177068167  process::internal::cloneChild()
> @ 0x7f5177065c32  process::subprocess()
> @ 0x7f5176481a9d  
> mesos::internal::health::HealthCheckerProcess::_httpHealthCheck()
> @ 0x7f51764831f7  
> mesos::internal::health::HealthCheckerProcess::_healthCheck()
> @ 0x7f517701f38c  process::ProcessBase::visit()
> @ 0x7f517702c8b3  process::ProcessManager::resume()
> @ 0x7f517702fb77  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f51754ddc80  (unknown)
> @ 0x7f5174cf06ba  start_thread
> @ 0x7f5174a2682d  (unknown)
> I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as 
> health check still in grace period
> {code}
> It looks like the docker_mesos_image option means that the newly started 
> Mesos job does not use the "pid: host" option the mother container was 
> started with, but gets its own PID namespace (so it doesn't matter whether 
> the mother container was started with "pid: host" or not, it will never be 
> able to find the PID).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7396) Build errors on a recent Linux (4.10.9)

2017-04-17 Thread François Garillot (JIRA)
François Garillot created MESOS-7396:


 Summary: Build errors on a recent Linux (4.10.9)
 Key: MESOS-7396
 URL: https://issues.apache.org/jira/browse/MESOS-7396
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.2.0
 Environment: ArchLinux 

kernel 4.10.9-1-ARCH #1 SMP PREEMPT Sat Apr 8 12:39:59 CEST 2017 x86_64 
GNU/Linux

gcc (GCC) 5.3.0 and gcc (GCC) 6.3.1 20170306 (same results for both)

I obtained the aliasing issue on 1.2.0

Reporter: François Garillot


A couple of issues building with the regular PKGBUILD for Archlinux:
https://aur.archlinux.org/cgit/aur.git/tree/PKGBUILD?h=mesos

The build script (simple and somewhat readable) includes the following notable 
flags:
{code}
 ../configure \
  --enable-optimize \
  --prefix=/usr \
  --sysconfdir=/etc \
  --libexecdir=/usr/lib \
  --exec-prefix=/usr \
  --sbindir=/usr/bin \
  --with-network-isolator
 make
{code}

The first set of errors is:
{code}
In file included from ../../src/checks/health_checker.cpp:56:0:
../../src/linux/ns.hpp: In function ‘Try ns::clone(pid_t, int, const 
std::function&, int)’:
../../src/linux/ns.hpp:487:69: error: dereferencing type-punned pointer will 
break strict-aliasing rules [-Werror=strict-aliasing]
 pid_t pid = ((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->pid;
 ^
../../src/linux/ns.hpp: In lambda function:
../../src/linux/ns.hpp:589:59: error: dereferencing type-punned pointer will 
break strict-aliasing rules [-Werror=strict-aliasing]
   ((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->pid = ::getpid();
   ^
../../src/linux/ns.hpp:590:59: error: dereferencing type-punned pointer will 
break strict-aliasing rules [-Werror=strict-aliasing]
   ((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->uid = ::getuid();
   ^
../../src/linux/ns.hpp:591:59: error: dereferencing type-punned pointer will 
break strict-aliasing rules [-Werror=strict-aliasing]
   ((struct ucred*) CMSG_DATA(CMSG_FIRSTHDR()))->gid = ::getgid();
   ^
cc1plus: all warnings being treated as errors
make[2]: *** [Makefile:6848: checks/libmesos_no_3rdparty_la-health_checker.lo] 
Error 1
make[2]: Leaving directory '/home/huitseeker/mesos/src/mesos-1.2.0/build/src'
make[1]: *** [Makefile:3476: all] Error 2
make[1]: Leaving directory '/home/huitseeker/mesos/src/mesos-1.2.0/build/src'
make: *** [Makefile:765: all-recursive] Error 1
==> ERROR: A failure occurred in build().
Aborting...
{code}

Full log: https://gist.github.com/7b01ff080d91780ad5e4825dff610517

This can be fixed by adding:
{code}
CPPFLAGS="-fno-strict-aliasing"
{code}

before the above call to {{configure}}.

The following build error is:

{code}
../../src/linux/fs.cpp:273:13: error: In the GNU C Library, "makedev" is defined
 by <sys/sysmacros.h>. For historical compatibility, it is
 currently defined by <sys/types.h> as well, but we plan to
 remove this soon. To use "makedev", include <sys/sysmacros.h>
 directly. If you did not intend to use a system-defined macro
 "makedev", you should undefine it after including <sys/types.h>. [-Werror]
   entry.devno = makedev(major.get(), minor.get());
 ^
cc1plus: all warnings being treated as errors
make[2]: *** [Makefile:7716: linux/libmesos_no_3rdparty_la-fs.lo] Error 1
make[2]: Leaving directory '/home/huitseeker/mesos/src/mesos-1.2.0/build/src'
make[1]: *** [Makefile:3476: all] Error 2
make[1]: Leaving directory '/home/huitseeker/mesos/src/mesos-1.2.0/build/src'
make: *** [Makefile:765: all-recursive] Error 1
==> ERROR: A failure occurred in build().
Aborting...
{code}

Full log: 
https://gist.github.com/be7ba7cd3251ae9ac1b63b09ee2a38cf

This is fixed by adding 

{code}
#include <sys/sysmacros.h>
{code}

towards the end of the external includes in {{src/mesos-1.2.0/src/linux/fs.cpp}}

Finally, the same error is triggered by the use of {{major}} and {{minor}} in 
{{src/mesos-1.2.0/src/slave/containerizer/mesos/isolators/gpu/isolator.cpp}} 
and is fixed by the same include as well.

 (If you want to reproduce under Archlinux, use {{makepkg -e}} after any 
edit of the source, though the Arch build scripts are not necessary.)



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7280) Unified containerizer provisions docker image error with COPY backend

2017-04-17 Thread depay (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971978#comment-15971978
 ] 

depay commented on MESOS-7280:
--

The major part about Python in the Dockerfile is something like this:

{code}
from centos6

run set -eu && yum install -y python27 python27-devel python27-setuptools 
python-setuptools && mv /usr/bin/python /usr/bin/python.bak && ln -s 
/usr/bin/python2.7 /usr/bin/python && for f in /usr/bin/yum 
/usr/bin/yumdownloader;do sed -i s/python/python2.6/ $f;done

run rm -f /usr/bin/python && ln -s /usr/bin/python2.7 /usr/bin/python

run python2.7 -c "import xx" # just import something
{code}
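
For reference, the "Text file busy" in the quoted log below is the ETXTBSY errno: Linux refuses write access to a binary that some process is currently executing. A minimal standalone illustration of the errno itself (not a claim about where exactly the copy backend hits it):
{code}
// Minimal illustration of ETXTBSY ("Text file busy"): opening a binary that
// is currently being executed for writing fails. Pass the path of a running
// executable as argv[1]; nothing is written to it.
#include <fcntl.h>
#include <unistd.h>

#include <cerrno>
#include <cstring>
#include <iostream>

int main(int argc, char** argv)
{
  if (argc < 2) {
    std::cerr << "Usage: " << argv[0] << " <path-to-running-executable>"
              << std::endl;
    return 1;
  }

  int fd = ::open(argv[1], O_WRONLY);
  if (fd == -1) {
    // On a running executable this prints "Text file busy" (ETXTBSY).
    std::cerr << "open() failed: " << std::strerror(errno) << std::endl;
    return 1;
  }

  ::close(fd);
  return 0;
}
{code}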


> Unified containerizer provisions docker image error with COPY backend
> -
>
> Key: MESOS-7280
> URL: https://issues.apache.org/jira/browse/MESOS-7280
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.2, 1.2.0
> Environment: CentOS 7.2,ext4, COPY
>Reporter: depay
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>  Labels: copy-backend
>
> Error occurs on some specific docker images with COPY backend, both 1.0.2 and 
> 1.2.0. It works well with OVERLAY backend on 1.2.0.
> {quote}
> I0321 09:36:07.308830 27613 paths.cpp:528] Trying to chown 
> '/data/mesos/slaves/55f6df5e-2812-40a0-baf5-ce96f20677d3-S102/frameworks/20151223-150303-2677017098-5050-30032-/executors/ct:Transcoding_Test_114489497_1490060156172:3/runs/7e518538-7b56-4b14-a3c9-bee43c669bd7'
>  to user 'root'
> I0321 09:36:07.319628 27613 slave.cpp:5703] Launching executor 
> ct:Transcoding_Test_114489497_1490060156172:3 of framework 
> 20151223-150303-2677017098-5050-30032- with resources cpus(*):0.1; 
> mem(*):32 in work directory 
> '/data/mesos/slaves/55f6df5e-2812-40a0-baf5-ce96f20677d3-S102/frameworks/20151223-150303-2677017098-5050-30032-/executors/ct:Transcoding_Test_114489497_1490060156172:3/runs/7e518538-7b56-4b14-a3c9-bee43c669bd7'
> I0321 09:36:07.321436 27615 containerizer.cpp:781] Starting container 
> '7e518538-7b56-4b14-a3c9-bee43c669bd7' for executor 
> 'ct:Transcoding_Test_114489497_1490060156172:3' of framework 
> '20151223-150303-2677017098-5050-30032-'
> I0321 09:36:37.902195 27600 provisioner.cpp:294] Provisioning image rootfs 
> '/data/mesos/provisioner/containers/7e518538-7b56-4b14-a3c9-bee43c669bd7/backends/copy/rootfses/8d2f7fe8-71ff-4317-a33c-a436241a93d9'
>  for container 7e518538-7b56-4b14-a3c9-bee43c669bd7
> *E0321 09:36:58.707718 27606 slave.cpp:4000] Container 
> '7e518538-7b56-4b14-a3c9-bee43c669bd7' for executor 
> 'ct:Transcoding_Test_114489497_1490060156172:3' of framework 
> 20151223-150303-2677017098-5050-30032- failed to start: Collect failed: 
> Failed to copy layer: cp: cannot create regular file 
> ‘/data/mesos/provisioner/containers/7e518538-7b56-4b14-a3c9-bee43c669bd7/backends/copy/rootfses/8d2f7fe8-71ff-4317-a33c-a436241a93d9/usr/bin/python’:
>  Text file busy*
> I0321 09:36:58.707991 27608 containerizer.cpp:1622] Destroying container 
> '7e518538-7b56-4b14-a3c9-bee43c669bd7'
> I0321 09:36:58.708468 27607 provisioner.cpp:434] Destroying container rootfs 
> at 
> '/data/mesos/provisioner/containers/7e518538-7b56-4b14-a3c9-bee43c669bd7/backends/copy/rootfses/8d2f7fe8-71ff-4317-a33c-a436241a93d9'
>  for container 7e518538-7b56-4b14-a3c9-bee43c669bd7
> {quote}
> Docker image is a private one, so that i have to try to reproduce this bug 
> with some sample Dockerfile as possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7350) Failed to pull image from Nexus Registry due to signature missing.

2017-04-17 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971951#comment-15971951
 ] 

Gilbert Song commented on MESOS-7350:
-

I will close this JIRA once the patch is backported.

> Failed to pull image from Nexus Registry due to signature missing.
> --
>
> Key: MESOS-7350
> URL: https://issues.apache.org/jira/browse/MESOS-7350
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Nikolay Ustinov
>Assignee: Gilbert Song
>
> I’m trying to launch a Docker container with the universal containerizer on 
> Mesos 1.2.0, but I’m getting the error “Failed to parse the image manifest: 
> Docker v2 image manifest validation failed: ‘signatures’ field size must be 
> at least one”. If I switch to the Docker containerizer, the app starts 
> normally.
> We are working with a private Docker registry v2 backed by Nexus Repository 
> Manager 3.1.0.
> {code}
> cat /etc/mesos-slave/docker_registry 
> https://docker.company.ru
> cat /etc/mesos-slave/docker_config 
> {
>   "auths": {
>   "docker.company.ru": {
>   "auth": ""
>   }
>   }
> }
> {code}
> Here is the agent's log:
> {code}
> I0405 22:00:49.860234 44856 slave.cpp:4346] Received ping from 
> slave-observer(7)@10.34.1.31:5050
> I0405 22:00:50.327030 44865 slave.cpp:1625] Got assigned task 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' for framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.327785 44865 slave.cpp:1785] Launching task 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' for framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.329324 44865 paths.cpp:547] Trying to chown 
> '/export/intssd/mesos-slave/workdir/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff'
>  to user 'dockdata'
> I0405 22:00:50.329607 44865 slave.cpp:6896] Checkpointing ExecutorInfo to 
> '/export/intssd/mesos-slave/workdir/meta/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/executor.info'
> I0405 22:00:50.330531 44865 slave.cpp:6472] Launching executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190- with resources cpus(*)(allocated: 
> general_marathon_service_role):0.1; mem(*)(allocated: 
> general_marathon_service_role):32 in work directory 
> '/export/intssd/mesos-slave/workdir/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff'
> I0405 22:00:50.331244 44865 slave.cpp:6919] Checkpointing TaskInfo to 
> '/export/intssd/mesos-slave/workdir/meta/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff/tasks/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/task.info'
> I0405 22:00:50.331568 44862 docker.cpp:1106] Skipping non-docker container
> I0405 22:00:50.331822 44865 slave.cpp:2118] Queued task 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' for executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.331966 44865 slave.cpp:884] Successfully attached file 
> '/export/intssd/mesos-slave/workdir/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff'
> I0405 22:00:50.332582 44861 containerizer.cpp:993] Starting container 
> f82f5f69-87a3-4586-b4cc-b91d285dcaff for executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.333286 44862 metadata_manager.cpp:168] Looking for image 
> 'docker.company.ru/company-infra/kafka:0.10.2.0-16'
> I0405 22:00:50.333627 44879 registry_puller.cpp:247] Pulling image 
> 'docker.company.ru/company-infra/kafka:0.10.2.0-16' from 
> 'docker-manifest://docker.company.rucompany-infra/kafka?0.10.2.0-16#https' to 
> '/export/intssd/mesos-slave/docker-store/staging/aV2yko'
> E0405 22:00:50.834630 44872 slave.cpp:4642] Container 
> 'f82f5f69-87a3-4586-b4cc-b91d285dcaff' for executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190- failed to start: Failed to parse 
> the image manifest: 

[jira] [Commented] (MESOS-7350) Failed to pull image from Nexus Registry due to signature missing.

2017-04-17 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971947#comment-15971947
 ] 

Gilbert Song commented on MESOS-7350:
-

commit 643dafdec76bb176270fe686ec2400242ed0fe36
Author: Gilbert Song songzihao1...@gmail.com
Date:   Tue Apr 18 07:57:30 2017 +0800

Fixed the image signature check for Nexus Registry.

Currently, the signature field of the docker v2 image manifest is
not used yet. The check of at least one image signature is too
strict because some registry (e.g., Nexus Registry) does not sign
the image manifest. We should release the signature check for now.

Review: https://reviews.apache.org/r/58479/
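
To illustrate what "release the signature check" means in practice, here is a hypothetical validation helper; the Manifest type and field names are made up for the example and are not Mesos's actual spec classes:
{code}
#include <optional>
#include <string>
#include <vector>

// Illustrative stand-in for a parsed Docker v2 (schema 1) image manifest.
struct Manifest
{
  std::string name;
  std::vector<std::string> signatures; // may legitimately be empty (e.g. Nexus)
};

// Returns an error message if validation fails, or std::nullopt on success.
std::optional<std::string> validate(const Manifest& manifest)
{
  if (manifest.name.empty()) {
    return "'name' field must not be empty";
  }

  // Before the fix, a check like the one below rejected unsigned manifests:
  //
  //   if (manifest.signatures.empty()) {
  //     return "'signatures' field size must be at least one";
  //   }
  //
  // Since the signatures are not consumed anyway, an empty field is now
  // accepted, so images served by registries that do not sign manifests
  // (such as Nexus) can be pulled.

  return std::nullopt;
}
{code}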

> Failed to pull image from Nexus Registry due to signature missing.
> --
>
> Key: MESOS-7350
> URL: https://issues.apache.org/jira/browse/MESOS-7350
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Nikolay Ustinov
>Assignee: Gilbert Song
>
> I’m trying to launch a Docker container with the universal containerizer on 
> Mesos 1.2.0, but I’m getting the error “Failed to parse the image manifest: 
> Docker v2 image manifest validation failed: ‘signatures’ field size must be 
> at least one”. If I switch to the Docker containerizer, the app starts 
> normally.
> We are working with a private Docker registry v2 backed by Nexus Repository 
> Manager 3.1.0.
> {code}
> cat /etc/mesos-slave/docker_registry 
> https://docker.company.ru
> cat /etc/mesos-slave/docker_config 
> {
>   "auths": {
>   "docker.company.ru": {
>   "auth": ""
>   }
>   }
> }
> {code}
> Here is the agent's log:
> {code}
> I0405 22:00:49.860234 44856 slave.cpp:4346] Received ping from 
> slave-observer(7)@10.34.1.31:5050
> I0405 22:00:50.327030 44865 slave.cpp:1625] Got assigned task 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' for framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.327785 44865 slave.cpp:1785] Launching task 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' for framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.329324 44865 paths.cpp:547] Trying to chown 
> '/export/intssd/mesos-slave/workdir/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff'
>  to user 'dockdata'
> I0405 22:00:50.329607 44865 slave.cpp:6896] Checkpointing ExecutorInfo to 
> '/export/intssd/mesos-slave/workdir/meta/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/executor.info'
> I0405 22:00:50.330531 44865 slave.cpp:6472] Launching executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190- with resources cpus(*)(allocated: 
> general_marathon_service_role):0.1; mem(*)(allocated: 
> general_marathon_service_role):32 in work directory 
> '/export/intssd/mesos-slave/workdir/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff'
> I0405 22:00:50.331244 44865 slave.cpp:6919] Checkpointing TaskInfo to 
> '/export/intssd/mesos-slave/workdir/meta/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff/tasks/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/task.info'
> I0405 22:00:50.331568 44862 docker.cpp:1106] Skipping non-docker container
> I0405 22:00:50.331822 44865 slave.cpp:2118] Queued task 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' for executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.331966 44865 slave.cpp:884] Successfully attached file 
> '/export/intssd/mesos-slave/workdir/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff'
> I0405 22:00:50.332582 44861 containerizer.cpp:993] Starting container 
> f82f5f69-87a3-4586-b4cc-b91d285dcaff for executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.333286 44862 metadata_manager.cpp:168] Looking for image 
> 'docker.company.ru/company-infra/kafka:0.10.2.0-16'
> I0405 22:00:50.333627 44879 registry_puller.cpp:247] Pulling image 
> 

[jira] [Commented] (MESOS-7350) Failed to pull image from Nexus Registry due to signature missing.

2017-04-17 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971946#comment-15971946
 ] 

Gilbert Song commented on MESOS-7350:
-

[~adam-mesos], ah, it was resolved one hour ago.

> Failed to pull image from Nexus Registry due to signature missing.
> --
>
> Key: MESOS-7350
> URL: https://issues.apache.org/jira/browse/MESOS-7350
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Nikolay Ustinov
>Assignee: Gilbert Song
>
> I’m trying to launch a Docker container with the universal containerizer on 
> Mesos 1.2.0, but I’m getting the error “Failed to parse the image manifest: 
> Docker v2 image manifest validation failed: ‘signatures’ field size must be 
> at least one”. If I switch to the Docker containerizer, the app starts 
> normally.
> We are working with a private Docker registry v2 backed by Nexus Repository 
> Manager 3.1.0.
> {code}
> cat /etc/mesos-slave/docker_registry 
> https://docker.company.ru
> cat /etc/mesos-slave/docker_config 
> {
>   "auths": {
>   "docker.company.ru": {
>   "auth": ""
>   }
>   }
> }
> {code}
> Here is the agent's log:
> {code}
> I0405 22:00:49.860234 44856 slave.cpp:4346] Received ping from 
> slave-observer(7)@10.34.1.31:5050
> I0405 22:00:50.327030 44865 slave.cpp:1625] Got assigned task 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' for framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.327785 44865 slave.cpp:1785] Launching task 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' for framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.329324 44865 paths.cpp:547] Trying to chown 
> '/export/intssd/mesos-slave/workdir/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff'
>  to user 'dockdata'
> I0405 22:00:50.329607 44865 slave.cpp:6896] Checkpointing ExecutorInfo to 
> '/export/intssd/mesos-slave/workdir/meta/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/executor.info'
> I0405 22:00:50.330531 44865 slave.cpp:6472] Launching executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190- with resources cpus(*)(allocated: 
> general_marathon_service_role):0.1; mem(*)(allocated: 
> general_marathon_service_role):32 in work directory 
> '/export/intssd/mesos-slave/workdir/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff'
> I0405 22:00:50.331244 44865 slave.cpp:6919] Checkpointing TaskInfo to 
> '/export/intssd/mesos-slave/workdir/meta/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff/tasks/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/task.info'
> I0405 22:00:50.331568 44862 docker.cpp:1106] Skipping non-docker container
> I0405 22:00:50.331822 44865 slave.cpp:2118] Queued task 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' for executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.331966 44865 slave.cpp:884] Successfully attached file 
> '/export/intssd/mesos-slave/workdir/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff'
> I0405 22:00:50.332582 44861 containerizer.cpp:993] Starting container 
> f82f5f69-87a3-4586-b4cc-b91d285dcaff for executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.333286 44862 metadata_manager.cpp:168] Looking for image 
> 'docker.company.ru/company-infra/kafka:0.10.2.0-16'
> I0405 22:00:50.333627 44879 registry_puller.cpp:247] Pulling image 
> 'docker.company.ru/company-infra/kafka:0.10.2.0-16' from 
> 'docker-manifest://docker.company.rucompany-infra/kafka?0.10.2.0-16#https' to 
> '/export/intssd/mesos-slave/docker-store/staging/aV2yko'
> E0405 22:00:50.834630 44872 slave.cpp:4642] Container 
> 'f82f5f69-87a3-4586-b4cc-b91d285dcaff' for executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190- failed to start: Failed to parse 
> the image manifest: 

[jira] [Commented] (MESOS-5172) Registry puller cannot fetch blobs correctly from http Redirect 3xx urls.

2017-04-17 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971942#comment-15971942
 ] 

Adam B commented on MESOS-5172:
---

[~jieyu] Could you please backport these patches to 1.2.x and 1.1.x if we're 
still targeting them for those patch releases?

> Registry puller cannot fetch blobs correctly from http Redirect 3xx urls.
> -
>
> Key: MESOS-5172
> URL: https://issues.apache.org/jira/browse/MESOS-5172
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Blocker
>  Labels: containerizer, mesosphere
> Fix For: 1.3.0
>
>
> When the registry puller is pulling a private repository from some private 
> registry (e.g., quay.io), errors may occur when fetching blobs, even though 
> fetching the manifest of the repo has already finished correctly. The error 
> message is `Unexpected HTTP response '400 Bad Request' when trying to 
> download the blob`. This may arise from the logic of fetching blobs, or from 
> an incorrect URI format when requesting blobs.
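
One way such a 400 can arise is if the client keeps sending the registry
credentials or the original query string to the pre-signed URL that the
registry redirects to. As a rough illustration only (plain libcurl, which is an
assumption and not necessarily the code path the registry puller uses), a blob
fetch that follows the 3xx should let the redirect URL stand on its own and not
forward auth to the redirected host:

{code}
// Hypothetical libcurl sketch: fetch a blob, following redirects, while
// keeping credentials restricted to the original registry host.
#include <curl/curl.h>
#include <cstdio>

int main() {
  curl_global_init(CURL_GLOBAL_DEFAULT);
  CURL* curl = curl_easy_init();
  if (curl == nullptr) {
    return 1;
  }

  // Assumed example URL; a real puller would use the registry blob URL.
  curl_easy_setopt(curl, CURLOPT_URL,
                   "https://quay.io/v2/example/repo/blobs/sha256:abc123");
  curl_easy_setopt(curl, CURLOPT_USERPWD, "user:token");

  // Follow 3xx redirects (e.g., to a pre-signed S3/CDN URL) ...
  curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
  // ... but do NOT forward the credentials to the redirected host; a
  // pre-signed URL already carries its own auth in its query string.
  curl_easy_setopt(curl, CURLOPT_UNRESTRICTED_AUTH, 0L);

  CURLcode res = curl_easy_perform(curl);
  if (res != CURLE_OK) {
    std::fprintf(stderr, "fetch failed: %s\n", curl_easy_strerror(res));
  }

  curl_easy_cleanup(curl);
  curl_global_cleanup();
  return res == CURLE_OK ? 0 : 1;
}
{code}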



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7210) MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( pid namespace mismatch )

2017-04-17 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971939#comment-15971939
 ] 

Adam B commented on MESOS-7210:
---

[~haosd...@gmail.com], could you please backport this to the 1.2.x and 1.1.x 
branches so we can include it in the next patch releases (1.2.1 and 1.1.2)? 
Hoping to cut those this week.

> MESOS HTTP checks doesn't work when mesos runs with --docker_mesos_image ( 
> pid namespace mismatch )
> ---
>
> Key: MESOS-7210
> URL: https://issues.apache.org/jira/browse/MESOS-7210
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.1.0, 1.1.1, 1.2.0
> Environment: Ubuntu 16.04.02
> Docker version 1.13.1
> mesos 1.1.0, runs from container
> docker containers  spawned by marathon 1.4.1
>Reporter: Wojciech Sielski
>Assignee: Deshi Xiao
>Priority: Critical
> Fix For: 1.3.0
>
>
> When running mesos-slave with option "docker_mesos_image" like:
> {code}
> --master=zk://standalone:2181/mesos  --containerizers=docker,mesos  
> --executor_registration_timeout=5mins  --hostname=standalone  --ip=0.0.0.0  
> --docker_stop_timeout=5secs  --gc_delay=1days  
> --docker_socket=/var/run/docker.sock  --no-systemd_enable_support  
> --work_dir=/tmp/mesos  --docker_mesos_image=panteras/paas-in-a-box:0.4.0
> {code}
> from the container that was started with option "pid: host" like:
> {code}
>   net:host
>   privileged: true
>   pid:host
> {code}
> and example marathon job, that use MESOS_HTTP checks like:
> {code}
> {
>  "id": "python-example-stable",
>  "cmd": "python3 -m http.server 8080",
>  "mem": 16,
>  "cpus": 0.1,
>  "instances": 2,
>  "container": {
>"type": "DOCKER",
>"docker": {
>  "image": "python:alpine",
>  "network": "BRIDGE",
>  "portMappings": [
> { "containerPort": 8080, "hostPort": 0, "protocol": "tcp" }
>  ]
>}
>  },
>  "env": {
>"SERVICE_NAME" : "python"
>  },
>  "healthChecks": [
>{
>  "path": "/",
>  "portIndex": 0,
>  "protocol": "MESOS_HTTP",
>  "gracePeriodSeconds": 30,
>  "intervalSeconds": 10,
>  "timeoutSeconds": 30,
>  "maxConsecutiveFailures": 3
>}
>  ]
> }
> {code}
> I see the errors like:
> {code}
> F0306 07:41:58.84429335 health_checker.cpp:94] Failed to enter the net 
> namespace of task (pid: '13527'): Pid 13527 does not exist
> *** Check failure stack trace: ***
> @ 0x7f51770b0c1d  google::LogMessage::Fail()
> @ 0x7f51770b29d0  google::LogMessage::SendToLog()
> @ 0x7f51770b0803  google::LogMessage::Flush()
> @ 0x7f51770b33f9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f517647ce46  
> _ZNSt17_Function_handlerIFivEZN5mesos8internal6health14cloneWithSetnsERKSt8functionIS0_E6OptionIiERKSt6vectorINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEESaISG_EEEUlvE_E9_M_invokeERKSt9_Any_data
> @ 0x7f517647bf2b  mesos::internal::health::cloneWithSetns()
> @ 0x7f517648374b  std::_Function_handler<>::_M_invoke()
> @ 0x7f5177068167  process::internal::cloneChild()
> @ 0x7f5177065c32  process::subprocess()
> @ 0x7f5176481a9d  
> mesos::internal::health::HealthCheckerProcess::_httpHealthCheck()
> @ 0x7f51764831f7  
> mesos::internal::health::HealthCheckerProcess::_healthCheck()
> @ 0x7f517701f38c  process::ProcessBase::visit()
> @ 0x7f517702c8b3  process::ProcessManager::resume()
> @ 0x7f517702fb77  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f51754ddc80  (unknown)
> @ 0x7f5174cf06ba  start_thread
> @ 0x7f5174a2682d  (unknown)
> I0306 07:41:59.077986 9 health_checker.cpp:199] Ignoring failure as 
> health check still in grace period
> {code}
> It looks like the docker_mesos_image option causes the newly started Mesos 
> executor container to not use the "pid: host" option that the parent container 
> was started with, but to get its own PID namespace instead (so regardless of 
> whether the parent container was started with "pid: host", the health checker 
> will never be able to find the task PID).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7272) Unified containerizer does not support docker registry version < 2.3.

2017-04-17 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971937#comment-15971937
 ] 

Adam B commented on MESOS-7272:
---

Any progress here [~gilbert], [~jieyu]? Looks like it's marked as a Blocker for 
1.3.0/1.2.1/1.1.2, so we'd like to land it this week (I see it's in the current 
sprint).

> Unified containerizer does not support docker registry version < 2.3.
> -
>
> Key: MESOS-7272
> URL: https://issues.apache.org/jira/browse/MESOS-7272
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.2, 1.1.1, 1.2.0
>Reporter: depay
>Assignee: Gilbert Song
>Priority: Blocker
>  Labels: easyfix
>
> in file `src/uri/fetchers/docker.cpp`
> ```
> Option<string> contentType = response.headers.get("Content-Type");  
> if (contentType.isSome() &&  
> !strings::startsWith(  
> contentType.get(),  
> "application/vnd.docker.distribution.manifest.v1")) {  
>   return Failure(  
>   "Unsupported manifest MIME type: " + contentType.get());  
> }  
> ```
> The Docker fetcher checks the contentType strictly, while Docker registries 
> older than version 2.3 return manifests with contentType `application/json`, 
> leading to failures like `E0321 13:27:27.572402 40370 slave.cpp:4650] Container 
> 'xxx' for executor 'xxx' of framework xxx failed to start: Unsupported 
> manifest MIME type: application/json; charset=utf-8`.
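
A sketch of the kind of relaxation being requested (hypothetical helper, not
necessarily the final patch): also accept the generic `application/json` type
that pre-2.3 registries serve for schema 1 manifests:

{code}
// Hypothetical sketch of a more permissive manifest media-type check.
// Registries older than 2.3 serve schema 1 manifests as "application/json".
#include <string>

bool isSupportedManifestType(const std::string& contentType) {
  auto startsWith = [](const std::string& s, const std::string& prefix) {
    return s.compare(0, prefix.size(), prefix) == 0;
  };

  return
    startsWith(contentType, "application/vnd.docker.distribution.manifest.v1") ||
    startsWith(contentType, "application/json");
}
{code}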



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7389) Mesos 1.2.0 crashes with pre-1.0 Mesos agents

2017-04-17 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7389?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971933#comment-15971933
 ] 

Adam B commented on MESOS-7389:
---

[~bmahler] Will you have time this week to fix this for 1.2.1/1.3.0? Who's the 
shepherd?

> Mesos 1.2.0 crashes with pre-1.0 Mesos agents
> -
>
> Key: MESOS-7389
> URL: https://issues.apache.org/jira/browse/MESOS-7389
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
> Environment: Ubuntu 14.04 
>Reporter: Nicholas Studt
>Assignee: Benjamin Mahler
>Priority: Critical
>  Labels: mesosphere
>
> During an upgrade from 1.0.1 to 1.2.0, a single mesos-slave reregistering with 
> the running leader caused the leader to terminate. All 3 of the masters 
> suffered the same failure as the same slave node reregistered against the new 
> leader; this continued across the entire cluster until the offending slave 
> node was removed and fixed. The fix to the slave node was to remove the mesos 
> directory and then start the slave node back up. 
>  F0412 17:24:42.736600  6317 master.cpp:5701] Check failed: 
> frameworks_.contains(task.framework_id())
>  *** Check failure stack trace: ***
>  @ 0x7f59f944f94d  google::LogMessage::Fail()
>  @ 0x7f59f945177d  google::LogMessage::SendToLog()
>  @ 0x7f59f944f53c  google::LogMessage::Flush()
>  @ 0x7f59f9452079  google::LogMessageFatal::~LogMessageFatal()
>  I0412 17:24:42.750300  6316 replica.cpp:693] Replica received learned notice 
> for position 6896 from @0.0.0.0:0 
>  @ 0x7f59f88f2341  mesos::internal::master::Master::_reregisterSlave()
>  @ 0x7f59f88f488f  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERKSt6vectorINS5_8ResourceESaISG_EERKSF_INS5_12ExecutorInfoESaISL_EERKSF_INS5_4TaskESaISQ_EERKSF_INS5_13FrameworkInfoESaISV_EERKSF_INS6_17Archive_FrameworkESaIS10_EERKSsRKSF_INS5_20SlaveInfo_CapabilityESaIS17_EERKNS0_6FutureIbEES9_SC_SI_SN_SS_SX_S12_SsS19_S1D_EEvRKNS0_3PIDIT_EEMS1H_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_ET10_T11_T12_T13_T14_T15_T16_T17_T18_T19_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
>  @ 0x7f59f93c3eb1  process::ProcessManager::resume()
>  @ 0x7f59f93ccd57  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
>  @ 0x7f59f77cfa60  (unknown)
>  @ 0x7f59f6fec184  start_thread
>  @ 0x7f59f6d19bed  (unknown)
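
The abort comes from a hard CHECK on data received from an old agent. Purely as
a sketch (not necessarily the fix that will land), a defensive alternative is
to drop tasks that reference unknown frameworks during re-registration instead
of crashing the master:

{code}
// Sketch only: tolerate tasks referencing unknown frameworks during agent
// re-registration instead of CHECK-failing the master.
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct Task { std::string frameworkId; std::string taskId; };
struct Framework {};

void reregisterTasks(
    const std::vector<Task>& tasks,
    const std::unordered_map<std::string, Framework*>& frameworks) {
  for (const Task& task : tasks) {
    auto it = frameworks.find(task.frameworkId);
    if (it == frameworks.end()) {
      // Previously a CHECK; log and skip instead of aborting the master.
      std::cerr << "Dropping task " << task.taskId
                << " of unknown framework " << task.frameworkId << std::endl;
      continue;
    }
    // ... re-add the task under it->second ...
  }
}
{code}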



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7346) Agent crashes if the task name is too long

2017-04-17 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971928#comment-15971928
 ] 

Adam B commented on MESOS-7346:
---

Looks like [~jieyu] committed the patch yesterday. Can you please update the 
fixVersion/status/shepherd for this ticket appropriately, and backport as 
needed?

> Agent crashes if the task name is too long
> --
>
> Key: MESOS-7346
> URL: https://issues.apache.org/jira/browse/MESOS-7346
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0
>Reporter: Aaron Wood
>Assignee: Aaron Wood
>Priority: Critical
>
> While making a load-testing tool that wrongly generated very long task names, 
> I found that the agent crashes:
> {code}
> I0404 18:59:26.716114  5145 slave.cpp:1701] Launching task 'test 
> application43109915684310991568431099156843109915684310991568431099156843109915694310991569431099156943109915694310991569431099156943109915704310991570431099157043109915704310991570431099157143109915704310991571431099157143109915714310991572431099157243109915714310991571-6023D486-022C-40AC-BC24-42D07EFA8CB8'
>  for framework 85ed4b54-b2f5-4513-9179-b18de7120f9b-0003
> F0404 18:59:26.716377  5145 paths.cpp:508] CHECK_SOME(mkdir): File name too 
> long Failed to create executor directory 
> '/tmp/slave/slaves/85ed4b54-b2f5-4513-9179-b18de7120f9b-S0/frameworks/85ed4b54-b2f5-4513-9179-b18de7120f9b-0003/executors/test
>  
> application43109915684310991568431099156843109915684310991568431099156843109915694310991569431099156943109915694310991569431099156943109915704310991570431099157043109915704310991570431099157143109915704310991571431099157143109915714310991572431099157243109915714310991571-6023D486-022C-40AC-BC24-42D07EFA8CB8/runs/f913fd46-b0a5-439a-a674-8e4a19aa9df3'
> *** Check failure stack trace: ***
> @ 0x7f247f2f7a46  google::LogMessage::Fail()
> @ 0x7f247f2f798a  google::LogMessage::SendToLog()
> @ 0x7f247f2f735c  google::LogMessage::Flush()
> @ 0x7f247f2fa61a  google::LogMessageFatal::~LogMessageFatal()
> @   0x480c42  _CheckFatal::~_CheckFatal()
> @ 0x7f247e5046a8  
> mesos::internal::slave::paths::createExecutorDirectory()
> @ 0x7f247e540cf9  mesos::internal::slave::Framework::launchExecutor()
> @ 0x7f247e51c337  mesos::internal::slave::Slave::_run()
> @ 0x7f247e577af6  
> _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS_6FutureIbEERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSF_INS1_13TaskGroupInfoEES6_S9_SC_SH_SL_EEvRKNS_3PIDIT_EEMSP_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES16_
> @ 0x7f247e5af990  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal5slave5SlaveERKNS0_6FutureIbEERKNS5_13FrameworkInfoERKNS5_12ExecutorInfoERK6OptionINS5_8TaskInfoEERKSJ_INS5_13TaskGroupInfoEESA_SD_SG_SL_SP_EEvRKNS0_3PIDIT_EEMST_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
> @ 0x7f247f284187  std::function<>::operator()()
> @ 0x7f247f26503e  process::ProcessBase::visit()
> @ 0x7f247f26dad0  process::DispatchEvent::visit()
> @ 0x7f247dcbea08  process::ProcessBase::serve()
> @ 0x7f247f260efa  process::ProcessManager::resume()
> @ 0x7f247f25da22  
> _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv
> @ 0x7f247f26d0f2  
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x7f247f26d048  
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv
> @ 0x7f247f26cfd8  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f2479711c80  (unknown)
> @ 0x7f247922d6ba  start_thread
> @ 0x7f2478f6382d  (unknown)
> Aborted (core dumped)
> {code}
> https://reviews.apache.org/r/58317/
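
The limit being hit is the filesystem's per-component NAME_MAX. A sketch of the
kind of up-front validation that would avoid the CHECK (hypothetical helper
names; the actual change is in the review linked above):

{code}
// Hypothetical sketch: reject executor/task IDs whose generated directory
// name would exceed the filesystem's per-component limit instead of letting
// mkdir fail a CHECK later.
#include <limits.h>  // NAME_MAX (255 on typical Linux filesystems like ext4)

#include <string>

bool validateComponentLength(const std::string& component) {
  return component.size() <= NAME_MAX;
}

// Usage sketch: called during task validation, before any directory is created.
//   if (!validateComponentLength(executorId)) {
//     return Error("Task/executor ID is too long to be used as a directory name");
//   }
{code}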



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7374) Running DOCKER images in Mesos Container Runtime without `linux/filesystem` isolation enabled renders host unusable

2017-04-17 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7374:
--
Priority: Critical  (was: Major)

> Running DOCKER images in Mesos Container Runtime without `linux/filesystem` 
> isolation enabled renders host unusable
> ---
>
> Key: MESOS-7374
> URL: https://issues.apache.org/jira/browse/MESOS-7374
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Affects Versions: 1.2.0
>Reporter: Tim Harper
>Priority: Critical
>  Labels: containerizer, mesosphere
>
> If I run the pod below (using Marathon 1.4.2) against a mesos agent that has 
> the flags (also below), then the overlay filesystem replaces the system root 
> mount, effectively rendering the host unusable until reboot.
> flags:
> - {{--containerizers mesos,docker}}
> - {{--image_providers APPC,DOCKER}}
> - {{--isolation cgroups/cpu,cgroups/mem,docker/runtime}}
> pod definition for Marathon:
> {code:java}
> {
>   "id": "/simplepod",
>   "scaling": { "kind": "fixed", "instances": 1 },
>   "containers": [
> {
>   "name": "sleep1",
>   "exec": { "command": { "shell": "sleep 1000" } },
>   "resources": { "cpus": 0.1, "mem": 32 },
>   "image": {
> "id": "alpine",
> "kind": "DOCKER"
>   }
> }
>   ],
>   "networks": [ {"mode": "host"} ]
> }
> {code}
> Mesos should probably check for this and avoid replacing the system root 
> mount point at startup or launch time.
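
A sketch of the startup guard suggested above (the flag names come from this
report; the check itself is hypothetical):

{code}
// Hypothetical startup check: refuse to provision Docker/Appc images with
// the Mesos containerizer unless filesystem isolation is enabled.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

bool validateIsolationFlags(
    const std::vector<std::string>& imageProviders,  // e.g. {"APPC", "DOCKER"}
    const std::vector<std::string>& isolation) {     // e.g. {"cgroups/cpu", ...}
  const bool usesImages = !imageProviders.empty();
  const bool hasFilesystemIsolation =
    std::find(isolation.begin(), isolation.end(), "filesystem/linux") !=
    isolation.end();

  if (usesImages && !hasFilesystemIsolation) {
    std::cerr << "--image_providers requires 'filesystem/linux' in --isolation"
              << std::endl;
    return false;
  }
  return true;
}
{code}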



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7374) Running DOCKER images in Mesos Container Runtime without `linux/filesystem` isolation enabled renders host unusable

2017-04-17 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971923#comment-15971923
 ] 

Adam B commented on MESOS-7374:
---

[~gilbert] Who's going to work on this issue and when? We're hoping to cut 
1.3.0 and 1.2.1 this week, and it'd be great to include this.

> Running DOCKER images in Mesos Container Runtime without `linux/filesystem` 
> isolation enabled renders host unusable
> ---
>
> Key: MESOS-7374
> URL: https://issues.apache.org/jira/browse/MESOS-7374
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Affects Versions: 1.2.0
>Reporter: Tim Harper
>  Labels: containerizer, mesosphere
>
> If I run the pod below (using Marathon 1.4.2) against a mesos agent that has 
> the flags (also below), then the overlay filesystem replaces the system root 
> mount, effectively rendering the host unusable until reboot.
> flags:
> - {{--containerizers mesos,docker}}
> - {{--image_providers APPC,DOCKER}}
> - {{--isolation cgroups/cpu,cgroups/mem,docker/runtime}}
> pod definition for Marathon:
> {code:java}
> {
>   "id": "/simplepod",
>   "scaling": { "kind": "fixed", "instances": 1 },
>   "containers": [
> {
>   "name": "sleep1",
>   "exec": { "command": { "shell": "sleep 1000" } },
>   "resources": { "cpus": 0.1, "mem": 32 },
>   "image": {
> "id": "alpine",
> "kind": "DOCKER"
>   }
> }
>   ],
>   "networks": [ {"mode": "host"} ]
> }
> {code}
> Mesos should probably check for this and avoid replacing the system root 
> mount point at startup or launch time.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7350) Failed to pull image from Nexus Registry due to signature missing.

2017-04-17 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971922#comment-15971922
 ] 

Adam B commented on MESOS-7350:
---

[~gilbert] When do you think this issue can be resolved? Any chance it'll 
actually make it in this week for 1.3.0 or 1.2.1?

> Failed to pull image from Nexus Registry due to signature missing.
> --
>
> Key: MESOS-7350
> URL: https://issues.apache.org/jira/browse/MESOS-7350
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Nikolay Ustinov
>Assignee: Gilbert Song
>
> I’m trying to launch docker container with universal containerizer, mesos 
> 1.2.0. But getting error “Failed to parse the image manifest: Docker v2 image 
> manifest validation failed: ‘signatures’ field size must be at least one”. 
> And if I switch to docker containerizer, app is starting normally. 
> We are working with private docker registry v2 backed by nexus repository 
> manager  3.1.0
> {code}
> cat /etc/mesos-slave/docker_registry 
> https://docker.company.ru
> cat /etc/mesos-slave/docker_config 
> {
>   "auths": {
>   "docker.company.ru": {
>   "auth": ""
>   }
>   }
> }
> {code}
> Here agent's log:
> {code}
> I0405 22:00:49.860234 44856 slave.cpp:4346] Received ping from 
> slave-observer(7)@10.34.1.31:5050
> I0405 22:00:50.327030 44865 slave.cpp:1625] Got assigned task 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' for framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.327785 44865 slave.cpp:1785] Launching task 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' for framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.329324 44865 paths.cpp:547] Trying to chown 
> '/export/intssd/mesos-slave/workdir/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff'
>  to user 'dockdata'
> I0405 22:00:50.329607 44865 slave.cpp:6896] Checkpointing ExecutorInfo to 
> '/export/intssd/mesos-slave/workdir/meta/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/executor.info'
> I0405 22:00:50.330531 44865 slave.cpp:6472] Launching executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190- with resources cpus(*)(allocated: 
> general_marathon_service_role):0.1; mem(*)(allocated: 
> general_marathon_service_role):32 in work directory 
> '/export/intssd/mesos-slave/workdir/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff'
> I0405 22:00:50.331244 44865 slave.cpp:6919] Checkpointing TaskInfo to 
> '/export/intssd/mesos-slave/workdir/meta/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff/tasks/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/task.info'
> I0405 22:00:50.331568 44862 docker.cpp:1106] Skipping non-docker container
> I0405 22:00:50.331822 44865 slave.cpp:2118] Queued task 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' for executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.331966 44865 slave.cpp:884] Successfully attached file 
> '/export/intssd/mesos-slave/workdir/slaves/5ad97c04-d982-49d3-ac4f-53c468993190-S1/frameworks/5ad97c04-d982-49d3-ac4f-53c468993190-/executors/md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14/runs/f82f5f69-87a3-4586-b4cc-b91d285dcaff'
> I0405 22:00:50.332582 44861 containerizer.cpp:993] Starting container 
> f82f5f69-87a3-4586-b4cc-b91d285dcaff for executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 5ad97c04-d982-49d3-ac4f-53c468993190-
> I0405 22:00:50.333286 44862 metadata_manager.cpp:168] Looking for image 
> 'docker.company.ru/company-infra/kafka:0.10.2.0-16'
> I0405 22:00:50.333627 44879 registry_puller.cpp:247] Pulling image 
> 'docker.company.ru/company-infra/kafka:0.10.2.0-16' from 
> 'docker-manifest://docker.company.rucompany-infra/kafka?0.10.2.0-16#https' to 
> '/export/intssd/mesos-slave/docker-store/staging/aV2yko'
> E0405 22:00:50.834630 44872 slave.cpp:4642] Container 
> 'f82f5f69-87a3-4586-b4cc-b91d285dcaff' for executor 
> 'md_kafka_broker.2f58917d-1a32-11e7-ad66-02424dd04a14' of framework 
> 

[jira] [Comment Edited] (MESOS-6405) Benchmark call ingestion path on the Mesos master.

2017-04-17 Thread Adam B (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6405?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971908#comment-15971908
 ] 

Adam B edited comment on MESOS-6405 at 4/18/17 12:47 AM:
-

Patch was discarded over a month ago due to inactivity, so I'm moving this back 
to "Accepted" and removing the 1.2.1 target Version, since it doesn't seem 
urgent enough for a backport, or actually in progress/review for 1.2.x.
Let's land it in master when we can and then consider backporting if necessary.


was (Author: adam-mesos):
Patch was discarded over a month ago due to inactivity, so I'm moving this back 
to "Accepted" and removing the 1.2.1 fixVersion, since it doesn't seem urgent 
enough for a backport, or actually in progress/review for 1.2.x.
Let's land it in master when we can and then consider backporting if necessary.

> Benchmark call ingestion path on the Mesos master.
> --
>
> Key: MESOS-6405
> URL: https://issues.apache.org/jira/browse/MESOS-6405
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, scheduler api
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>Priority: Critical
>  Labels: mesosphere
>
> [~drexin] reported on the user mailing 
> [list|http://mail-archives.apache.org/mod_mbox/mesos-user/201610.mbox/%3C6B42E374-9AB7--A315-A6558753E08B%40apple.com%3E]
>  that there seems to be a significant regression in performance on the call 
> ingestion path on the Mesos master wrt to the scheduler driver (v0 API). 
> We should create a benchmark to first get a sense of the numbers and then go 
> about fixing the performance issues. 



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7346) Agent crashes if the task name is too long

2017-04-17 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7346:
--
Target Version/s: 1.1.2, 1.2.1, 1.3.0  (was: 1.2.1, 1.3.0)

> Agent crashes if the task name is too long
> --
>
> Key: MESOS-7346
> URL: https://issues.apache.org/jira/browse/MESOS-7346
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.0.0, 1.0.1, 1.0.2, 1.1.0, 1.1.1, 1.2.0
>Reporter: Aaron Wood
>Assignee: Aaron Wood
>Priority: Critical
>
> While making a load-testing tool that wrongly generated very long task names, 
> I found that the agent crashes:
> {code}
> I0404 18:59:26.716114  5145 slave.cpp:1701] Launching task 'test 
> application43109915684310991568431099156843109915684310991568431099156843109915694310991569431099156943109915694310991569431099156943109915704310991570431099157043109915704310991570431099157143109915704310991571431099157143109915714310991572431099157243109915714310991571-6023D486-022C-40AC-BC24-42D07EFA8CB8'
>  for framework 85ed4b54-b2f5-4513-9179-b18de7120f9b-0003
> F0404 18:59:26.716377  5145 paths.cpp:508] CHECK_SOME(mkdir): File name too 
> long Failed to create executor directory 
> '/tmp/slave/slaves/85ed4b54-b2f5-4513-9179-b18de7120f9b-S0/frameworks/85ed4b54-b2f5-4513-9179-b18de7120f9b-0003/executors/test
>  
> application43109915684310991568431099156843109915684310991568431099156843109915694310991569431099156943109915694310991569431099156943109915704310991570431099157043109915704310991570431099157143109915704310991571431099157143109915714310991572431099157243109915714310991571-6023D486-022C-40AC-BC24-42D07EFA8CB8/runs/f913fd46-b0a5-439a-a674-8e4a19aa9df3'
> *** Check failure stack trace: ***
> @ 0x7f247f2f7a46  google::LogMessage::Fail()
> @ 0x7f247f2f798a  google::LogMessage::SendToLog()
> @ 0x7f247f2f735c  google::LogMessage::Flush()
> @ 0x7f247f2fa61a  google::LogMessageFatal::~LogMessageFatal()
> @   0x480c42  _CheckFatal::~_CheckFatal()
> @ 0x7f247e5046a8  
> mesos::internal::slave::paths::createExecutorDirectory()
> @ 0x7f247e540cf9  mesos::internal::slave::Framework::launchExecutor()
> @ 0x7f247e51c337  mesos::internal::slave::Slave::_run()
> @ 0x7f247e577af6  
> _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS_6FutureIbEERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSF_INS1_13TaskGroupInfoEES6_S9_SC_SH_SL_EEvRKNS_3PIDIT_EEMSP_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES16_
> @ 0x7f247e5af990  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal5slave5SlaveERKNS0_6FutureIbEERKNS5_13FrameworkInfoERKNS5_12ExecutorInfoERK6OptionINS5_8TaskInfoEERKSJ_INS5_13TaskGroupInfoEESA_SD_SG_SL_SP_EEvRKNS0_3PIDIT_EEMST_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
> @ 0x7f247f284187  std::function<>::operator()()
> @ 0x7f247f26503e  process::ProcessBase::visit()
> @ 0x7f247f26dad0  process::DispatchEvent::visit()
> @ 0x7f247dcbea08  process::ProcessBase::serve()
> @ 0x7f247f260efa  process::ProcessManager::resume()
> @ 0x7f247f25da22  
> _ZZN7process14ProcessManager12init_threadsEvENKUt_clEv
> @ 0x7f247f26d0f2  
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x7f247f26d048  
> _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEclEv
> @ 0x7f247f26cfd8  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f2479711c80  (unknown)
> @ 0x7f247922d6ba  start_thread
> @ 0x7f2478f6382d  (unknown)
> Aborted (core dumped)
> {code}
> https://reviews.apache.org/r/58317/



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7223) Linux filesystem isolator cannot mount host volume /dev/log.

2017-04-17 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7223:
--
Target Version/s: 1.2.1
   Fix Version/s: (was: 1.2.1)

> Linux filesystem isolator cannot mount host volume /dev/log.
> 
>
> Key: MESOS-7223
> URL: https://issues.apache.org/jira/browse/MESOS-7223
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.0.2, 1.1.0, 1.2.0
>Reporter: Haralds Ulmanis
>  Labels: volumes
>
> I'm trying to mount /dev/log.
> ls -l /dev/log
> lrwxrwxrwx 1 root root 28 Mar  9 01:49 /dev/log -> 
> /run/systemd/journal/dev-log
> # ls -l /run/systemd/journal/dev-log
> srw-rw-rw- 1 root root 0 Mar  9 01:49 /run/systemd/journal/dev-log
> I have tried mounting /dev/log and /run/systemd/journal/dev-log; both produce 
> the same errors:
> from stdout:
> Executing pre-exec command 
> '{"arguments":["mesos-containerizer","mount","--help=false","--operation=make-rslave","--path=\/"],"shell":false,"value":"\/usr\/lib\/mesos\/mesos-containerizer"}'
> Executing pre-exec command 
> '{"arguments":["mount","-n","--rbind","\/data\/mesos-agent\/slaves\/9b7ad711-9381-4338-b3c0-dac86253701e-S93\/frameworks\/a872f621-d10f-4021-a886-c5d564df104e-\/executors\/services_dev-2_lb-6.b8202973-04b0-11e7-be02-0a2b9a5c33cf\/runs\/cfb170f0-6c69-4475-9dbe-bb9967e19b42","\/data\/mesos-agent\/provisioner\/containers\/cfb170f0-6c69-4475-9dbe-bb9967e19b42\/backends\/overlay\/rootfses\/890a25e6-cb15-42e3-be9c-0aa3baf889f8\/data\/mesos-agent\/sandbox"],"shell":false,"value":"mount"}'
> Executing pre-exec command 
> '{"arguments":["mount","-n","--rbind","\/run\/systemd\/journal\/dev-log","\/data\/mesos-agent\/provisioner\/containers\/cfb170f0-6c69-4475-9dbe-bb9967e19b42\/backends\/overlay\/rootfses\/890a25e6-cb15-42e3-be9c-0aa3baf889f8\/dev\/log"],"shell":false,"value":"mount"}'
> from stderr:
> mount: mount(2) failed: 
> /data/mesos-agent/provisioner/containers/cfb170f0-6c69-4475-9dbe-bb9967e19b42/backends/overlay/rootfses/890a25e6-cb15-42e3-be9c-0aa3baf889f8/dev/log:
>  Not a directory
> Failed to execute pre-exec command 
> '{"arguments":["mount","-n","--rbind","\/run\/systemd\/journal\/dev-log","\/data\/mesos-agent\/provisioner\/containers\/cfb170f0-6c69-4475-9dbe-bb9967e19b42\/backends\/overlay\/rootfses\/890a25e6-cb15-42e3-be9c-0aa3baf889f8\/dev\/log"],"shell":false,"value":"mount"}'
> I start this particular job from Marathon with the following definition 
> (if I change MESOS to DOCKER, it works): 
> "container": {
> "type": "MESOS",
> "volumes": [
>   {
> "hostPath": "/run/systemd/journal/dev-log",
> "containerPath": "/dev/log",
> "mode": "RW"
>   }
> ],
> "docker": {
>   "image": "",
>   "credential": null,
>   "forcePullImage": true
> }
>   },
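
For context, bind-mounting a socket (or any non-directory) requires the mount
target to exist as a regular file rather than a directory, otherwise mount(2)
fails with exactly this "Not a directory" error. A minimal standalone Linux
sketch, outside of Mesos:

{code}
// Minimal sketch: bind-mount a non-directory host path (here a unix socket)
// into a container rootfs. The target must exist and must be a regular file,
// not a directory, or mount(2) fails with "Not a directory". Run as root.
#include <fcntl.h>
#include <sys/mount.h>
#include <unistd.h>

#include <cstdio>

int bindMountFile(const char* hostPath, const char* targetPath) {
  // Create an empty placeholder file to mount over (like `touch`).
  int fd = ::open(targetPath, O_WRONLY | O_CREAT, 0644);
  if (fd < 0) {
    std::perror("open target");
    return -1;
  }
  ::close(fd);

  if (::mount(hostPath, targetPath, nullptr, MS_BIND | MS_REC, nullptr) != 0) {
    std::perror("mount --rbind");
    return -1;
  }
  return 0;
}

// Usage sketch:
//   bindMountFile("/run/systemd/journal/dev-log", "<rootfs>/dev/log");
{code}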



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7316) Upgrading Mesos to 1.2.0 results in some information missing from the `/flags` endpoint.

2017-04-17 Thread Michael Park (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Park updated MESOS-7316:

Fix Version/s: 1.2.1

> Upgrading Mesos to 1.2.0 results in some information missing from the 
> `/flags` endpoint.
> 
>
> Key: MESOS-7316
> URL: https://issues.apache.org/jira/browse/MESOS-7316
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 1.2.0
>Reporter: Anand Mazumdar
>Assignee: Benjamin Bannier
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.2.1, 1.3.0
>
>
> From OSS Mesos Slack:
> I recently tried upgrading one of our Mesos clusters from 1.1.0 to 1.2.0. 
> After doing this, it looks like the {{zk}} field on the {{/master/flags}} 
> endpoint is no longer present. 
> This looks related to the recent {{Flags}} refactoring that was done which 
> resulted in some flags no longer being populated since they were not part of 
> {{master::Flags}} in {{src/master/flags.hpp}}.
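
A deliberately simplified illustration of the failure mode (this is not the
stout flags code): only flags that are declared on the master's Flags object
can show up in a /flags-style dump, so a member dropped during refactoring
silently disappears from the endpoint:

{code}
// Simplified illustration (not the real stout/Mesos flags code): a /flags
// style endpoint can only serialize flags that were registered, so a flag
// dropped during refactoring vanishes from the output.
#include <iostream>
#include <map>
#include <string>

struct Flags {
  std::map<std::string, std::string> registered;

  void add(const std::string& name, const std::string& value) {
    registered[name] = value;
  }

  // What a /flags-style endpoint would render.
  void dump() const {
    for (const auto& kv : registered) {
      std::cout << kv.first << "=" << kv.second << std::endl;
    }
  }
};

int main() {
  Flags flags;
  flags.add("port", "5050");
  // If "zk" is never add()-ed after the refactoring, it will not appear in
  // dump(), even though the master still consumed the value elsewhere.
  flags.add("zk", "zk://host:2181/mesos");
  flags.dump();
}
{code}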



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7316) Upgrading Mesos to 1.2.0 results in some information missing from the `/flags` endpoint.

2017-04-17 Thread Michael Park (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7316?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971843#comment-15971843
 ] 

Michael Park commented on MESOS-7316:
-

[~adam-mesos]: Backported to 1.2.x.

> Upgrading Mesos to 1.2.0 results in some information missing from the 
> `/flags` endpoint.
> 
>
> Key: MESOS-7316
> URL: https://issues.apache.org/jira/browse/MESOS-7316
> Project: Mesos
>  Issue Type: Bug
>  Components: HTTP API
>Affects Versions: 1.2.0
>Reporter: Anand Mazumdar
>Assignee: Benjamin Bannier
>Priority: Critical
>  Labels: mesosphere
> Fix For: 1.2.1, 1.3.0
>
>
> From OSS Mesos Slack:
> I recently tried upgrading one of our Mesos clusters from 1.1.0 to 1.2.0. 
> After doing this, it looks like the {{zk}} field on the {{/master/flags}} 
> endpoint is no longer present. 
> This looks related to the recent {{Flags}} refactoring that was done which 
> resulted in some flags no longer being populated since they were not part of 
> {{master::Flags}} in {{src/master/flags.hpp}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7376) Long registry updates when the number of agents is high

2017-04-17 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971824#comment-15971824
 ] 

Benjamin Mahler commented on MESOS-7376:


Yes, I will shepherd, thanks for taking this on!

> Long registry updates when the number of agents is high
> ---
>
> Key: MESOS-7376
> URL: https://issues.apache.org/jira/browse/MESOS-7376
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.3.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Critical
>
> During scale testing we discovered that as the number of registered agents 
> grows the time it takes to update the registry grows to unacceptable values 
> very fast. At some point it starts exceeding {{registry_store_timeout}} which 
> doesn't fire.
> With 55k agents we saw this ({{registry_store_timeout=20secs}}):
> {noformat}
> I0331 17:11:21.227442 36472 registrar.cpp:473] Applied 69 operations in 
> 3.138843387secs; attempting to update the registry
> I0331 17:11:24.441409 36464 log.cpp:529] LogStorage.set: acquired the lock in 
> 74461ns
> I0331 17:11:24.441541 36464 log.cpp:543] LogStorage.set: started in 51770ns
> I0331 17:11:26.869323 36462 log.cpp:628] LogStorage.set: wrote append at 
> position=6420881 in 2.41043644secs
> I0331 17:11:26.869454 36462 state.hpp:179] State.store: storage.set has 
> finished in 2.428189561secs (b=1)
> I0331 17:11:56.199453 36469 registrar.cpp:518] Successfully updated the 
> registry in 34.971944192secs
> {noformat}
> This is caused by repeated {{Registry}} copying which involves copying a big 
> object graph that takes roughly 0.4 sec (with 55k agents).
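
The cost pattern is easy to reproduce in isolation: deep-copying a large object
graph on every store dominates, while moving or sharing it is cheap. A
self-contained timing sketch (a stand-in struct, not the actual Registry
protobuf or registrar code):

{code}
// Self-contained sketch: cost of deep-copying a large object graph (stand-in
// for the Registry with ~55k agents) versus moving it.
#include <chrono>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

struct AgentInfo {
  std::string id;
  std::string hostname;
  std::vector<std::string> resources;
};

struct Registry {
  std::vector<AgentInfo> agents;
};

int main() {
  Registry registry;
  for (int i = 0; i < 55000; i++) {
    registry.agents.push_back(
        {"agent-" + std::to_string(i), "host-" + std::to_string(i),
         {"cpus:32", "mem:65536", "disk:1000000"}});
  }

  auto t0 = std::chrono::steady_clock::now();
  Registry copy = registry;                    // deep copy: what repeated
  auto t1 = std::chrono::steady_clock::now();  // copying in the update path does
  Registry moved = std::move(copy);            // move: just pointer swaps
  auto t2 = std::chrono::steady_clock::now();

  using ms = std::chrono::duration<double, std::milli>;
  std::cout << "copy: " << ms(t1 - t0).count() << " ms, "
            << "move: " << ms(t2 - t1).count() << " ms" << std::endl;
  return moved.agents.size() == 55000 ? 0 : 1;
}
{code}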



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-5172) Registry puller cannot fetch blobs correctly from http Redirect 3xx urls.

2017-04-17 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5172:
--
Target Version/s: 1.1.2, 1.2.1, 1.3.0  (was: 1.2.1, 1.3.0)

> Registry puller cannot fetch blobs correctly from http Redirect 3xx urls.
> -
>
> Key: MESOS-5172
> URL: https://issues.apache.org/jira/browse/MESOS-5172
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>Priority: Blocker
>  Labels: containerizer, mesosphere
> Fix For: 1.3.0
>
>
> When the registry puller is pulling a private repository from some private 
> registry (e.g., quay.io), errors may occur when fetching blobs, even though 
> fetching the manifest of the repo has already finished correctly. The error 
> message is `Unexpected HTTP response '400 Bad Request' when trying to 
> download the blob`. This may arise from the logic of fetching blobs, or from 
> an incorrect URI format when requesting blobs.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-6975) Prevent pre-1.0 agents from registering with 1.3+ master.

2017-04-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6975?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-6975:
---
Summary: Prevent pre-1.0 agents from registering with 1.3+ master.  (was: 
Prevent old Mesos agents from registering)

> Prevent pre-1.0 agents from registering with 1.3+ master.
> -
>
> Key: MESOS-6975
> URL: https://issues.apache.org/jira/browse/MESOS-6975
> Project: Mesos
>  Issue Type: Epic
>  Components: master
>Reporter: Neil Conway
>Assignee: Neil Conway
>
> https://www.mail-archive.com/dev@mesos.apache.org/msg37194.html



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7387) ZK master contender and detector don't respect zk_session_timeout option

2017-04-17 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971814#comment-15971814
 ] 

Benjamin Mahler commented on MESOS-7387:


Looks like Vinod is shepherding, thanks Vinod.

> ZK master contender and detector don't respect zk_session_timeout option
> 
>
> Key: MESOS-7387
> URL: https://issues.apache.org/jira/browse/MESOS-7387
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.3.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Minor
>
> {{ZooKeeperMasterContender}} and {{ZooKeeperMasterDetector}} are using 
> hardcoded ZK session timeouts ({{MASTER_CONTENDER_ZK_SESSION_TIMEOUT}} and 
> {{MASTER_DETECTOR_ZK_SESSION_TIMEOUT}}) and do not respect 
> {{--zk_session_timeout}} master option. This is unexpected and doesn't play 
> well with ZK updates that take longer than 10 secs.
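
A sketch of the intended shape of the change (illustrative names and types
only, not the actual contender/detector signatures): take the session timeout
as a constructor parameter that defaults to the current constant and can be
overridden from {{--zk_session_timeout}}:

{code}
// Illustrative sketch only: make the ZK session timeout configurable instead
// of hardcoding MASTER_CONTENDER_ZK_SESSION_TIMEOUT.
#include <chrono>

using Duration = std::chrono::seconds;

constexpr Duration DEFAULT_ZK_SESSION_TIMEOUT{10};

class ZooKeeperMasterContender {
public:
  explicit ZooKeeperMasterContender(
      Duration sessionTimeout = DEFAULT_ZK_SESSION_TIMEOUT)
    : sessionTimeout_(sessionTimeout) {}

  Duration sessionTimeout() const { return sessionTimeout_; }

private:
  Duration sessionTimeout_;  // passed through to the ZK client session.
};

// Hypothetical wiring from flags:
//   ZooKeeperMasterContender contender(Duration(flags.zk_session_timeout));
{code}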



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7265) Containerizer startup may cause sensitive data to leak into sandbox logs.

2017-04-17 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-7265:
--
Target Version/s: 1.1.2, 1.2.1, 1.0.4  (was: 1.2.1)

> Containerizer startup may cause sensitive data to leak into sandbox logs.
> -
>
> Key: MESOS-7265
> URL: https://issues.apache.org/jira/browse/MESOS-7265
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Affects Versions: 1.2.0
>Reporter: Till Toenshoff
>Assignee: Till Toenshoff
>  Labels: mesosphere
> Fix For: 1.2.1, 1.3.0
>
>
> The task sandbox log shows the full containerizer launch invocation 
> with all of its flags.
> This is not safe, given that we may not want to leak sensitive data 
> into the sandbox logs.
> Example:
> {noformat}
> Received SUBSCRIBED event
> Subscribed executor on lobomacpro2.fritz.box
> Received LAUNCH event
> Starting task test
> /Users/till/Development/mesos-private/build/src/mesos-containerizer launch 
> --help="false" 
> --launch_info="{"command":{"environment":{"variables":[{"name":"key1","type":"VALUE","value":"value1"}]},"shell":true,"value":"sleep
>  
> 1000"},"environment":{"variables":[{"name":"BIN_SH","type":"VALUE","value":"xpg4"},{"name":"DUALCASE","type":"VALUE","value":"1"},{"name":"DYLD_LIBRARY_PATH","type":"VALUE","value":"\/Users\/till\/Development\/mesos-private\/build\/src\/.libs"},{"name":"LIBPROCESS_PORT","type":"VALUE","value":"0"},{"name":"MESOS_AGENT_ENDPOINT","type":"VALUE","value":"192.168.178.20:5051"},{"name":"MESOS_CHECKPOINT","type":"VALUE","value":"0"},{"name":"MESOS_DIRECTORY","type":"VALUE","value":"\/tmp\/mesos\/slaves\/816619b6-f5ce-42d6-ad6b-2ef2001adc0a-S0\/frameworks\/4c8a82d4-8a5b-47f5-a660-5fef15da71a5-\/executors\/test\/runs\/b4bd0251-b42a-4ab3-9f02-60ede75bf3b1"},{"name":"MESOS_EXECUTOR_ID","type":"VALUE","value":"test"},{"name":"MESOS_EXECUTOR_SHUTDOWN_GRACE_PERIOD","type":"VALUE","value":"5secs"},{"name":"MESOS_FRAMEWORK_ID","type":"VALUE","value":"4c8a82d4-8a5b-47f5-a660-5fef15da71a5-"},{"name":"MESOS_HTTP_COMMAND_EXECUTOR","type":"VALUE","value":"0"},{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/tmp\/mesos\/slaves\/816619b6-f5ce-42d6-ad6b-2ef2001adc0a-S0\/frameworks\/4c8a82d4-8a5b-47f5-a660-5fef15da71a5-\/executors\/test\/runs\/b4bd0251-b42a-4ab3-9f02-60ede75bf3b1"},{"name":"MESOS_SLAVE_ID","type":"VALUE","value":"816619b6-f5ce-42d6-ad6b-2ef2001adc0a-S0"},{"name":"MESOS_SLAVE_PID","type":"VALUE","value":"slave(1)@192.168.178.20:5051"},{"name":"PATH","type":"VALUE","value":"\/usr\/local\/sbin:\/usr\/local\/bin:\/usr\/sbin:\/usr\/bin:\/sbin:\/bin"},{"name":"PWD","type":"VALUE","value":"\/private\/tmp\/mesos\/slaves\/816619b6-f5ce-42d6-ad6b-2ef2001adc0a-S0\/frameworks\/4c8a82d4-8a5b-47f5-a660-5fef15da71a5-\/executors\/test\/runs\/b4bd0251-b42a-4ab3-9f02-60ede75bf3b1"},{"name":"SHLVL","type":"VALUE","value":"0"},{"name":"__CF_USER_TEXT_ENCODING","type":"VALUE","value":"0x1F5:0x0:0x0"},{"name":"key1","type":"VALUE","value":"value1"},{"name":"key1","type":"VALUE","value":"value1"}]}}"
> Forked command at 16329
> {noformat}
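
A minimal sketch of the kind of redaction that keeps the launch line useful
without echoing secrets (hypothetical helper; the real fix may instead stop
logging the flags altogether):

{code}
// Hypothetical sketch: render environment variables for logging with their
// values masked, so the containerizer launch line does not leak secrets.
#include <string>
#include <utility>
#include <vector>

std::string redactEnvironment(
    const std::vector<std::pair<std::string, std::string>>& env) {
  std::string out;
  for (const auto& variable : env) {
    if (!out.empty()) {
      out += ", ";
    }
    out += variable.first + "=<redacted>";  // never print variable.second
  }
  return out;
}

// e.g. LOG(INFO) << "Launching with environment: " << redactEnvironment(env);
{code}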



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7308) Race condition in `updateAllocation()` on DESTROY of a shared volume.

2017-04-17 Thread Anindya Sinha (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anindya Sinha updated MESOS-7308:
-
Shepherd: Yan Xu

> Race condition in `updateAllocation()` on DESTROY of a shared volume.
> -
>
> Key: MESOS-7308
> URL: https://issues.apache.org/jira/browse/MESOS-7308
> Project: Mesos
>  Issue Type: Bug
>  Components: general
>Reporter: Anindya Sinha
>Assignee: Anindya Sinha
>  Labels: persistent-volumes
>
> When a {{DESTROY}} (for a shared volume) is processed in the master actor, we 
> rescind the pending offers in which the volume to be destroyed has already been 
> offered. Before the allocator executes the {{updateAllocation()}} API, offers with the 
> same shared volume can be sent to frameworks since the destroyed shared 
> volume is not removed from {{slaves.total}} till {{updateAllocation()}} 
> completes. As a result, the following check can fail:
> {code}
>   CHECK_EQ(
>   frameworkAllocation.flatten().createStrippedScalarQuantity(),
>   updatedFrameworkAllocation.flatten().createStrippedScalarQuantity());
> {code}
> We need to address this condition by not failing the {{CHECK_EQ}}, and also 
> ensuring that the master's state is restored to honor the {{DESTROY}} of the 
> shared volume.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7392) Obfuscate authentication information logged by the fetcher

2017-04-17 Thread Vishnu Mohan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971749#comment-15971749
 ] 

Vishnu Mohan commented on MESOS-7392:
-

{code}
Fetched 
'https://username:s00pers3cretpassw...@repo.sfiqautomation.com/artifactory/libs-release-local/com/salesforceiq/graph-spark_2.11/0.0.7/graph-spark-fatjar.jar'
 to 
'/var/lib/mesos/slave/slaves/a5534cb6-89db-4a0a-af48-a1a8a9efa964-S8/frameworks/a5534cb6-89db-4a0a-af48-a1a8a9efa964-0007/executors/driver-20170417222104-0002/runs/028c75e8-647e-4cd6-9dd6-6e834e0fcebc/graph-spark-fatjar.jar'
{code}
Ref: 
https://dcos-community.slack.com/archives/C10DCMHK4/p1492467766855542?thread_ts=1492196251.988127=C10DCMHK4

> Obfuscate authentication information logged by the fetcher 
> ---
>
> Key: MESOS-7392
> URL: https://issues.apache.org/jira/browse/MESOS-7392
> Project: Mesos
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.0.3, 1.1.1, 1.2.0
>Reporter: Vishnu Mohan
>
> As reported by Joseph Stevens on DC/OS Community Slack: 
> https://dcos-community.slack.com/archives/C10DCMHK4/p1492126723695465
> {code}
> So I've noticed that the Mesos Fetcher prints the URI it's using in plain 
> text to the stderr logs. This is a serious problem since if you're using 
> something like the mesos spark framework, it uses mesos fetcher under the 
> hood, and the only way to fetch authenticated resources is to pass the auth 
> as part of the URI. This means every time we start a job we're printing a 
> username and password into the task sandbox and consequently into anything 
> that picks up those logs from the agents. Could you guys change that so the 
> password is obfuscated on print when a URI has credentials inside it?
> {code}
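
A sketch of masking the userinfo portion of a URI before logging it
(hypothetical helper, not the fetcher's actual code):

{code}
// Hypothetical sketch: mask "user:password@" in a URI before it is logged.
#include <string>

std::string obfuscateUri(const std::string& uri) {
  const std::string schemeSep = "://";
  const size_t scheme = uri.find(schemeSep);
  if (scheme == std::string::npos) {
    return uri;
  }

  const size_t authorityStart = scheme + schemeSep.size();
  const size_t authorityEnd = uri.find('/', authorityStart);
  const size_t at = uri.find('@', authorityStart);

  // Only rewrite if the '@' sits inside the authority component.
  if (at == std::string::npos ||
      (authorityEnd != std::string::npos && at > authorityEnd)) {
    return uri;
  }

  return uri.substr(0, authorityStart) + "<redacted>" + uri.substr(at);
}

// obfuscateUri("https://user:secret@repo.example.com/a.jar")
//   -> "https://<redacted>@repo.example.com/a.jar"
{code}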



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-2537) AC_ARG_ENABLED checks are broken

2017-04-17 Thread Kapil Arya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Arya updated MESOS-2537:
--
Fix Version/s: 1.0.4

> AC_ARG_ENABLED checks are broken
> 
>
> Key: MESOS-2537
> URL: https://issues.apache.org/jira/browse/MESOS-2537
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.0.3, 1.1.1, 1.1.2
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
> Fix For: 1.1.2, 1.2.0, 1.0.4
>
>
> In a number of places, the Mesos configure script passes "$foo=yes" to the 
> 2nd argument of {{AC_ARG_ENABLED}}. However, the 2nd argument is invoked when 
> the option is provided in any form, not just when the {{\--enable-foo}} form 
> is used. One result of this is that {{\--disable-optimize}} doesn't work.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-2537) AC_ARG_ENABLED checks are broken

2017-04-17 Thread Kapil Arya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Arya updated MESOS-2537:
--
Affects Version/s: 1.0.4

> AC_ARG_ENABLED checks are broken
> 
>
> Key: MESOS-2537
> URL: https://issues.apache.org/jira/browse/MESOS-2537
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.0.3, 1.1.1, 1.1.2
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
> Fix For: 1.1.2, 1.2.0
>
>
> In a number of places, the Mesos configure script passes "$foo=yes" to the 
> 2nd argument of {{AC_ARG_ENABLED}}. However, the 2nd argument is invoked when 
> the option is provided in any form, not just when the {{\--enable-foo}} form 
> is used. One result of this is that {{\--disable-optimize}} doesn't work.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-2537) AC_ARG_ENABLED checks are broken

2017-04-17 Thread Kapil Arya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Arya updated MESOS-2537:
--
Affects Version/s: (was: 1.0.4)

> AC_ARG_ENABLED checks are broken
> 
>
> Key: MESOS-2537
> URL: https://issues.apache.org/jira/browse/MESOS-2537
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.0.3, 1.1.1, 1.1.2
>Reporter: James Peach
>Assignee: James Peach
>Priority: Minor
> Fix For: 1.1.2, 1.2.0
>
>
> In a number of places, the Mesos configure script passes "$foo=yes" to the 
> 2nd argument of {{AC_ARG_ENABLED}}. However, the 2nd argument is invoked when 
> the option is provided in any form, not just when the {{\--enable-foo}} form 
> is used. One result of this is that {{\--disable-optimize}} doesn't work.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7280) Unified containerizer provisions docker image error with COPY backend

2017-04-17 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7280?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao updated MESOS-7280:
---
Priority: Critical  (was: Major)

> Unified containerizer provisions docker image error with COPY backend
> -
>
> Key: MESOS-7280
> URL: https://issues.apache.org/jira/browse/MESOS-7280
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.2, 1.2.0
> Environment: CentOS 7.2,ext4, COPY
>Reporter: depay
>Assignee: Chun-Hung Hsiao
>Priority: Critical
>  Labels: copy-backend
>
> The error occurs with some specific Docker images when using the COPY backend, 
> on both 1.0.2 and 1.2.0. It works well with the OVERLAY backend on 1.2.0.
> {quote}
> I0321 09:36:07.308830 27613 paths.cpp:528] Trying to chown 
> '/data/mesos/slaves/55f6df5e-2812-40a0-baf5-ce96f20677d3-S102/frameworks/20151223-150303-2677017098-5050-30032-/executors/ct:Transcoding_Test_114489497_1490060156172:3/runs/7e518538-7b56-4b14-a3c9-bee43c669bd7'
>  to user 'root'
> I0321 09:36:07.319628 27613 slave.cpp:5703] Launching executor 
> ct:Transcoding_Test_114489497_1490060156172:3 of framework 
> 20151223-150303-2677017098-5050-30032- with resources cpus(*):0.1; 
> mem(*):32 in work directory 
> '/data/mesos/slaves/55f6df5e-2812-40a0-baf5-ce96f20677d3-S102/frameworks/20151223-150303-2677017098-5050-30032-/executors/ct:Transcoding_Test_114489497_1490060156172:3/runs/7e518538-7b56-4b14-a3c9-bee43c669bd7'
> I0321 09:36:07.321436 27615 containerizer.cpp:781] Starting container 
> '7e518538-7b56-4b14-a3c9-bee43c669bd7' for executor 
> 'ct:Transcoding_Test_114489497_1490060156172:3' of framework 
> '20151223-150303-2677017098-5050-30032-'
> I0321 09:36:37.902195 27600 provisioner.cpp:294] Provisioning image rootfs 
> '/data/mesos/provisioner/containers/7e518538-7b56-4b14-a3c9-bee43c669bd7/backends/copy/rootfses/8d2f7fe8-71ff-4317-a33c-a436241a93d9'
>  for container 7e518538-7b56-4b14-a3c9-bee43c669bd7
> *E0321 09:36:58.707718 27606 slave.cpp:4000] Container 
> '7e518538-7b56-4b14-a3c9-bee43c669bd7' for executor 
> 'ct:Transcoding_Test_114489497_1490060156172:3' of framework 
> 20151223-150303-2677017098-5050-30032- failed to start: Collect failed: 
> Failed to copy layer: cp: cannot create regular file 
> ‘/data/mesos/provisioner/containers/7e518538-7b56-4b14-a3c9-bee43c669bd7/backends/copy/rootfses/8d2f7fe8-71ff-4317-a33c-a436241a93d9/usr/bin/python’:
>  Text file busy*
> I0321 09:36:58.707991 27608 containerizer.cpp:1622] Destroying container 
> '7e518538-7b56-4b14-a3c9-bee43c669bd7'
> I0321 09:36:58.708468 27607 provisioner.cpp:434] Destroying container rootfs 
> at 
> '/data/mesos/provisioner/containers/7e518538-7b56-4b14-a3c9-bee43c669bd7/backends/copy/rootfses/8d2f7fe8-71ff-4317-a33c-a436241a93d9'
>  for container 7e518538-7b56-4b14-a3c9-bee43c669bd7
> {quote}
> The Docker image is a private one, so I will have to try to reproduce this bug 
> with a sample Dockerfile if possible.
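
For context, "Text file busy" (ETXTBSY) is the error Linux returns when
something tries to open a binary for writing while that binary is currently
being executed. A tiny standalone illustration (assumes a Linux host where the
target binary is currently running, and root privileges so permissions don't
get in the way first):

{code}
// Standalone illustration of ETXTBSY ("Text file busy"): writing to a binary
// that some process is currently executing fails with this errno.
#include <fcntl.h>
#include <unistd.h>

#include <cerrno>
#include <cstdio>
#include <cstring>

int main() {
  // Assumes some process on the host is currently running /usr/bin/python.
  int fd = ::open("/usr/bin/python", O_WRONLY);
  if (fd < 0) {
    // Prints "Text file busy" (errno 26) when the binary is being executed.
    std::printf("open failed: %s (errno=%d)\n", std::strerror(errno), errno);
  } else {
    ::close(fd);
  }
  return 0;
}
{code}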



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7376) Long registry updates when the number of agents is high

2017-04-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7376?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7376:
---
Shepherd: Benjamin Mahler

> Long registry updates when the number of agents is high
> ---
>
> Key: MESOS-7376
> URL: https://issues.apache.org/jira/browse/MESOS-7376
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Affects Versions: 1.3.0
>Reporter: Ilya Pronin
>Assignee: Ilya Pronin
>Priority: Critical
>
> During scale testing we discovered that as the number of registered agents 
> grows the time it takes to update the registry grows to unacceptable values 
> very fast. At some point it starts exceeding {{registry_store_timeout}} which 
> doesn't fire.
> With 55k agents we saw this ({{registry_store_timeout=20secs}}):
> {noformat}
> I0331 17:11:21.227442 36472 registrar.cpp:473] Applied 69 operations in 
> 3.138843387secs; attempting to update the registry
> I0331 17:11:24.441409 36464 log.cpp:529] LogStorage.set: acquired the lock in 
> 74461ns
> I0331 17:11:24.441541 36464 log.cpp:543] LogStorage.set: started in 51770ns
> I0331 17:11:26.869323 36462 log.cpp:628] LogStorage.set: wrote append at 
> position=6420881 in 2.41043644secs
> I0331 17:11:26.869454 36462 state.hpp:179] State.store: storage.set has 
> finished in 2.428189561secs (b=1)
> I0331 17:11:56.199453 36469 registrar.cpp:518] Successfully updated the 
> registry in 34.971944192secs
> {noformat}
> This is caused by repeated {{Registry}} copying which involves copying a big 
> object graph that takes roughly 0.4 sec (with 55k agents).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-5417) define WSTRINGIFY behaviour on Windows

2017-04-17 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971610#comment-15971610
 ] 

Joseph Wu commented on MESOS-5417:
--

The above makes {{WSTRINGIFY}} a noop, as opposed to having {{WSTRINGIFY}} 
actually return something meaningful.  So there is more to do.

> define WSTRINGIFY behaviour on Windows
> --
>
> Key: MESOS-5417
> URL: https://issues.apache.org/jira/browse/MESOS-5417
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Daniel Pravat
>Assignee: Li Li
>Priority: Minor
>  Labels: windows
>
> Identify the proper behaviour of WSTRINGIFY to improve the logging.  



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7395) Benchmark performance of hierarchical roles

2017-04-17 Thread Neil Conway (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7395?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Neil Conway updated MESOS-7395:
---
Shepherd: Neil Conway

> Benchmark performance of hierarchical roles
> ---
>
> Key: MESOS-7395
> URL: https://issues.apache.org/jira/browse/MESOS-7395
> Project: Mesos
>  Issue Type: Task
>Reporter: Neil Conway
>Assignee: Jay Guo
>  Labels: mesosphere
>
> Write a unit test/benchmark to measure the performance of the 
> sorter/allocator for hierarchical roles.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7395) Benchmark performance of hierarchical roles

2017-04-17 Thread Neil Conway (JIRA)
Neil Conway created MESOS-7395:
--

 Summary: Benchmark performance of hierarchical roles
 Key: MESOS-7395
 URL: https://issues.apache.org/jira/browse/MESOS-7395
 Project: Mesos
  Issue Type: Task
Reporter: Neil Conway
Assignee: Jay Guo


Write a unit test/benchmark to measure the performance of the sorter/allocator 
for hierarchical roles.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7078) Benchmarks to validate perf impact of hierarchical sorting

2017-04-17 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971479#comment-15971479
 ] 

Neil Conway commented on MESOS-7078:


[~guoger] -- right, we expect that the performance of the initial 
implementation of h-roles will definitely be worse than for a flat list of 
roles. We can look at improving this down the road, but creating a benchmark is 
probably a good first step. I created MESOS-7395 and assigned it to you -- 
thank you!
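
A self-contained sketch of the shape such a benchmark could take (this is not
the Mesos sorter or allocator; it only times an ordering pass over a deep,
synthetic role hierarchy):

{code}
// Self-contained sketch of a hierarchical-role style benchmark: build many
// nested role names ("a/b/c") and time ordering them by a share value.
#include <algorithm>
#include <chrono>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

int main() {
  const int kRolesPerLevel = 50;
  std::vector<std::pair<double, std::string>> roles;  // (share, role)

  for (int a = 0; a < kRolesPerLevel; a++) {
    for (int b = 0; b < kRolesPerLevel; b++) {
      for (int c = 0; c < kRolesPerLevel; c++) {
        std::string role = "eng" + std::to_string(a) +
                           "/team" + std::to_string(b) +
                           "/job" + std::to_string(c);
        roles.emplace_back((a * 31 + b * 7 + c) % 997 / 997.0, std::move(role));
      }
    }
  }

  auto start = std::chrono::steady_clock::now();
  std::sort(roles.begin(), roles.end());  // stand-in for sorter ordering
  auto end = std::chrono::steady_clock::now();

  std::cout << "sorted " << roles.size() << " roles in "
            << std::chrono::duration<double, std::milli>(end - start).count()
            << " ms" << std::endl;
  return 0;
}
{code}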

> Benchmarks to validate perf impact of hierarchical sorting
> --
>
> Key: MESOS-7078
> URL: https://issues.apache.org/jira/browse/MESOS-7078
> Project: Mesos
>  Issue Type: Task
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
>
> Depending on how deeply we need to change the sorter/allocator, we should 
> ensure we take the time to run the existing benchmarks (and perhaps write new 
> benchmarks) to ensure we don't regress performance for existing 
> sorter/allocator use cases.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Assigned] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend

2017-04-17 Thread Chun-Hung Hsiao (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chun-Hung Hsiao reassigned MESOS-6004:
--

Assignee: Chun-Hung Hsiao

>  Tasks fail when provisioning multiple containers with large docker images 
> using copy backend
> -
>
> Key: MESOS-6004
> URL: https://issues.apache.org/jira/browse/MESOS-6004
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.28.2, 1.0.0
> Environment: h4. Agent Platform
> - Ubuntu 16.04
> - AWS g2.x2large instance
> - Nvidia support enabled
> h4. Agent Configuration
> {noformat}
> --containerizers=mesos,docker
> --docker_config=
> --docker_store_dir=/mnt/mesos/store/docker
> --executor_registration_timeout=3mins
> --hostname=
> --image_providers=docker
> --image_provisioner_backend=copy
> --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia
> --switch_user=false
> --work_dir=/mnt/mesos
> {noformat}
> h4. Framework
> - custom framework written in python
> - using unified containerizer with docker images
> h4. Test Setup
> * 1 master
> * 1 agent
> * 5 tasks scheduled at the same time:
> ** resources: cpus: 0.1, mem: 128
> ** command: `echo test`
> ** docker image: custom docker image, based on nvidia/cuda ~5gb
> ** the same docker image was for all tasks, already pulled.
>Reporter: Michael Thomas
>Assignee: Chun-Hung Hsiao
>  Labels: containerizer, docker, performance
>
> When scheduling more than one task on the same agent, all tasks fail as 
> containers seem to be destroyed during provisioning.
> Specifically, the errors on the agent logs are:
> {noformat}
>  E0808 15:53:09.691315 30996 slave.cpp:3976] Container 
> 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework 
> c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being 
> destroyed during provisioning
> {noformat}
> and 
> {noformat}
> I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of 
> framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not 
> register within 3mins
> {noformat}
> As the default provisioning method {{copy}} is being used, I assume this is 
> due to the provisioning of multiple containers taking too long, so the agent 
> will not wait. For large images, this method is simply not performant.
> The issue did not occur when only one task was scheduled.
> Increasing the {{executor_registration_timeout}} parameter seemed to help a 
> bit, as it allowed scheduling at least 2 tasks at the same time, but it still 
> fails with more (5 in this case).
> h4. Complete logs
> (with GLOG_v=1)
> {noformat}
> Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.800375  
> 3738 slave.cpp:198] Agent started on 1)@172.31.23.17:5051
> Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.800403  
> 3738 slave.cpp:199] Flags at startup: 
> --appc_simple_discovery_uri_prefix="http://; 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos,docker" --default_role="*" 
> --disk_watch_interval="1mins" --docker="docker" --docker_config="XXX" 
> --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; 
> --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" 
> --docker_stop_timeout="0ns" --docker_store_dir="/mnt/t" --docker_volume_checkp
> Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: 
> oint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_registration_timeout="1mins" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" 
> --hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_command_executor="false" --image_providers="docker" 
> --image_provisioner_backend="copy" --initialize_driver_logging="true" 
> --isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia" 
> --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" 
> --logbufsecs="0" --logging_level="INFO" 
> --master="zk://172.31.19.240:2181/mesos" 
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
> --perf_interval="1mins" --port="5051" 
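
For reference, a hedged sketch of the mitigation discussed in the report above: 
raising {{--executor_registration_timeout}} well past the 3mins used there. All 
values below are illustrative only, not a recommendation from the reporter; 
switching away from the copy provisioner backend, as discussed under MESOS-7280 
later in this digest, is another option.

{noformat}
# Illustrative flag values only.
mesos-agent \
  --containerizers=mesos,docker \
  --image_providers=docker \
  --image_provisioner_backend=copy \
  --executor_registration_timeout=10mins \
  --work_dir=/mnt/mesos
{noformat}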

[jira] [Updated] (MESOS-7394) libprocess test failures double-free a stack-allocated Process.

2017-04-17 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-7394:
---
Summary: libprocess test failures double-free a stack-allocated Process.  
(was: libprocess test failures double-free stack-allocated processes)

> libprocess test failures double-free a stack-allocated Process.
> ---
>
> Key: MESOS-7394
> URL: https://issues.apache.org/jira/browse/MESOS-7394
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, tests
>Reporter: James Peach
>
> Some of the {{libprocess}} tests will allocate a {{Process}} on the stack and 
> then {{wait}} on it at the end of the test. If the test fails before the 
> process is waited on, however, {{gtest}} returns from the test function, 
> causing the stack variable to be deallocated. However, there is still a 
> pointer to it in the {{libprocess}} {{ProcessManager}}, so when {{libprocess}} 
> finalizes, we end up throwing exceptions because the stack variable is 
> trashed.
> For example:
> {code}
> #0  0x743a291f in raise () from /lib64/libc.so.6
> #1  0x743a451a in abort () from /lib64/libc.so.6
> #2  0x74ce452d in __gnu_cxx::__verbose_terminate_handler() () from 
> /lib64/libstdc++.so.6
> #3  0x74ce22d6 in ?? () from /lib64/libstdc++.so.6
> #4  0x74ce2321 in std::terminate() () from /lib64/libstdc++.so.6
> #5  0x74ce2539 in __cxa_throw () from /lib64/libstdc++.so.6
> #6  0x74d0c02f in std::__throw_length_error(char const*) () from 
> /lib64/libstdc++.so.6
> #7  0x74d76a5c in std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::_M_create(unsigned long&, unsigned long) () from 
> /lib64/libstdc++.so.6
> #8  0x74d794ed in void std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::_M_construct<char*>(char*, char*, 
> std::forward_iterator_tag) () from /lib64/libstdc++.so.6
> #9  0x74d7954f in std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::basic_string(std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const&) () from 
> /lib64/libstdc++.so.6
> #10 0x0041d413 in process::UPID::UPID (this=0x7fffdd00, that=...) 
> at ../../../3rdparty/libprocess/include/process/pid.hpp:44
> #11 0x00430d86 in process::ProcessBase::self (this=0x7fffd338) at 
> ../../../3rdparty/libprocess/include/process/process.hpp:76
> #12 0x0087d62e in process::ProcessManager::finalize (this=0xdc5c50) 
> at ../../../3rdparty/libprocess/src/process.cpp:2682
> #13 0x008749f7 in process::finalize (finalize_wsa=true) at 
> ../../../3rdparty/libprocess/src/process.cpp:1316
> #14 0x005dae1e in main (argc=1, argv=0x7fffdfe8) at 
> ../../../3rdparty/libprocess/src/tests/main.cpp:82
> {code}
> An example test is {{ProcessTest.Http1}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7394) libprocess test failures double-free stack-allocated processes

2017-04-17 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-7394:
---
Component/s: tests
 libprocess

> libprocess test failures double-free stack-allocated processes
> --
>
> Key: MESOS-7394
> URL: https://issues.apache.org/jira/browse/MESOS-7394
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess, tests
>Reporter: James Peach
>
> Some of the {{libprocess}} tests will allocate a {{Process}} on the stack and 
> then {{wait}} on it at the end of the test. If the test fails before the 
> process is waited on, however, {{gtest}} returns from the test function, 
> causing the stack variable to be deallocated. However, there is still a 
> pointer to it in the {{libprocess}} {{ProcessManager}}, so when {{libprocess}} 
> finalizes, we end up throwing exceptions because the stack variable is 
> trashed.
> For example:
> {code}
> #0  0x743a291f in raise () from /lib64/libc.so.6
> #1  0x743a451a in abort () from /lib64/libc.so.6
> #2  0x74ce452d in __gnu_cxx::__verbose_terminate_handler() () from 
> /lib64/libstdc++.so.6
> #3  0x74ce22d6 in ?? () from /lib64/libstdc++.so.6
> #4  0x74ce2321 in std::terminate() () from /lib64/libstdc++.so.6
> #5  0x74ce2539 in __cxa_throw () from /lib64/libstdc++.so.6
> #6  0x74d0c02f in std::__throw_length_error(char const*) () from 
> /lib64/libstdc++.so.6
> #7  0x74d76a5c in std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::_M_create(unsigned long&, unsigned long) () from 
> /lib64/libstdc++.so.6
> #8  0x74d794ed in void std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::_M_construct<char*>(char*, char*, 
> std::forward_iterator_tag) () from /lib64/libstdc++.so.6
> #9  0x74d7954f in std::__cxx11::basic_string<char, std::char_traits<char>, 
> std::allocator<char> >::basic_string(std::__cxx11::basic_string<char, 
> std::char_traits<char>, std::allocator<char> > const&) () from 
> /lib64/libstdc++.so.6
> #10 0x0041d413 in process::UPID::UPID (this=0x7fffdd00, that=...) 
> at ../../../3rdparty/libprocess/include/process/pid.hpp:44
> #11 0x00430d86 in process::ProcessBase::self (this=0x7fffd338) at 
> ../../../3rdparty/libprocess/include/process/process.hpp:76
> #12 0x0087d62e in process::ProcessManager::finalize (this=0xdc5c50) 
> at ../../../3rdparty/libprocess/src/process.cpp:2682
> #13 0x008749f7 in process::finalize (finalize_wsa=true) at 
> ../../../3rdparty/libprocess/src/process.cpp:1316
> #14 0x005dae1e in main (argc=1, argv=0x7fffdfe8) at 
> ../../../3rdparty/libprocess/src/tests/main.cpp:82
> {code}
> An example test is {{ProcessTest.Http1}}.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7394) libprocess test failures double-free stack-allocated processes

2017-04-17 Thread James Peach (JIRA)
James Peach created MESOS-7394:
--

 Summary: libprocess test failures double-free stack-allocated 
processes
 Key: MESOS-7394
 URL: https://issues.apache.org/jira/browse/MESOS-7394
 Project: Mesos
  Issue Type: Bug
Reporter: James Peach


Some of the {{libprocess}} tests will allocate a {{Process}} on the stack and 
then {{wait}} on it at the end of the test. If the test fails before the 
process is waited on, however, {{gtest}} returns from the test function, causing 
the stack variable to be deallocated. However, there is still a pointer to it in 
the {{libprocess}} {{ProcessManager}}, so when {{libprocess}} finalizes, we end 
up throwing exceptions because the stack variable is trashed.

For example:
{code}
#0  0x743a291f in raise () from /lib64/libc.so.6
#1  0x743a451a in abort () from /lib64/libc.so.6
#2  0x74ce452d in __gnu_cxx::__verbose_terminate_handler() () from 
/lib64/libstdc++.so.6
#3  0x74ce22d6 in ?? () from /lib64/libstdc++.so.6
#4  0x74ce2321 in std::terminate() () from /lib64/libstdc++.so.6
#5  0x74ce2539 in __cxa_throw () from /lib64/libstdc++.so.6
#6  0x74d0c02f in std::__throw_length_error(char const*) () from 
/lib64/libstdc++.so.6
#7  0x74d76a5c in std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::_M_create(unsigned long&, unsigned long) () from 
/lib64/libstdc++.so.6
#8  0x74d794ed in void std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::_M_construct<char*>(char*, char*, 
std::forward_iterator_tag) () from /lib64/libstdc++.so.6
#9  0x74d7954f in std::__cxx11::basic_string<char, std::char_traits<char>, 
std::allocator<char> >::basic_string(std::__cxx11::basic_string<char, 
std::char_traits<char>, std::allocator<char> > const&) () from 
/lib64/libstdc++.so.6
#10 0x0041d413 in process::UPID::UPID (this=0x7fffdd00, that=...) 
at ../../../3rdparty/libprocess/include/process/pid.hpp:44
#11 0x00430d86 in process::ProcessBase::self (this=0x7fffd338) at 
../../../3rdparty/libprocess/include/process/process.hpp:76
#12 0x0087d62e in process::ProcessManager::finalize (this=0xdc5c50) at 
../../../3rdparty/libprocess/src/process.cpp:2682
#13 0x008749f7 in process::finalize (finalize_wsa=true) at 
../../../3rdparty/libprocess/src/process.cpp:1316
#14 0x005dae1e in main (argc=1, argv=0x7fffdfe8) at 
../../../3rdparty/libprocess/src/tests/main.cpp:82
{code}

An example test is {{ProcessTest.Http1}}.
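
A minimal sketch of the pattern described above, assuming a gtest-style test; 
{{ExampleProcess}} and {{somethingThatMightFail()}} are made-up names, and this 
is not the actual {{ProcessTest.Http1}} code:

{code}
#include <gtest/gtest.h>

#include <process/process.hpp>

// Made-up Process subclass for illustration.
class ExampleProcess : public process::Process<ExampleProcess> {};

// Stand-in for whatever assertion might fail partway through a test.
static bool somethingThatMightFail() { return false; }

TEST(ExampleTest, StackAllocatedProcess)
{
  ExampleProcess process;    // Allocated on the test's stack frame.
  process::spawn(process);   // ProcessManager now holds a pointer to it.

  // If this assertion fails, gtest returns from the test body immediately,
  // so the terminate()/wait() calls below never run. The stack storage is
  // reclaimed while ProcessManager still references the process, and the
  // dangling pointer is later used during process::finalize().
  ASSERT_TRUE(somethingThatMightFail());

  process::terminate(process);
  process::wait(process);
}
{code}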



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-04-17 Thread Megha Sharma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971344#comment-15971344
 ] 

Megha Sharma commented on MESOS-6223:
-

[~neilc] On it -- I am looking into the test failure and should have the patch 
ready soon.

> Allow agents to re-register post a host reboot
> --
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Megha Sharma
>Assignee: Megha Sharma
>
> The agent doesn't recover its state after a host reboot; it registers with the 
> master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The executors 
> are terminated on the agent anyway when it reboots, so there is no harm in 
> letting the agent keep its SlaveID, re-register with the master, and reconcile 
> the lost executors. This is a prerequisite for supporting 
> persistent/restartable tasks in Mesos (MESOS-3545).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-6223) Allow agents to re-register post a host reboot

2017-04-17 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971303#comment-15971303
 ] 

Neil Conway edited comment on MESOS-6223 at 4/17/17 4:32 PM:
-

[~xds2000] -- There is a known test failure that AFAIK hasn't been resolved yet 
(details are on ReviewBoard). I'm waiting for that to be addressed before I dig 
into these changes more deeply -- but I'd like to get this change wrapped up 
and shipped pretty soon. cc [~megha.sharma] [~xujyan]


was (Author: neilc):
[~xds2000] -- There is a known test failure that AFAIK hasn't been resolved yet 
(details are on ReviewBoard). I'm waiting for that to be addressed before I dig 
into these changes more deeply -- but I'd like to get this change wrapped up 
and shipped pretty soon.

> Allow agents to re-register post a host reboot
> --
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Megha Sharma
>Assignee: Megha Sharma
>
> The agent doesn't recover its state after a host reboot; it registers with the 
> master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The executors 
> are terminated on the agent anyway when it reboots, so there is no harm in 
> letting the agent keep its SlaveID, re-register with the master, and reconcile 
> the lost executors. This is a prerequisite for supporting 
> persistent/restartable tasks in Mesos (MESOS-3545).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-04-17 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971303#comment-15971303
 ] 

Neil Conway commented on MESOS-6223:


[~xds2000] -- There is a known test failure that AFAIK hasn't been resolved yet 
(details are on ReviewBoard). I'm waiting for that to be addressed before I dig 
into these changes more deeply -- but I'd like to get this change wrapped up 
and shipped pretty soon.

> Allow agents to re-register post a host reboot
> --
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Megha Sharma
>Assignee: Megha Sharma
>
> The agent doesn't recover its state after a host reboot; it registers with the 
> master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The executors 
> are terminated on the agent anyway when it reboots, so there is no harm in 
> letting the agent keep its SlaveID, re-register with the master, and reconcile 
> the lost executors. This is a prerequisite for supporting 
> persistent/restartable tasks in Mesos (MESOS-3545).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Updated] (MESOS-7393) Make subversion an optional dependency.

2017-04-17 Thread James Peach (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7393?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Peach updated MESOS-7393:
---
Description: AFAICT the {{mesos-master}} and {{mesos-agent}} themselves do 
not use the replicated log features that require libsvn support. To reduce the 
number of Mesos dependencies, we could make libsvn a build-time option.  (was: 
AFAICT the {{mesas-master}} and {{mesas-agent}} themselves do no use the 
replicated log features that require libsvn support. To reduce the number of 
Mesos dependencies we could make libsvn a build-time option.)

> Make subversion an optional dependency.
> ---
>
> Key: MESOS-7393
> URL: https://issues.apache.org/jira/browse/MESOS-7393
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: James Peach
>Priority: Minor
>
> AFAICT the {{mesos-master}} and {{mesos-agent}} themselves do not use the 
> replicated log features that require libsvn support. To reduce the number of 
> Mesos dependencies, we could make libsvn a build-time option.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Created] (MESOS-7393) Make subversion an optional dependency.

2017-04-17 Thread James Peach (JIRA)
James Peach created MESOS-7393:
--

 Summary: Make subversion an optional dependency.
 Key: MESOS-7393
 URL: https://issues.apache.org/jira/browse/MESOS-7393
 Project: Mesos
  Issue Type: Bug
  Components: build
Reporter: James Peach
Priority: Minor


AFAICT the {{mesos-master}} and {{mesos-agent}} themselves do not use the 
replicated log features that require libsvn support. To reduce the number of 
Mesos dependencies, we could make libsvn a build-time option.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7280) Unified containerizer provisions docker image error with COPY backend

2017-04-17 Thread Chun-Hung Hsiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15971095#comment-15971095
 ] 

Chun-Hung Hsiao commented on MESOS-7280:


Can you describe in more detail how the files are linked? Is it the case that 
/usr/bin/python links to some 2.6 binary in a lower layer and is then changed 
to link to some 2.7 binary in an upper layer? I suspect that this bug might be 
related to how symbolic links are handled in the copy backend.
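
A minimal Dockerfile sketch of the layering pattern being asked about; the base 
image, paths, and versions are guesses for illustration only:

{noformat}
FROM centos:7

# Lower layer: /usr/bin/python is a symlink to a 2.6 binary.
RUN ln -sf /usr/bin/python2.6 /usr/bin/python

# Upper layer: the same symlink is re-pointed to a 2.7 binary, so flattening
# the image with the copy backend has to replace an existing symlink.
RUN ln -sf /usr/bin/python2.7 /usr/bin/python
{noformat}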

> Unified containerizer provisions docker image error with COPY backend
> -
>
> Key: MESOS-7280
> URL: https://issues.apache.org/jira/browse/MESOS-7280
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 1.0.2, 1.2.0
> Environment: CentOS 7.2,ext4, COPY
>Reporter: depay
>Assignee: Chun-Hung Hsiao
>  Labels: copy-backend
>
> Error occurs on some specific docker images with COPY backend, both 1.0.2 and 
> 1.2.0. It works well with OVERLAY backend on 1.2.0.
> {quote}
> I0321 09:36:07.308830 27613 paths.cpp:528] Trying to chown 
> '/data/mesos/slaves/55f6df5e-2812-40a0-baf5-ce96f20677d3-S102/frameworks/20151223-150303-2677017098-5050-30032-/executors/ct:Transcoding_Test_114489497_1490060156172:3/runs/7e518538-7b56-4b14-a3c9-bee43c669bd7'
>  to user 'root'
> I0321 09:36:07.319628 27613 slave.cpp:5703] Launching executor 
> ct:Transcoding_Test_114489497_1490060156172:3 of framework 
> 20151223-150303-2677017098-5050-30032- with resources cpus(*):0.1; 
> mem(*):32 in work directory 
> '/data/mesos/slaves/55f6df5e-2812-40a0-baf5-ce96f20677d3-S102/frameworks/20151223-150303-2677017098-5050-30032-/executors/ct:Transcoding_Test_114489497_1490060156172:3/runs/7e518538-7b56-4b14-a3c9-bee43c669bd7'
> I0321 09:36:07.321436 27615 containerizer.cpp:781] Starting container 
> '7e518538-7b56-4b14-a3c9-bee43c669bd7' for executor 
> 'ct:Transcoding_Test_114489497_1490060156172:3' of framework 
> '20151223-150303-2677017098-5050-30032-'
> I0321 09:36:37.902195 27600 provisioner.cpp:294] Provisioning image rootfs 
> '/data/mesos/provisioner/containers/7e518538-7b56-4b14-a3c9-bee43c669bd7/backends/copy/rootfses/8d2f7fe8-71ff-4317-a33c-a436241a93d9'
>  for container 7e518538-7b56-4b14-a3c9-bee43c669bd7
> *E0321 09:36:58.707718 27606 slave.cpp:4000] Container 
> '7e518538-7b56-4b14-a3c9-bee43c669bd7' for executor 
> 'ct:Transcoding_Test_114489497_1490060156172:3' of framework 
> 20151223-150303-2677017098-5050-30032- failed to start: Collect failed: 
> Failed to copy layer: cp: cannot create regular file 
> ‘/data/mesos/provisioner/containers/7e518538-7b56-4b14-a3c9-bee43c669bd7/backends/copy/rootfses/8d2f7fe8-71ff-4317-a33c-a436241a93d9/usr/bin/python’:
>  Text file busy*
> I0321 09:36:58.707991 27608 containerizer.cpp:1622] Destroying container 
> '7e518538-7b56-4b14-a3c9-bee43c669bd7'
> I0321 09:36:58.708468 27607 provisioner.cpp:434] Destroying container rootfs 
> at 
> '/data/mesos/provisioner/containers/7e518538-7b56-4b14-a3c9-bee43c669bd7/backends/copy/rootfses/8d2f7fe8-71ff-4317-a33c-a436241a93d9'
>  for container 7e518538-7b56-4b14-a3c9-bee43c669bd7
> {quote}
> The Docker image is a private one, so I will have to try to reproduce this 
> bug with a sample Dockerfile if possible.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-6223) Allow agents to re-register post a host reboot

2017-04-17 Thread Deshi Xiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970939#comment-15970939
 ] 

Deshi Xiao commented on MESOS-6223:
---

[~neilc]  do you have any update on this patch: 
https://reviews.apache.org/r/56895/

> Allow agents to re-register post a host reboot
> --
>
> Key: MESOS-6223
> URL: https://issues.apache.org/jira/browse/MESOS-6223
> Project: Mesos
>  Issue Type: Improvement
>  Components: agent
>Reporter: Megha Sharma
>Assignee: Megha Sharma
>
> The agent doesn't recover its state after a host reboot; it registers with the 
> master and gets a new SlaveID. With partition awareness, agents are now 
> allowed to re-register after they have been marked Unreachable. The executors 
> are terminated on the agent anyway when it reboots, so there is no harm in 
> letting the agent keep its SlaveID, re-register with the master, and reconcile 
> the lost executors. This is a prerequisite for supporting 
> persistent/restartable tasks in Mesos (MESOS-3545).



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-7078) Benchmarks to validate perf impact of hierarchical sorting

2017-04-17 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970876#comment-15970876
 ] 

Jay Guo edited comment on MESOS-7078 at 4/17/17 9:04 AM:
-

[~neilc] I built a tree of clients in {{Sorter_BENCHMARK_Test.FullSort}} and the 
performance degrades pretty badly. I guess it may be inevitable due to tree 
traversal. Should I add this test to capture it?


was (Author: guoger):
I built a tree of client in {{Sorter_BENCHMARK_Test.FullSort}} and the 
performance downgrades pretty badly. I guess it may be inevitable due to tree 
traversal. Should I add this test to capture it?

> Benchmarks to validate perf impact of hierarchical sorting
> --
>
> Key: MESOS-7078
> URL: https://issues.apache.org/jira/browse/MESOS-7078
> Project: Mesos
>  Issue Type: Task
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
>
> Depending on how deeply we need to change the sorter/allocator, we should 
> ensure we take the time to run the existing benchmarks (and perhaps write new 
> benchmarks) to ensure we don't regress performance for existing 
> sorter/allocator use cases.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Comment Edited] (MESOS-7078) Benchmarks to validate perf impact of hierarchical sorting

2017-04-17 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970876#comment-15970876
 ] 

Jay Guo edited comment on MESOS-7078 at 4/17/17 9:03 AM:
-

I built a tree of clients in {{Sorter_BENCHMARK_Test.FullSort}} and the 
performance degrades pretty badly. I guess it may be inevitable due to tree 
traversal. Should I add this test to capture it?


was (Author: guoger):
Should we also add benchmark tests with hierarchical roles? More specifically, 
build a tree of clients and perform same procedures as 
{{Sorter_BENCHMARK_Test.FullSort}}.

> Benchmarks to validate perf impact of hierarchical sorting
> --
>
> Key: MESOS-7078
> URL: https://issues.apache.org/jira/browse/MESOS-7078
> Project: Mesos
>  Issue Type: Task
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
>
> Depending on how deeply we need to change the sorter/allocator, we should 
> ensure we take the time to run the existing benchmarks (and perhaps write new 
> benchmarks) to ensure we don't regress performance for existing 
> sorter/allocator use cases.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)


[jira] [Commented] (MESOS-7078) Benchmarks to validate perf impact of hierarchical sorting

2017-04-17 Thread Jay Guo (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7078?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15970876#comment-15970876
 ] 

Jay Guo commented on MESOS-7078:


Should we also add benchmark tests with hierarchical roles? More specifically, 
build a tree of clients and perform the same procedures as 
{{Sorter_BENCHMARK_Test.FullSort}}.
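
One possible way to build such a tree of clients is sketched below; the helper 
name, naming scheme, depth, and fan-out are illustrative assumptions, not the 
actual benchmark code:

{code}
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Generate every client name in a full tree of the given depth and fan-out,
// e.g. "c0", "c1", "c0/c0", "c0/c1", ... These names could then be added to
// the sorter and allocated resources the same way FullSort does for a flat
// list of clients.
std::vector<std::string> makeClientTree(std::size_t depth, std::size_t fanout)
{
  std::vector<std::string> frontier = {""};  // "" stands for the implicit root.
  std::vector<std::string> clients;

  for (std::size_t level = 0; level < depth; ++level) {
    std::vector<std::string> next;
    for (const std::string& parent : frontier) {
      for (std::size_t i = 0; i < fanout; ++i) {
        std::string client =
          (parent.empty() ? "" : parent + "/") + "c" + std::to_string(i);
        clients.push_back(client);
        next.push_back(std::move(client));
      }
    }
    frontier = std::move(next);
  }

  return clients;  // e.g. depth=3, fanout=2 yields 2 + 4 + 8 = 14 clients.
}
{code}

The deeper and wider the tree, the more internal nodes a hierarchical sort has 
to traverse.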

> Benchmarks to validate perf impact of hierarchical sorting
> --
>
> Key: MESOS-7078
> URL: https://issues.apache.org/jira/browse/MESOS-7078
> Project: Mesos
>  Issue Type: Task
>Reporter: Neil Conway
>Assignee: Neil Conway
>  Labels: mesosphere
>
> Depending on how deeply we need to change the sorter/allocator, we should 
> ensure we take the time to run the existing benchmarks (and perhaps write new 
> benchmarks) to ensure we don't regress performance for existing 
> sorter/allocator use cases.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)