[jira] [Commented] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2

2017-10-12 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16202631#comment-16202631
 ] 

Jie Yu commented on MESOS-7130:
---

Thanks! Committed the fix. Ping me if you guys need a backport to an earlier 
version.
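
If you want to check whether a build you run already carries it, one quick way is to grep the ticket id in the git history (a sketch; assumes a clone of the Apache Mesos repository and that the fix's commit message references MESOS-7130):
{noformat}
# Look for commits referencing this ticket anywhere in the history.
git clone https://github.com/apache/mesos.git && cd mesos
git log --oneline --all --grep='MESOS-7130'
{noformat}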

> port_mapping isolator: executor hangs when running on EC2
> ----------------------------------------------------------
>
> Key: MESOS-7130
> URL: https://issues.apache.org/jira/browse/MESOS-7130
> Project: Mesos
>  Issue Type: Bug
>  Components: executor
>Affects Versions: 1.1.3, 1.2.2, 1.3.1, 1.4.0
>Reporter: Pierre Cheynier
>Assignee: Jie Yu
> Fix For: 1.5.0
>
>
> Hi,
> I'm experiencing a weird issue: I'm using a CI to do testing on 
> infrastructure automation.
> I recently activated the {{network/port_mapping}} isolator.
> I'm able to make the changes work and pass the tests for bare-metal servers 
> and VirtualBox VMs using this configuration.
> But when I try on EC2 (on which my CI pipeline relies) it systematically fails 
> to run any container.
> It appears that the sandbox is created and the port_mapping isolator seems to 
> be OK according to the logs in stdout and stderr and the {{tc}} output:
> {noformat}
> + mount --make-rslave /run/netns
> + test -f /proc/sys/net/ipv6/conf/all/disable_ipv6
> + echo 1
> + ip link set lo address 02:44:20:bb:42:cf mtu 9001 up
> + ethtool -K eth0 rx off
> (...)
> + tc filter show dev eth0 parent :0
> + tc filter show dev lo parent :0
> I0215 16:01:13.941375 1 exec.cpp:161] Version: 1.0.2
> {noformat}
> Then the executor never comes back to the REGISTERED state and hangs indefinitely.
> {{GLOG_v=3}} doesn't help here.
> My skills in this area are limited, but after loading the symbols and attaching 
> gdb to the mesos-executor process, I'm able to print this stack:
> {noformat}
> #0  0x7feffc1386d5 in pthread_cond_wait@@GLIBC_2.3.2 () from 
> /usr/lib64/libpthread.so.0
> #1  0x7feffbed69ec in 
> std::condition_variable::wait(std::unique_lock<std::mutex>&) () from 
> /usr/lib64/libstdc++.so.6
> #2  0x7ff0003dd8ec in void synchronized_wait<std::condition_variable, 
> std::mutex>(std::condition_variable*, std::mutex*) () from 
> /usr/lib64/libmesos-1.0.2.so
> #3  0x7ff0017d595d in Gate::arrive(long) () from 
> /usr/lib64/libmesos-1.0.2.so
> #4  0x7ff0017c00ed in process::ProcessManager::wait(process::UPID const&) 
> () from /usr/lib64/libmesos-1.0.2.so
> #5  0x7ff0017c5c05 in process::wait(process::UPID const&, Duration 
> const&) () from /usr/lib64/libmesos-1.0.2.so
> #6  0x004ab26f in process::wait(process::ProcessBase const*, Duration 
> const&) ()
> #7  0x004a3903 in main ()
> {noformat}
> I concluded that the underlying shell script launched by the isolator, or the 
> task itself, is just... blocked. But I don't understand why.
> Here is a process tree to show that no task is running but the executor is:
> {noformat}
> root 28420  0.8  3.0 1061420 124940 ?  Ssl  17:56   0:25 
> /usr/sbin/mesos-slave --advertise_ip=127.0.0.1 
> --attributes=platform:centos;platform_major_version:7;type:base 
> --cgroups_enable_cfs --cgroups_hierarchy=/sys/fs/cgroup 
> --cgroups_net_cls_primary_handle=0xC370 
> --container_logger=org_apache_mesos_LogrotateContainerLogger 
> --containerizers=mesos,docker 
> --credential=file:///etc/mesos-chef/slave-credential 
> --default_container_info={"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"}]}
>  --default_role=default --docker_registry=/usr/share/mesos/users 
> --docker_store_dir=/var/opt/mesos/store/docker 
> --egress_unique_flow_per_container --enforce_container_disk_quota 
> --ephemeral_ports_per_container=128 
> --executor_environment_variables={"PATH":"/bin:/usr/bin:/usr/sbin","CRITEO_DC":"par","CRITEO_ENV":"prod"}
>  --image_providers=docker --image_provisioner_backend=copy 
> --isolation=cgroups/cpu,cgroups/mem,cgroups/net_cls,namespaces/pid,disk/du,filesystem/shared,filesystem/linux,docker/runtime,network/cni,network/port_mapping
>  --logging_level=INFO 
> --master=zk://mesos:test@localhost.localdomain:2181/mesos 
> --modules=file:///etc/mesos-chef/slave-modules.json --port=5051 
> --recover=reconnect 
> --resources=ports:[31000-32000];ephemeral_ports:[32768-57344] --strict 
> --work_dir=/var/opt/mesos
> root 28484  0.0  2.3 433676 95016 ?Ssl  17:56   0:00  \_ 
> mesos-logrotate-logger --help=false 
> --log_filename=/var/opt/mesos/slaves/cdf94219-87b2-4af2-9f61-5697f0442915-S0/frameworks/366e8ed2-730e-4423-9324-086704d182b0-/executors/group_simplehttp.16f7c2ee-f3a8-11e6-be1c-0242b44d071f/runs/1d3e6b1c-cda8-47e5-92c4-a161429a7ac6/stdout
>  --logrotate_options=rotate 5 --logrotate_path=logrotate --max_size=10MB
> root 28485  0.0  2.3 499212 94724 ?Ssl  17:56   0:00  \_ 
> mesos-logrotate-logger --help=false 
> 
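
For anyone reproducing the report above, a minimal way to capture the same kind of stack (a sketch; it assumes the Mesos debuginfo packages are installed, and the pgrep pattern is only illustrative):
{noformat}
# Attach gdb to the hung executor and dump every thread's backtrace.
pid=$(pgrep -f mesos-executor | head -n1)
sudo gdb -p "$pid" -batch -ex 'thread apply all bt'
{noformat}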

[jira] [Commented] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2

2017-10-10 Thread Pierre Cheynier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16198468#comment-16198468
 ] 

Pierre Cheynier commented on MESOS-7130:


[~jieyu] Seems fixed; I'm now able to make my pipeline pass using EC2. So 
ashamed that I've been completely blind regarding an MTU issue :).


[jira] [Commented] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2

2017-10-04 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16191565#comment-16191565
 ] 

Vinod Kone commented on MESOS-7130:
---

Story points?


[jira] [Commented] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2

2017-10-02 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16189252#comment-16189252
 ] 

Jie Yu commented on MESOS-7130:
---

[~bgreen], [~pierrecdn], can you guys test this patch:
https://reviews.apache.org/r/62743/

Let me know if that fixes the issue or not.
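
If it helps, one way to try the patch on top of a source checkout (a sketch; the ReviewBoard raw-diff URL pattern is an assumption, and the rebuild/redeploy steps depend on your packaging):
{noformat}
cd mesos                                   # your Mesos source checkout
# Download the proposed diff from the review and apply it.
curl -fsSL https://reviews.apache.org/r/62743/diff/raw/ -o mesos-7130.patch
git apply --check mesos-7130.patch && git apply mesos-7130.patch
# Then rebuild libmesos / the agent and redeploy on an affected node.
{noformat}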


[jira] [Commented] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2

2017-09-25 Thread Pierre Cheynier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178853#comment-16178853
 ] 

Pierre Cheynier commented on MESOS-7130:


Interesting feedback. I had no time to pursue that in February; I'll try to 
see if it fixes the issue in my case.


[jira] [Commented] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2

2017-09-24 Thread Bill Green (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178566#comment-16178566
 ] 

Bill Green commented on MESOS-7130:
---

I'm glad it was helpful [~jieyu]. Thanks for the great work on this isolator.


[jira] [Commented] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2

2017-09-24 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178504#comment-16178504
 ] 

Jie Yu commented on MESOS-7130:
---

[~bgreen] thanks for the info. I checked the code; it looks like we set the 'lo' 
MTU to be the same as the host eth0's, but we forgot to set the eth0 inside the 
namespace to match that of the host eth0.
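
Until the fix lands, a manual check/workaround consistent with that explanation is to align the container-side eth0 MTU with the host's (a sketch; any PID living in the container's network namespace works, the pgrep pattern is illustrative):
{noformat}
# MTU of the host eth0.
host_mtu=$(cat /sys/class/net/eth0/mtu)

# Enter the container's network namespace via the executor's PID and
# compare / align the eth0 MTU.
pid=$(pgrep -f mesos-executor | head -n1)
sudo nsenter -t "$pid" -n ip link show eth0
sudo nsenter -t "$pid" -n ip link set dev eth0 mtu "$host_mtu"
{noformat}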


[jira] [Commented] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2

2017-09-24 Thread Bill Green (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16178496#comment-16178496
 ] 

Bill Green commented on MESOS-7130:
---

I just ran into this exact same problem using the port-mapping isolator 
compiled into a DCOS build (Mesos 1.2.2, DCOS 1.9.2). The problem agents are 
running on bare-metal CoreOS 1298.7.0.

We use bonding on these hosts, so our interface name is bond0.

I noticed that inside the network namespace, the MTU for the bond0 interface 
showed as 1500, which is different from the host bond0, on which we use an MTU 
of 9000.

After I changed our host's bond0 MTU to 1500, the port-mapping isolator behaved 
as expected.

It seems like the isolator assumes an MTU value of 1500 when it mirrors the 
interface, which breaks path MTU discovery.

It wouldn't surprise me if the MTUs on AWS hosts were different from 1500.
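
A quick way to confirm that mismatch on an affected agent (a sketch; bond0 is the interface name from this setup, and the loop assumes the container namespaces are visible via {{ip netns}} as elsewhere in this thread):
{noformat}
# MTU on the host side.
ip link show bond0 | grep -o 'mtu [0-9]*'

# MTU of the mirrored interface inside each container network namespace.
for ns in $(ip netns list | awk '{print $1}'); do
  echo "netns ${ns}:"
  sudo ip netns exec "$ns" ip link show bond0 | grep -o 'mtu [0-9]*'
done
{noformat}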


[jira] [Commented] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2

2017-03-20 Thread Dominic Gregoire (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15932515#comment-15932515
 ] 

Dominic Gregoire commented on MESOS-7130:
-

I might have run into the same issue, using {{mesos 1.1.0}} with {{libnl 
3.2.29}}, on an instance with an {{ena}} interface.

The agent is running with these flags:
{noformat}
export MESOS_isolation=cgroups/cpu,cgroups/mem,network/port_mapping
export MESOS_containerizers=mesos
export MESOS_resources="ports:[31000-32000];ephemeral_ports:[32768-57344]"
export MESOS_ephemeral_ports_per_container=1024
{noformat}

Running Spark 2.1.0 with 2 Mesos containers on the same host: they can connect 
to each other's block manager but can't send traffic; it stays in their send-q.

Spark is logging:
{noformat}
17/03/19 16:54:56 INFO TransportClientFactory: Successfully created connection 
to ip-10-32-20-34.ec2.internal/10.32.20.34:34294 after 12 ms (0 ms spent in 
bootstraps)
17/03/19 16:56:56 ERROR TransportChannelHandler: Connection to 
ip-10-32-20-34.ec2.internal/10.32.20.34:34294 has been quiet for 120000 ms 
while there are outstanding requests. Assuming connection is dead;
please adjust spark.network.timeout if this is wrong.
{noformat}

I can see connections established between the containers, but everything stays 
in the Send-Qs:
{noformat}
[root@ip-10-32-20-34 sysctl.d]# ip netns
4602 (id: 1)
4600 (id: 0)
[root@ip-10-32-20-34 sysctl.d]# ip netns exec 4600 netstat -an
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 10.32.20.34:32861       0.0.0.0:*               LISTEN
tcp        0      0 0.0.0.0:33003           0.0.0.0:*               LISTEN
tcp        0      0 10.32.20.34:33003       10.32.20.34:57363       ESTABLISHED
tcp        0      0 10.32.20.34:33566       10.32.20.34:34294       ESTABLISHED
tcp        0      0 10.32.20.34:33658       10.32.18.185:40600      ESTABLISHED
tcp        0      0 10.32.20.34:32832       10.32.18.185:40196      ESTABLISHED
tcp        0      0 10.32.20.34:33406       10.32.20.34:5051        ESTABLISHED
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node   Path
unix  2      [ ]         STREAM     CONNECTED     21869
unix  2      [ ]         STREAM     CONNECTED     20339
[root@ip-10-32-20-34 sysctl.d]# ip netns exec 4602 netstat -an
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State
tcp        0      0 0.0.0.0:33836           0.0.0.0:*               LISTEN
tcp        0      0 10.32.20.34:34294       0.0.0.0:*               LISTEN
tcp        0  24229 10.32.20.34:34294       10.32.20.34:33566       ESTABLISHED
tcp        0      0 10.32.20.34:33860       10.32.18.185:40196      ESTABLISHED
tcp        0      0 10.32.20.34:34680       10.32.18.185:40600      ESTABLISHED
tcp        0      0 10.32.20.34:34434       10.32.20.34:5051        ESTABLISHED
tcp        0      0 10.32.20.34:33836       10.32.20.34:58149       ESTABLISHED
Active UNIX domain sockets (servers and established)
Proto RefCnt Flags       Type       State         I-Node   Path
unix  2      [ ]         STREAM     CONNECTED     20359
unix  2      [ ]         STREAM     CONNECTED     20373
{noformat}
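
The non-zero Send-Q above is consistent with the MTU mismatch discussed elsewhere in this thread. One way to dig into the stuck connection from inside the namespaces (a sketch; the namespace names and port are the ones listed above):
{noformat}
# Per-connection TCP details (negotiated MSS, retransmit counters) for the stuck flow.
sudo ip netns exec 4602 ss -tin 'sport = :34294'
sudo ip netns exec 4600 ss -tin 'dport = :34294'

# Compare interface MTUs inside the namespaces with the host interface.
sudo ip netns exec 4602 ip link show
ip link show
{noformat}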



[jira] [Commented] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2

2017-02-16 Thread Pierre Cheynier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870440#comment-15870440
 ] 

Pierre Cheynier commented on MESOS-7130:


Update: I tried to test with the Intel interface & driver instead of vif 
(docs.aws.amazon.com/en_en/AWSEC2/latest/UserGuide/sriov-networking.html), but 
I now have issues related to networking: my box is just not able to fetch its 
config, SSH keys, etc. I probably have to check the Intel ixgbevf driver...
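
For what it's worth, a quick way to see which driver actually backs the interface before and after switching (a sketch; interface and module names follow this comment):
{noformat}
# Kernel driver bound to eth0 (e.g. the Xen paravirtual vif vs. ixgbevf for SR-IOV).
ethtool -i eth0

# Version and options of the ixgbevf module, if it is loaded.
modinfo ixgbevf | head
{noformat}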


[jira] [Commented] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2

2017-02-16 Thread Pierre Cheynier (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15869834#comment-15869834
 ] 

Pierre Cheynier commented on MESOS-7130:


[~avinash.mesos] Here is my setup:
* CentOS 7.2.1511 
* LTS kernel (4.4.21 at that time, because we have an internal version-freeze 
mechanism)
* libnl 3.2.28 (we moved to the one published in January on the CentOS repos: 
https://www.rpmfind.net/linux/RPM/centos/updates/7.3.1611/x86_64/Packages/libnl3-3.2.28-3.el7_3.x86_64.html).

Every environment (physical, vbox, EC2) is on the same setup and even points 
to the same internal RPM mirrors, so every package should be the same.
We use Packer to build our images and most of the steps are common between vbox 
and EC2 AMIs (kernel upgrade, internal mirror, etc.).

The install of the Mesos stack itself is performed via Chef and test-kitchen.

The executor is the default {{mesos-executor}} (command executor) in any case.

We are currently wondering about side effects of doing things like {{ethtool 
-K eth0 rx off}}, setting the same MAC in the netns, etc., on the {{vif}} 
driver (here the Xen virtual interface, i.e. network paravirtualization).

Typically, some communications seem partially blocked...
{noformat}
# eth0 interface in the root netns
[centos@ip-10-0-143-253 ~]$ ip a s eth0
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 9001 qdisc fq_codel state UP 
qlen 1000
link/ether 02:64:28:aa:e4:0d brd ff:ff:ff:ff:ff:ff
inet 10.0.143.253/16 brd 10.0.255.255 scope global dynamic eth0
   valid_lft 3436sec preferred_lft 3436sec
inet6 fe80::64:28ff:feaa:e40d/64 scope link 
   valid_lft forever preferred_lft forever
# Enter the netns of the task
[centos@ip-10-0-143-253 ~]$ sudo nsenter -t 9039 -n 
# Curl a simple endpoint that will return a short answer
[root@ip-10-0-143-253 centos]# curl -vv http://10.0.143.253:5051/ -m 5
* About to connect() to 10.0.143.253 port 5051 (#0)
*   Trying 10.0.143.253...
* Connected to 10.0.143.253 (10.0.143.253) port 5051 (#0)
> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.0.143.253:5051
> Accept: */*
> 
< HTTP/1.1 404 Not Found
< Date: Thu, 16 Feb 2017 12:28:08 GMT
< Content-Length: 0
< 
* Connection #0 to host 10.0.143.253 left intact
# Now curl something bigger
[root@ip-10-0-143-253 centos]# curl -vv 
http://10.0.143.253:5051/monitor/statistics.json -m 5
* About to connect() to 10.0.143.253 port 5051 (#0)
*   Trying 10.0.143.253...
* Connected to 10.0.143.253 (10.0.143.253) port 5051 (#0)
> GET /monitor/statistics.json HTTP/1.1
> User-Agent: curl/7.29.0
> Host: 10.0.143.253:5051
> Accept: */*
> 
* Operation timed out after 5001 milliseconds with 0 out of -1 bytes received
* Closing connection 0
curl: (28) Operation timed out after 5001 milliseconds with 0 out of -1 bytes 
received
[root@ip-10-0-143-253 centos]# ip  a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 9001 qdisc noqueue state UNKNOWN qlen 1
link/loopback 02:64:28:aa:e4:0d brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
   valid_lft forever preferred_lft forever
2: eth0@if16: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP 
qlen 1000
link/ether 02:64:28:aa:e4:0d brd ff:ff:ff:ff:ff:ff link-netnsid 0
inet 10.0.143.253/16 scope global eth0
   valid_lft forever preferred_lft forever
{noformat}

Doing a tcpdump in the netns shows the 3-way handshake, the sender payload, the 
corresponding ACK and ... nothing.
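
A quick way to dig further into that symptom (a sketch; PID 9039 is the one from the nsenter example above) is to compare the MTU seen by the host with the one inside the task's namespace and to capture the stalled transfer:
{noformat}
# Host vs. in-namespace MTU for eth0.
ip link show eth0 | grep -o 'mtu [0-9]*'
sudo nsenter -t 9039 -n ip link show eth0 | grep -o 'mtu [0-9]*'

# Watch where the large response stalls while re-running the failing curl.
sudo nsenter -t 9039 -n tcpdump -ni eth0 'tcp port 5051' &
sudo nsenter -t 9039 -n curl -m 5 http://10.0.143.253:5051/monitor/statistics.json
{noformat}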


[jira] [Commented] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2

2017-02-15 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868815#comment-15868815
 ] 

Avinash Sridharan commented on MESOS-7130:
--

[~pierrecdn] what distro are you running Mesos on in EC2? Is it different 
from what you are running on the Vagrant boxes? The port-mapping isolator 
relies on libnl as well, so it might be worth comparing the libnl versions on 
both platforms.

That said, the port-mapping isolator doesn't have any dependency on the 
underlying network, so technically it shouldn't matter whether you are running 
it on your Vagrant boxes or on EC2.

Also, I am assuming this is the command executor and not a custom executor that 
you are trying to run?
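
A quick way to collect that information on both environments (a sketch; package names assume an RPM-based distro like the reporter's CentOS):
{noformat}
# Distro, kernel and libnl versions on each host, for comparison.
cat /etc/os-release
uname -r
rpm -qa | grep -i libnl

# The libnl actually loaded by the Mesos library (if built with the port_mapping isolator).
ldd /usr/lib64/libmesos-1.0.2.so | grep -i libnl
{noformat}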


[jira] [Commented] (MESOS-7130) port_mapping isolator: executor hangs when running on EC2

2017-02-15 Thread Anand Mazumdar (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7130?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15868493#comment-15868493
 ] 

Anand Mazumdar commented on MESOS-7130:
---

[~gilbert] [~avinash.mesos] Do you have any insights into this?
