Re: Agent Error with Large Docker Images

2017-07-19 Thread thomas.kurmann
Thank you very much for that hint. With overlayfs it is now working!

On Wed, 2017-07-19 at 07:49 +0800, tommy xiao wrote:
I found this in the log:
using aufs backend

So how about changing the backend fs to overlay?
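
A minimal sketch of the suggestion above, assuming a Linux agent host
and that the backend is chosen with the agent's
--image_provisioner_backend flag (as far as I know it accepts aufs,
bind, copy and overlay; please check the docs for your Mesos version).
The helper names are hypothetical and not part of Mesos. It looks for
overlay in the text of /proc/filesystems and builds the flag, falling
back to the copy backend.

```python
def supports_overlay(proc_filesystems_text):
    """Return True if 'overlay' is listed in /proc/filesystems text."""
    for line in proc_filesystems_text.splitlines():
        fields = line.split()
        # Each line is either "<fsname>" or "nodev <fsname>".
        if fields and fields[-1] == "overlay":
            return True
    return False


def provisioner_flag(proc_filesystems_text):
    """Build the agent flag: overlay if available, otherwise copy."""
    backend = "overlay" if supports_overlay(proc_filesystems_text) else "copy"
    return "--image_provisioner_backend=" + backend


# Usage on the agent host (Linux only):
#   with open("/proc/filesystems") as f:
#       print(provisioner_flag(f.read()))
```

Note that overlay may only show up in /proc/filesystems once the kernel
module is loaded, so a negative result is not conclusive.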

Agent Error with Large Docker Images

2017-07-18 Thread thomas.kurmann
Hi,

We are experiencing a bug on the Mesos agent (1.3.0) when trying to
start large Docker images inside a Mesos container. I have tried
images of several sizes, and the threshold seems to lie somewhere
around 4.5 GB. We have seen this bug with both a custom framework
(deep-mesos) and Marathon. Here is a log of what is happening on the
agent. This does not happen with smaller images.

Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.784018 30042 master.cpp:9320] Adding task git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0 with resources cpus(*)(allocated: *):4; mem(*)(allocated: *):25000; gpus(*)(allocated: *):1; ports(*)(allocated: *):[31000-31000] on agent 816e697d-62d2-465a-bf7c-7b79901e07a3-S4 at slave(1)@130.92.124.103:5051 (otpc103.unibe.ch)
Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.784235 30042 master.cpp:4531] Launching task git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0 of framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014 (Deep Mesos) with resources cpus(*)(allocated: *):4; mem(*)(allocated: *):25000; gpus(*)(allocated: *):1; ports(*)(allocated: *):[31000-31000] on agent 816e697d-62d2-465a-bf7c-7b79901e07a3-S4 at slave(1)@130.92.124.103:5051 (otpc103.unibe.ch)
Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.785534 30023 slave.cpp:1613] Got assigned task 'git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0' for framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014
Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.786010 30038 hierarchical.cpp:850] Updated allocation of framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014 on agent 816e697d-62d2-465a-bf7c-7b79901e07a3-S4 from gpus(*)(allocated: *):1; cpus(*)(allocated: *):8; mem(*)(allocated: *):31099; disk(*)(allocated: *):56156; ports(*)(allocated: *):[31000-32000] to gpus(*)(allocated: *):1; cpus(*)(allocated: *):8; mem(*)(allocated: *):31099; disk(*)(allocated: *):56156; ports(*)(allocated: *):[31000-32000]
Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.786223 30023 gc.cpp:83] Unscheduling '/var/lib/mesos/agent/slaves/816e697d-62d2-465a-bf7c-7b79901e07a3-S4/frameworks/c7161dd3-0bbc-4032-92c2-5477082d2c08-0014' from gc
Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.786487 30023 slave.cpp:1894] Authorizing task 'git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0' for framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014
Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.787127 30029 slave.cpp:2081] Launching task 'git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0' for framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014
Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.789391 30029 paths.cpp:573] Trying to chown '/var/lib/mesos/agent/slaves/816e697d-62d2-465a-bf7c-7b79901e07a3-S4/frameworks/c7161dd3-0bbc-4032-92c2-5477082d2c08-0014/executors/git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0/runs/c2343739-4252-4778-8902-9bedd514c3cd' to user 'root'
Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.789891 30029 slave.cpp:6933] Launching executor 'git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0' of framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014 with resources cpus(*)(allocated: *):0.1; mem(*)(allocated: *):32 in work directory '/var/lib/mesos/agent/slaves/816e697d-62d2-465a-bf7c-7b79901e07a3-S4/frameworks/c7161dd3-0bbc-4032-92c2-5477082d2c08-0014/executors/git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0/runs/c2343739-4252-4778-8902-9bedd514c3cd'
Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.790630 30029 slave.cpp:2310] Queued task 'git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0' for executor 'git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0' of framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014
Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.790971 30022 docker.cpp:1148] Skipping non-docker container
Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.791677 30028 containerizer.cpp:1001] Starting container c2343739-4252-4778-8902-9bedd514c3cd for executor 'git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0' of framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014
Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.799257 30028 provisioner.cpp:453] Provisioning image rootfs '/var/lib/mesos/agent/provisioner/containers/c2343739-4252-4778-8902-9bedd514c3cd/backends/aufs/rootfses/2eed6b86-66f1-46a0-9fc3-1c8b22bff399' for container c2343739-4252-4778-8902-9bedd514c3cd using aufs backend
Jul 18 13:30:33 otpc103 kernel: [673973.912396] general protection fault:  [#2] SMP
Jul 18 13:30:33 otpc103 kernel: [673973.912403] Modules linked in: veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables nf_nat nf_conntrack br_netfilter bridge stp llc aufs nfsv3 nfs_acl rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache nvidia_uvm(POE)
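
To pick the relevant entries out of a log like the one above, here is
a rough sketch. It assumes each log entry sits on a single line (mail
clients often hard-wrap them), and the regular expressions are modelled
only on the two message formats seen in this log, so they may need
adjusting for other Mesos or kernel versions. The function name is
hypothetical.

```python
import re

# Patterns modelled on the lines in this thread; not an official format.
BACKEND_RE = re.compile(
    r"provisioner\.cpp:\d+\] Provisioning image rootfs .* using (\w+) backend"
)
FAULT_RE = re.compile(r"kernel: \[\s*[\d.]+\] general protection fault")


def summarize(log_lines):
    """Return (backend, fault_seen) for an iterable of unwrapped log lines.

    backend is the provisioner backend from the last provisioning entry,
    or None. fault_seen is True if a kernel general protection fault was
    logged after that entry.
    """
    backend = None
    fault_seen = False
    for line in log_lines:
        match = BACKEND_RE.search(line)
        if match:
            backend = match.group(1)
            fault_seen = False  # only count faults after this entry
        elif backend is not None and FAULT_RE.search(line):
            fault_seen = True
    return backend, fault_seen
```

A fault that follows a provisioning entry only shows that the two are
close in time. It hints at the backend but does not prove it is the
cause.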

Agent Working Directory Best Practices

2017-06-22 Thread thomas.kurmann
Hi,

We have a couple of server nodes mainly used for computational tasks in
our Mesos cluster. These servers have beefy CPUs, GPUs, etc. but only
limited SSD space. We also have a 40GbE network and a decently fast
file server.

My question is simple, but I didn't find an answer anywhere: What are
the best practices for the working directory on mesos-agent nodes?
Should we keep the working directory local, or is it reasonable to use
an NFS-mounted folder? We implemented both and they seem to work fine,
but I would rather follow "best practices".

Thanks and cheers

Tom
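
When comparing the two setups described above, it can help to confirm
which filesystem a given working directory actually lives on. The
sketch below does that by parsing the text of /proc/mounts on a Linux
host. The function names and the example path are hypothetical, and it
ignores the octal escapes /proc/mounts uses for spaces in mount points.

```python
def fs_type_for(path, proc_mounts_text):
    """Return the filesystem type of the mount that holds `path`.

    proc_mounts_text is the content of /proc/mounts. The longest
    matching mount point wins; on a tie the later entry wins, since it
    shadows the earlier one. Returns None if nothing matches.
    """
    path = path.rstrip("/") or "/"
    best_mount, best_type = "", None
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip malformed lines
        mount_point, fs_type = fields[1], fields[2]
        prefix = mount_point.rstrip("/") + "/"
        # Compare with trailing slashes so /a/b does not match /a/bc.
        if (path + "/").startswith(prefix) and len(mount_point) >= len(best_mount):
            best_mount, best_type = mount_point, fs_type
    return best_type


def is_nfs(path, proc_mounts_text):
    """True if `path` is on an NFS mount (type nfs, nfs4, ...)."""
    fs_type = fs_type_for(path, proc_mounts_text)
    return fs_type is not None and fs_type.startswith("nfs")


# Usage on an agent node (Linux only), with a hypothetical work dir:
#   with open("/proc/mounts") as f:
#       print(is_nfs("/var/lib/mesos/agent", f.read()))
```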