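Found it in the log: the provisioner is using the aufs backend, and the kernel call trace faults inside aufs while mounting (au_opts_parse / au_opts_free). How about changing the provisioner backend filesystem to overlay?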
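A minimal sketch of what that change could look like, reusing the agent flags from the report quoted below. The `--image_provisioner_backend` agent flag selects the rootfs provisioning backend; the `grep`/`modprobe` check assumes overlay is built as a loadable module on this kernel, which I have not verified for this host. I also cannot confirm that this avoids the crash, only that it keeps the aufs mount path out of the picture.

```shell
# Check that the running kernel offers overlayfs; try loading the module if not.
# (Assumption: overlay is available as a module on this kernel.)
grep -w overlay /proc/filesystems || sudo modprobe overlay

# Same agent invocation as in the report, with the provisioner backend
# set explicitly instead of letting the agent pick aufs.
mesos-agent --master=otpc103.unibe.ch:5050 \
  --work_dir=/var/lib/mesos/agent \
  --image_providers=docker \
  --image_provisioner_backend=overlay \
  --executor_environment_variables="{}" \
  --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia,cgroups/cpu,cgroups/mem,namespaces/pid" \
  --containerizers=mesos,docker \
  --nvidia_gpu_devices="0," \
  --resources="gpus:1" \
  --executor_registration_timeout=5mins
```

If the change takes effect, the provisioner.cpp log line should report the overlay backend instead of "using aufs backend".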
2017-07-18 19:49 GMT+08:00 <thomas.kurm...@artorg.unibe.ch>: > Hi, > > We are experiencing a bug on the mesos agent (1.3.0) when trying to > start large docker images inside a mesos container. I have tried with > multiple sizes of images and the threshold seems to lie somewhere > around 4.5 GB. We have experienced this bug using both a custom > framework (deep-mesos) and marathon. Here is a log of what is happening > with the agent. This is not happening on smaller images. > > Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.784018 30042 > master.cpp:9320] Adding task git-default.033d2193-0c3c-4878-a63c- > 6bbfb24df6e0-O0 with resources cpus(*)(allocated: *):4; > mem(*)(allocated: *):25000; gpus(*)(allocated: *):1; > ports(*)(allocated: *):[31000-31000] on agent 816e697d-62d2-465a-bf7c- > 7b79901e07a3-S4 at slave(1)@130.92.124.103:5051 (otpc103.unibe.ch) > Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.784235 30042 > master.cpp:4531] Launching task git-default.033d2193-0c3c-4878-a63c- > 6bbfb24df6e0-O0 of framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014 > (Deep Mesos) with resources cpus(*)(allocated: *):4; mem(*)(allocated: > *):25000; gpus(*)(allocated: *):1; ports(*)(allocated: *):[31000-31000] > on agent 816e697d-62d2-465a-bf7c-7b79901e07a3-S4 at > slave(1)@130.92.124.103:5051 (otpc103.unibe.ch) > Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.785534 30023 > slave.cpp:1613] Got assigned task 'git-default.033d2193-0c3c-4878-a63c- > 6bbfb24df6e0-O0' for framework c7161dd3-0bbc-4032-92c2-5477082d2c08- > 0014 > Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.786010 30038 > hierarchical.cpp:850] Updated allocation of framework c7161dd3-0bbc- > 4032-92c2-5477082d2c08-0014 on agent 816e697d-62d2-465a-bf7c- > 7b79901e07a3-S4 from gpus(*)(allocated: *):1; cpus(*)(allocated: *):8; > mem(*)(allocated: *):31099; disk(*)(allocated: *):56156; > ports(*)(allocated: *):[31000-32000] to gpus(*)(allocated: *):1; > cpus(*)(allocated: *):8; 
mem(*)(allocated: *):31099; disk(*)(allocated: > *):56156; ports(*)(allocated: *):[31000-32000] > Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.786223 30023 > gc.cpp:83] Unscheduling '/var/lib/mesos/agent/slaves/816e697d-62d2- > 465a-bf7c-7b79901e07a3-S4/frameworks/c7161dd3-0bbc-4032-92c2- > 5477082d2c08-0014' from gc > Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.786487 30023 > slave.cpp:1894] Authorizing task 'git-default.033d2193-0c3c-4878-a63c- > 6bbfb24df6e0-O0' for framework c7161dd3-0bbc-4032-92c2-5477082d2c08- > 0014 > Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.787127 30029 > slave.cpp:2081] Launching task 'git-default.033d2193-0c3c-4878-a63c- > 6bbfb24df6e0-O0' for framework c7161dd3-0bbc-4032-92c2-5477082d2c08- > 0014 > Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.789391 30029 > paths.cpp:573] Trying to chown '/var/lib/mesos/agent/slaves/816e697d- > 62d2-465a-bf7c-7b79901e07a3-S4/frameworks/c7161dd3-0bbc-4032-92c2- > 5477082d2c08-0014/executors/git-default.033d2193-0c3c-4878-a63c- > 6bbfb24df6e0-O0/runs/c2343739-4252-4778-8902-9bedd514c3cd' to user > 'root' > Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.789891 30029 > slave.cpp:6933] Launching executor 'git-default.033d2193-0c3c-4878- > a63c-6bbfb24df6e0-O0' of framework c7161dd3-0bbc-4032-92c2- > 5477082d2c08-0014 with resources cpus(*)(allocated: *):0.1; > mem(*)(allocated: *):32 in work directory > '/var/lib/mesos/agent/slaves/816e697d-62d2-465a-bf7c-7b79901e07a3- > S4/frameworks/c7161dd3-0bbc-4032-92c2-5477082d2c08-0014/executors/git- > default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0/runs/c2343739-4252- > 4778-8902-9bedd514c3cd' > Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.790630 30029 > slave.cpp:2310] Queued task 'git-default.033d2193-0c3c-4878-a63c- > 6bbfb24df6e0-O0' for executor 'git-default.033d2193-0c3c-4878-a63c- > 6bbfb24df6e0-O0' of framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014 > Jul 18 13:30:33 otpc103 
rc.local[29950]: I0718 13:30:33.790971 30022 > docker.cpp:1148] Skipping non-docker container > Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.791677 30028 > containerizer.cpp:1001] Starting container c2343739-4252-4778-8902- > 9bedd514c3cd for executor 'git-default.033d2193-0c3c-4878-a63c- > 6bbfb24df6e0-O0' of framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014 > Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.799257 30028 > provisioner.cpp:453] Provisioning image rootfs > '/var/lib/mesos/agent/provisioner/containers/c2343739-4252-4778-8902- > 9bedd514c3cd/backends/aufs/rootfses/2eed6b86-66f1-46a0-9fc3- > 1c8b22bff399' for container c2343739-4252-4778-8902-9bedd514c3cd using > aufs backend > Jul 18 13:30:33 otpc103 kernel: [673973.912396] general protection > fault: 0000 [#2] SMP > Jul 18 13:30:33 otpc103 kernel: [673973.912403] Modules linked in: veth > ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink > xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 > nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables > nf_nat nf_conntrack br_netfilter bridge stp llc aufs nfsv3 nfs_acl > rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache > nvidia_uvm(POE) nls_iso8859_1 snd_hda_codec_hdmi nvidia_drm(POE) > nvidia_modeset(POE) nvidia(POE) intel_rapl x86_pkg_temp_thermal > intel_powerclamp kvm_intel kvm snd_hda_codec_realtek irqbypass > crct10dif_pclmul snd_hda_codec_generic crc32_pclmul ghash_clmulni_intel > snd_soc_rt5640 aesni_intel snd_soc_rl6231 snd_hda_intel aes_x86_64 > drm_kms_helper snd_soc_ssm4567 lrw snd_hda_codec gf128mul snd_soc_core > glue_helper ablk_helper drm cryptd snd_hda_core snd_compress ac97_bus > snd_hwdep snd_pcm_dmaengine serio_raw snd_pcm fb_sys_fops syscopyarea > mei_me mei lpc_ich sysfillrect snd_seq_midi snd_seq_midi_event > sysimgblt snd_rawmidi snd_seq snd_seq_device 8250_fintek snd_timer snd > elan_i2c shpchp soundcore dw_dmac snd_soc_sst_acpi > i2c_designware_platform 
dw_dmac_core i2c_designware_core 8250_dw > spi_pxa2xx_platform mac_hid intel_smartconnect acpi_pad coretemp sunrpc > parport_pc ppdev lp parport autofs4 mxm_wmi psmouse e1000e ahci libahci > ptp pps_core wmi sdhci_acpi video sdhci i2c_hid hid fjes > Jul 18 13:30:33 otpc103 kernel: [673973.912521] CPU: 4 PID: 30029 Comm: > mesos-agent Tainted: P D OE 4.4.0-57-generic #78-Ubuntu > Jul 18 13:30:33 otpc103 kernel: [673973.912525] Hardware name: To Be > Filled By O.E.M. To Be Filled By O.E.M./Z97 Extreme4, BIOS P1.30 > 05/23/2014 > Jul 18 13:30:33 otpc103 kernel: [673973.912529] task: ffff8807f6688e00 > ti: ffff8807e1b08000 task.ti: ffff8807e1b08000 > Jul 18 13:30:33 otpc103 kernel: [673973.912532] RIP: > 0010:[<ffffffff81225983>] [<ffffffff81225983>] dput+0x23/0x220 > Jul 18 13:30:33 otpc103 kernel: [673973.912543] RSP: > 0018:ffff8807e1b0bc00 EFLAGS: 00010246 > Jul 18 13:30:33 otpc103 kernel: [673973.912545] RAX: 0000000000000000 > RBX: 6b7365642e74756c RCX: 0000002b00000000 > Jul 18 13:30:33 otpc103 kernel: [673973.912548] RDX: 0000000080000000 > RSI: ffff88081ed1a080 RDI: 6b7365642e74756c > Jul 18 13:30:33 otpc103 kernel: [673973.912550] RBP: ffff8807e1b0bc28 > R08: 000000000001a080 R09: ffffffffc077a9f5 > Jul 18 13:30:33 otpc103 kernel: [673973.912552] R10: ffffea000349f300 > R11: ffff8800ddc9d000 R12: 6b7365642e7475c4 > Jul 18 13:30:33 otpc103 kernel: [673973.912555] R13: ffff8807e1b0bd18 > R14: 0000000000000055 R15: 00000000fffffff9 > Jul 18 13:30:33 otpc103 kernel: [673973.912559] > FS: 00007fe9f8a16700(0000) GS:ffff88081ed00000(0000) > knlGS:0000000000000000 > Jul 18 13:30:33 otpc103 kernel: [673973.912562] CS: 0010 DS: 0000 ES: > 0000 CR0: 0000000080050033 > Jul 18 13:30:33 otpc103 kernel: [673973.912564] CR2: 00007f0210007028 > CR3: 00000007e1ca0000 CR4: 00000000001406e0 > Jul 18 13:30:33 otpc103 kernel: [673973.912567] DR0: 0000000000000000 > DR1: 0000000000000000 DR2: 0000000000000000 > Jul 18 13:30:33 otpc103 kernel: [673973.912569] DR3: 0000000000000000 > 
DR6: 00000000fffe0ff0 DR7: 0000000000000400 > Jul 18 13:30:33 otpc103 kernel: [673973.912571] Stack: > Jul 18 13:30:33 otpc103 kernel: [673973.912573] ffff8807e08cd060 > ffff8800d27cce40 ffff8807e1b0bd18 0000000000000055 > Jul 18 13:30:33 otpc103 kernel: [673973.912578] 00000000fffffff9 > ffff8807e1b0bc40 ffffffff812185a6 ffff8807e08cd050 > Jul 18 13:30:33 otpc103 kernel: [673973.912583] ffff8807e1b0bc58 > ffffffffc077a8ae ffff8807e08bfff0 ffff8807e1b0bcf8 > Jul 18 13:30:33 otpc103 kernel: [673973.912587] Call Trace: > Jul 18 13:30:33 otpc103 kernel: [673973.912597] [<ffffffff812185a6>] > path_put+0x16/0x30 > Jul 18 13:30:33 otpc103 kernel: [673973.912613] [<ffffffffc077a8ae>] > au_opts_free+0x4e/0x60 [aufs] > Jul 18 13:30:33 otpc103 kernel: [673973.912625] [<ffffffffc077a9fd>] > au_opts_parse+0x13d/0x9a0 [aufs] > Jul 18 13:30:33 otpc103 kernel: [673973.912632] [<ffffffff811ee2b8>] ? > __kmalloc+0x208/0x250 > Jul 18 13:30:33 otpc103 kernel: [673973.912646] [<ffffffffc0782721>] ? > au_di_alloc+0x61/0xc0 [aufs] > Jul 18 13:30:33 otpc103 kernel: [673973.912656] [<ffffffffc0773ba8>] > aufs_fill_super+0x1a8/0x3c0 [aufs] > Jul 18 13:30:33 otpc103 kernel: [673973.912665] [<ffffffffc0773a00>] ? > au_iget_locked+0x80/0x80 [aufs] > Jul 18 13:30:33 otpc103 kernel: [673973.912670] [<ffffffff81211aed>] > mount_nodev+0x4d/0xa0 > Jul 18 13:30:33 otpc103 kernel: [673973.912677] [<ffffffff811e255c>] ? 
> alloc_pages_current+0x8c/0x110 > Jul 18 13:30:33 otpc103 kernel: [673973.912686] [<ffffffffc0772ecd>] > aufs_mount+0x1d/0xe0 [aufs] > Jul 18 13:30:33 otpc103 kernel: [673973.912690] [<ffffffff812127d8>] > mount_fs+0x38/0x160 > Jul 18 13:30:33 otpc103 kernel: [673973.912696] [<ffffffff8122e877>] > vfs_kern_mount+0x67/0x110 > Jul 18 13:30:33 otpc103 kernel: [673973.912701] [<ffffffff81231169>] > do_mount+0x269/0xde0 > Jul 18 13:30:33 otpc103 kernel: [673973.912706] [<ffffffff8123201f>] > SyS_mount+0x9f/0x100 > Jul 18 13:30:33 otpc103 kernel: [673973.912713] [<ffffffff818374f2>] > entry_SYSCALL_64_fastpath+0x16/0x71 > Jul 18 13:30:33 otpc103 kernel: [673973.912715] Code: ff ff 66 0f 1f 44 > 00 00 0f 1f 44 00 00 48 85 ff 74 53 55 48 89 e5 41 57 41 56 41 55 41 54 > 4c 8d 67 58 53 48 89 fb e8 8d dc 60 00 <f6> 03 08 4c 89 e7 0f 85 84 00 > 00 00 e8 5c fd 1d 00 85 c0 0f 88 > Jul 18 13:30:33 otpc103 kernel: [673973.912766] > RIP [<ffffffff81225983>] dput+0x23/0x220 > Jul 18 13:30:33 otpc103 kernel: [673973.912771] RSP <ffff8807e1b0bc00> > Jul 18 13:30:33 otpc103 kernel: [673973.912776] ---[ end trace > 2e255b1cc53ddbcc ]--- > > The agent / master / zookeeper are started with: > > service zookeeper start > > mesos-master --zk=zk://localhost:2181/mesos --work_dir=/var/lib/mesos$ > --quorum=1 --log_dir=/var/log/mesos --cluster=TomDev > > mesos-agent --master=otpc103.unibe.ch:5050 -- > work_dir=/var/lib/mesos/agent --image_providers=docker -- > executor_environment_variables="{}" -- > isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia,c > groups/cpu,cgroups/mem,namespaces/pid" --containerizers=mesos,docker -- > nvidia_gpu_devices="0," -- > resources="gpus:1" --executor_registration_timeout=5mins > > > To replicate the error you can try to start a mesos container with this > docker image (note your agent may crash as mine does) > jgrossrieder/otl-keras-scipy-opencv > > > Here is the accept frame: > {"accept": {"offer_ids": [{"value": "5869827a-c328-4fdf-99b1- 
> f73e816628c9-O0"}], "filters": {"refuse_seconds": 5.0}, "operations":
> [{"launch": {"task_infos": [{"name": "git-default", "command":
> {"arguments": [], "shell": false, "environment": {"variables":
> [{"name": "LD_LIBRARY_PATH", "value": "/usr/local/nvidia/lib64"},
> {"name": "PATH", "value":
> "/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/b
> in:/usr/sbin:/usr/bin:/sbin:/bin"}, {"name": "PORT_22", "value":
> "31000"}]}, "user": "root", "value": null}, "agent_id": {"value":
> "816e697d-62d2-465a-bf7c-7b79901e07a3-S4"}, "resources":
> [{"allocation_info": {"role": "*"}, "name": "cpus", "role": "*",
> "scalar": {"value": 4.0}, "type": "SCALAR"}, {"allocation_info":
> {"role": "*"}, "name": "mem", "role": "*", "scalar": {"value":
> 25000.0}, "type": "SCALAR"}, {"allocation_info": {"role": "*"}, "name":
> "gpus", "role": "*", "scalar": {"value": 1.0}, "type": "SCALAR"},
> {"allocation_info": {"role": "*"}, "name": "ports", "role": "*",
> "ranges": {"range": [{"end": 31000, "begin": 31000}]}, "type":
> "RANGES"}], "task_id": {"value": "git-default.5869827a-c328-4fdf-99b1-
> f73e816628c9-O0"}, "container": {"network_infos": {"port_mappings":
> [{"host_port": 31000, "container_port": 22}]}, "mesos": {"image":
> {"docker": {"name": "jgrossrieder/otl-keras-scipy-opencv"}, "type":
> "DOCKER"}}, "type": "MESOS"}}]}, "type": "LAUNCH"}]}, "framework_id":
> {"value": "c7161dd3-0bbc-4032-92c2-5477082d2c08-0014"}, "type":
> "ACCEPT"}
>
> Is anyone aware of this bug or a possible workaround?
>
> Thanks,
>
> Tom

--
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com