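Found this in the log:
using aufs backend

The general protection fault happens inside the aufs mount path
(aufs_mount -> aufs_fill_super -> au_opts_parse in the call trace), so
how about switching the agent's provisioner backend to overlay?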
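A minimal sketch of that switch, assuming the agent's `--image_provisioner_backend` flag (Mesos 1.x accepts `copy`, `bind`, `aufs` and `overlay`). It checks that the running kernel lists overlayfs and otherwise falls back to `copy`, which does not mount a union filesystem at all. The `pick_backend` helper is a hypothetical name, not part of Mesos.

```shell
#!/bin/sh
# Sketch (hypothetical helper): choose a provisioner backend that avoids aufs.
# Assumption: overlayfs appears as "overlay" in /proc/filesystems once the
# module is loaded (e.g. after `modprobe overlay`).
pick_backend() {
  # Optional argument lets you point at a different file when testing.
  fs_list="${1:-/proc/filesystems}"
  if grep -qw overlay "$fs_list" 2>/dev/null; then
    echo overlay
  else
    # copy avoids union mounts entirely, at the cost of disk space and
    # provisioning time for multi-GB images.
    echo copy
  fi
}

# Print the flag to append to the mesos-agent command line.
printf '%s\n' "--image_provisioner_backend=$(pick_backend)"
```

Append the printed flag to the existing mesos-agent command and restart the agent. If overlay is not listed, loading the module first (`modprobe overlay`) may be enough on this Ubuntu 4.4 kernel, but that is an assumption.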

2017-07-18 19:49 GMT+08:00 <thomas.kurm...@artorg.unibe.ch>:

> Hi,
>
> We are experiencing a bug in the Mesos agent (1.3.0) when trying to
> start large Docker images inside a Mesos container. I have tried
> images of several sizes, and the threshold seems to lie somewhere
> around 4.5 GB; smaller images start fine. We have hit this bug with
> both a custom framework (deep-mesos) and Marathon. Here is the log
> from a failing launch:
>
> Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.784018 30042 master.cpp:9320] Adding task git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0 with resources cpus(*)(allocated: *):4; mem(*)(allocated: *):25000; gpus(*)(allocated: *):1; ports(*)(allocated: *):[31000-31000] on agent 816e697d-62d2-465a-bf7c-7b79901e07a3-S4 at slave(1)@130.92.124.103:5051 (otpc103.unibe.ch)
> Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.784235 30042 master.cpp:4531] Launching task git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0 of framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014 (Deep Mesos) with resources cpus(*)(allocated: *):4; mem(*)(allocated: *):25000; gpus(*)(allocated: *):1; ports(*)(allocated: *):[31000-31000] on agent 816e697d-62d2-465a-bf7c-7b79901e07a3-S4 at slave(1)@130.92.124.103:5051 (otpc103.unibe.ch)
> Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.785534 30023 slave.cpp:1613] Got assigned task 'git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0' for framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014
> Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.786010 30038 hierarchical.cpp:850] Updated allocation of framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014 on agent 816e697d-62d2-465a-bf7c-7b79901e07a3-S4 from gpus(*)(allocated: *):1; cpus(*)(allocated: *):8; mem(*)(allocated: *):31099; disk(*)(allocated: *):56156; ports(*)(allocated: *):[31000-32000] to gpus(*)(allocated: *):1; cpus(*)(allocated: *):8; mem(*)(allocated: *):31099; disk(*)(allocated: *):56156; ports(*)(allocated: *):[31000-32000]
> Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.786223 30023 gc.cpp:83] Unscheduling '/var/lib/mesos/agent/slaves/816e697d-62d2-465a-bf7c-7b79901e07a3-S4/frameworks/c7161dd3-0bbc-4032-92c2-5477082d2c08-0014' from gc
> Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.786487 30023 slave.cpp:1894] Authorizing task 'git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0' for framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014
> Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.787127 30029 slave.cpp:2081] Launching task 'git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0' for framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014
> Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.789391 30029 paths.cpp:573] Trying to chown '/var/lib/mesos/agent/slaves/816e697d-62d2-465a-bf7c-7b79901e07a3-S4/frameworks/c7161dd3-0bbc-4032-92c2-5477082d2c08-0014/executors/git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0/runs/c2343739-4252-4778-8902-9bedd514c3cd' to user 'root'
> Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.789891 30029 slave.cpp:6933] Launching executor 'git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0' of framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014 with resources cpus(*)(allocated: *):0.1; mem(*)(allocated: *):32 in work directory '/var/lib/mesos/agent/slaves/816e697d-62d2-465a-bf7c-7b79901e07a3-S4/frameworks/c7161dd3-0bbc-4032-92c2-5477082d2c08-0014/executors/git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0/runs/c2343739-4252-4778-8902-9bedd514c3cd'
> Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.790630 30029 slave.cpp:2310] Queued task 'git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0' for executor 'git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0' of framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014
> Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.790971 30022 docker.cpp:1148] Skipping non-docker container
> Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.791677 30028 containerizer.cpp:1001] Starting container c2343739-4252-4778-8902-9bedd514c3cd for executor 'git-default.033d2193-0c3c-4878-a63c-6bbfb24df6e0-O0' of framework c7161dd3-0bbc-4032-92c2-5477082d2c08-0014
> Jul 18 13:30:33 otpc103 rc.local[29950]: I0718 13:30:33.799257 30028 provisioner.cpp:453] Provisioning image rootfs '/var/lib/mesos/agent/provisioner/containers/c2343739-4252-4778-8902-9bedd514c3cd/backends/aufs/rootfses/2eed6b86-66f1-46a0-9fc3-1c8b22bff399' for container c2343739-4252-4778-8902-9bedd514c3cd using aufs backend
> Jul 18 13:30:33 otpc103 kernel: [673973.912396] general protection fault: 0000 [#2] SMP
> Jul 18 13:30:33 otpc103 kernel: [673973.912403] Modules linked in: veth
> ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink
> xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4
> nf_nat_ipv4 xt_addrtype iptable_filter ip_tables xt_conntrack x_tables
> nf_nat nf_conntrack br_netfilter bridge stp llc aufs nfsv3 nfs_acl
> rpcsec_gss_krb5 auth_rpcgss nfsv4 nfs lockd grace fscache
> nvidia_uvm(POE) nls_iso8859_1 snd_hda_codec_hdmi nvidia_drm(POE)
> nvidia_modeset(POE) nvidia(POE) intel_rapl x86_pkg_temp_thermal
> intel_powerclamp kvm_intel kvm snd_hda_codec_realtek irqbypass
> crct10dif_pclmul snd_hda_codec_generic crc32_pclmul ghash_clmulni_intel
> snd_soc_rt5640 aesni_intel snd_soc_rl6231 snd_hda_intel aes_x86_64
> drm_kms_helper snd_soc_ssm4567 lrw snd_hda_codec gf128mul snd_soc_core
> glue_helper ablk_helper drm cryptd snd_hda_core snd_compress ac97_bus
> snd_hwdep snd_pcm_dmaengine serio_raw snd_pcm fb_sys_fops syscopyarea
> mei_me mei lpc_ich sysfillrect snd_seq_midi snd_seq_midi_event
> sysimgblt snd_rawmidi snd_seq snd_seq_device 8250_fintek snd_timer snd
> elan_i2c shpchp soundcore dw_dmac snd_soc_sst_acpi
> i2c_designware_platform dw_dmac_core i2c_designware_core 8250_dw
> spi_pxa2xx_platform mac_hid intel_smartconnect acpi_pad coretemp sunrpc
> parport_pc ppdev lp parport autofs4 mxm_wmi psmouse e1000e ahci libahci
> ptp pps_core wmi sdhci_acpi video sdhci i2c_hid hid fjes
> Jul 18 13:30:33 otpc103 kernel: [673973.912521] CPU: 4 PID: 30029 Comm: mesos-agent Tainted: P      D    OE   4.4.0-57-generic #78-Ubuntu
> Jul 18 13:30:33 otpc103 kernel: [673973.912525] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z97 Extreme4, BIOS P1.30 05/23/2014
> Jul 18 13:30:33 otpc103 kernel: [673973.912529] task: ffff8807f6688e00 ti: ffff8807e1b08000 task.ti: ffff8807e1b08000
> Jul 18 13:30:33 otpc103 kernel: [673973.912532] RIP: 0010:[<ffffffff81225983>]  [<ffffffff81225983>] dput+0x23/0x220
> Jul 18 13:30:33 otpc103 kernel: [673973.912543] RSP: 0018:ffff8807e1b0bc00  EFLAGS: 00010246
> Jul 18 13:30:33 otpc103 kernel: [673973.912545] RAX: 0000000000000000 RBX: 6b7365642e74756c RCX: 0000002b00000000
> Jul 18 13:30:33 otpc103 kernel: [673973.912548] RDX: 0000000080000000 RSI: ffff88081ed1a080 RDI: 6b7365642e74756c
> Jul 18 13:30:33 otpc103 kernel: [673973.912550] RBP: ffff8807e1b0bc28 R08: 000000000001a080 R09: ffffffffc077a9f5
> Jul 18 13:30:33 otpc103 kernel: [673973.912552] R10: ffffea000349f300 R11: ffff8800ddc9d000 R12: 6b7365642e7475c4
> Jul 18 13:30:33 otpc103 kernel: [673973.912555] R13: ffff8807e1b0bd18 R14: 0000000000000055 R15: 00000000fffffff9
> Jul 18 13:30:33 otpc103 kernel: [673973.912559] FS:  00007fe9f8a16700(0000) GS:ffff88081ed00000(0000) knlGS:0000000000000000
> Jul 18 13:30:33 otpc103 kernel: [673973.912562] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Jul 18 13:30:33 otpc103 kernel: [673973.912564] CR2: 00007f0210007028 CR3: 00000007e1ca0000 CR4: 00000000001406e0
> Jul 18 13:30:33 otpc103 kernel: [673973.912567] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> Jul 18 13:30:33 otpc103 kernel: [673973.912569] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> Jul 18 13:30:33 otpc103 kernel: [673973.912571] Stack:
> Jul 18 13:30:33 otpc103 kernel: [673973.912573]  ffff8807e08cd060 ffff8800d27cce40 ffff8807e1b0bd18 0000000000000055
> Jul 18 13:30:33 otpc103 kernel: [673973.912578]  00000000fffffff9 ffff8807e1b0bc40 ffffffff812185a6 ffff8807e08cd050
> Jul 18 13:30:33 otpc103 kernel: [673973.912583]  ffff8807e1b0bc58 ffffffffc077a8ae ffff8807e08bfff0 ffff8807e1b0bcf8
> Jul 18 13:30:33 otpc103 kernel: [673973.912587] Call Trace:
> Jul 18 13:30:33 otpc103 kernel: [673973.912597]  [<ffffffff812185a6>] path_put+0x16/0x30
> Jul 18 13:30:33 otpc103 kernel: [673973.912613]  [<ffffffffc077a8ae>] au_opts_free+0x4e/0x60 [aufs]
> Jul 18 13:30:33 otpc103 kernel: [673973.912625]  [<ffffffffc077a9fd>] au_opts_parse+0x13d/0x9a0 [aufs]
> Jul 18 13:30:33 otpc103 kernel: [673973.912632]  [<ffffffff811ee2b8>] ? __kmalloc+0x208/0x250
> Jul 18 13:30:33 otpc103 kernel: [673973.912646]  [<ffffffffc0782721>] ? au_di_alloc+0x61/0xc0 [aufs]
> Jul 18 13:30:33 otpc103 kernel: [673973.912656]  [<ffffffffc0773ba8>] aufs_fill_super+0x1a8/0x3c0 [aufs]
> Jul 18 13:30:33 otpc103 kernel: [673973.912665]  [<ffffffffc0773a00>] ? au_iget_locked+0x80/0x80 [aufs]
> Jul 18 13:30:33 otpc103 kernel: [673973.912670]  [<ffffffff81211aed>] mount_nodev+0x4d/0xa0
> Jul 18 13:30:33 otpc103 kernel: [673973.912677]  [<ffffffff811e255c>] ? alloc_pages_current+0x8c/0x110
> Jul 18 13:30:33 otpc103 kernel: [673973.912686]  [<ffffffffc0772ecd>] aufs_mount+0x1d/0xe0 [aufs]
> Jul 18 13:30:33 otpc103 kernel: [673973.912690]  [<ffffffff812127d8>] mount_fs+0x38/0x160
> Jul 18 13:30:33 otpc103 kernel: [673973.912696]  [<ffffffff8122e877>] vfs_kern_mount+0x67/0x110
> Jul 18 13:30:33 otpc103 kernel: [673973.912701]  [<ffffffff81231169>] do_mount+0x269/0xde0
> Jul 18 13:30:33 otpc103 kernel: [673973.912706]  [<ffffffff8123201f>] SyS_mount+0x9f/0x100
> Jul 18 13:30:33 otpc103 kernel: [673973.912713]  [<ffffffff818374f2>] entry_SYSCALL_64_fastpath+0x16/0x71
> Jul 18 13:30:33 otpc103 kernel: [673973.912715] Code: ff ff 66 0f 1f 44 00 00 0f 1f 44 00 00 48 85 ff 74 53 55 48 89 e5 41 57 41 56 41 55 41 54 4c 8d 67 58 53 48 89 fb e8 8d dc 60 00 <f6> 03 08 4c 89 e7 0f 85 84 00 00 00 e8 5c fd 1d 00 85 c0 0f 88
> Jul 18 13:30:33 otpc103 kernel: [673973.912766] RIP  [<ffffffff81225983>] dput+0x23/0x220
> Jul 18 13:30:33 otpc103 kernel: [673973.912771]  RSP <ffff8807e1b0bc00>
> Jul 18 13:30:33 otpc103 kernel: [673973.912776] ---[ end trace 2e255b1cc53ddbcc ]---
>
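To confirm which backend the agent actually used for each container (for example after changing the flag), one option is to filter the agent log for the provisioner line quoted above. This is a sketch that assumes one unwrapped log entry per line, as in the syslog output rather than the wrapped email copy; `backend_per_container` is a hypothetical name.

```shell
#!/bin/sh
# Sketch: print "<container-id> <backend>" for every provisioner log line of
# the form "... for container <id> using <backend> backend".
# Reads agent log text on stdin, one log entry per line.
backend_per_container() {
  sed -n 's/.*for container \([0-9a-f-]*\) using \([a-z]*\) backend.*/\1 \2/p'
}
```

With the agent started from rc.local as here, and assuming the journal is forwarded to /var/log/syslog as on a default Ubuntu install, `grep provisioner.cpp /var/log/syslog | backend_per_container` should print `c2343739-4252-4778-8902-9bedd514c3cd aufs` for the failing launch.
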
> The agent / master / zookeeper are started with:
>
> service zookeeper start
>
> mesos-master --zk=zk://localhost:2181/mesos --work_dir=/var/lib/mesos \
>   --quorum=1 --log_dir=/var/log/mesos --cluster=TomDev
>
> mesos-agent --master=otpc103.unibe.ch:5050 \
>   --work_dir=/var/lib/mesos/agent \
>   --image_providers=docker \
>   --executor_environment_variables="{}" \
>   --isolation="docker/runtime,filesystem/linux,cgroups/devices,gpu/nvidia,cgroups/cpu,cgroups/mem,namespaces/pid" \
>   --containerizers=mesos,docker \
>   --nvidia_gpu_devices="0," \
>   --resources="gpus:1" \
>   --executor_registration_timeout=5mins
>
>
> To reproduce the error, try starting a Mesos container with this
> Docker image (note that your agent may crash, as mine does):
> jgrossrieder/otl-keras-scipy-opencv
>
>
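To rule out the framework, a reproduction through `mesos-execute` (the command-line scheduler shipped with Mesos) might look like the sketch below. It assumes `mesos-execute` is on the PATH and that the crash does not depend on the GPU resources, since the fault happens while the rootfs is being mounted. As Tom notes, running it may crash the agent. The `run_repro` function, task name and command are hypothetical.

```shell
#!/bin/sh
# Hypothetical reproduction without a custom framework: ask mesos-execute to
# run a trivial command inside the large image via the Mesos containerizer.
# MASTER defaults to the master from this thread; override via environment.
MASTER="${MASTER:-otpc103.unibe.ch:5050}"
IMAGE="${IMAGE:-jgrossrieder/otl-keras-scipy-opencv}"

run_repro() {
  if ! command -v mesos-execute >/dev/null 2>&1; then
    echo "mesos-execute not found; run this on a host with Mesos installed" >&2
    return 1
  fi
  # --containerizer=mesos is the default; spelled out for clarity.
  mesos-execute \
    --master="$MASTER" \
    --name=large-image-repro \
    --containerizer=mesos \
    --docker_image="$IMAGE" \
    --command="echo rootfs provisioned"
}
```

Source the file and call `run_repro`; nothing is launched until the function is invoked.
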
> Here is the accept frame:
> {"accept": {
>   "offer_ids": [{"value": "5869827a-c328-4fdf-99b1-f73e816628c9-O0"}],
>   "filters": {"refuse_seconds": 5.0},
>   "operations": [{"launch": {"task_infos": [{
>     "name": "git-default",
>     "command": {
>       "arguments": [],
>       "shell": false,
>       "environment": {"variables": [
>         {"name": "LD_LIBRARY_PATH", "value": "/usr/local/nvidia/lib64"},
>         {"name": "PATH", "value": "/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"},
>         {"name": "PORT_22", "value": "31000"}]},
>       "user": "root",
>       "value": null},
>     "agent_id": {"value": "816e697d-62d2-465a-bf7c-7b79901e07a3-S4"},
>     "resources": [
>       {"allocation_info": {"role": "*"}, "name": "cpus", "role": "*", "scalar": {"value": 4.0}, "type": "SCALAR"},
>       {"allocation_info": {"role": "*"}, "name": "mem", "role": "*", "scalar": {"value": 25000.0}, "type": "SCALAR"},
>       {"allocation_info": {"role": "*"}, "name": "gpus", "role": "*", "scalar": {"value": 1.0}, "type": "SCALAR"},
>       {"allocation_info": {"role": "*"}, "name": "ports", "role": "*", "ranges": {"range": [{"end": 31000, "begin": 31000}]}, "type": "RANGES"}],
>     "task_id": {"value": "git-default.5869827a-c328-4fdf-99b1-f73e816628c9-O0"},
>     "container": {
>       "network_infos": {"port_mappings": [{"host_port": 31000, "container_port": 22}]},
>       "mesos": {"image": {"docker": {"name": "jgrossrieder/otl-keras-scipy-opencv"}, "type": "DOCKER"}},
>       "type": "MESOS"}}]},
>     "type": "LAUNCH"}]},
>  "framework_id": {"value": "c7161dd3-0bbc-4032-92c2-5477082d2c08-0014"},
>  "type": "ACCEPT"}
>
>
>
> Is anyone aware of this bug or a possible workaround?
>
> Thanks,
>
> Tom




-- 
Deshi Xiao
Twitter: xds2000
E-mail: xiaods(AT)gmail.com