[jira] [Commented] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend
[ https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15413374#comment-15413374 ] Michael Thomas commented on MESOS-6004: --- 1. I don't think it's the downloading, as the image was already cached locally. The {{overlay}} image_provisioner_backend does not work for me, as I am missing the required kernel support for it. 2. I updated the logs with GLOG_v=1. Also, I tried with a smaller image (nvidia/cuda, 1.5GB, 21 layers) and this works fine and seems reasonably fast. > Tasks fail when provisioning multiple containers with large docker images > using copy backend > - > > Key: MESOS-6004 > URL: https://issues.apache.org/jira/browse/MESOS-6004 > Project: Mesos > Issue Type: Bug > Components: containerization, docker > Affects Versions: 0.28.2, 1.0.0 > Environment: h4. Agent Platform > - Ubuntu 16.04 > - AWS g2.2xlarge instance > - Nvidia support enabled > h4. Agent Configuration > {noformat} > --containerizers=mesos,docker > --docker_config= > --docker_store_dir=/mnt/mesos/store/docker > --executor_registration_timeout=3mins > --hostname= > --image_providers=docker > --image_provisioner_backend=copy > --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia > --switch_user=false > --work_dir=/mnt/mesos > {noformat} > h4. Framework > - custom framework written in Python > - using the unified containerizer with docker images > h4. Test Setup > * 1 master > * 1 agent > * 5 tasks scheduled at the same time: > ** resources: cpus: 0.1, mem: 128 > ** command: `echo test` > ** docker image: custom docker image, based on nvidia/cuda, ~5GB > ** the same docker image was used for all tasks and had already been pulled. > Reporter: Michael Thomas > Labels: containerizer, docker, performance > > When scheduling more than one task on the same agent, all tasks fail, as > containers seem to be destroyed during provisioning. > Specifically, the errors in the agent logs are: > {noformat} > E0808 15:53:09.691315 30996 slave.cpp:3976] Container > 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework > c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being > destroyed during provisioning > {noformat} > and > {noformat} > I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of > framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not > register within 3mins > {noformat} > As the default provisioning method {{copy}} is being used, I assume this is > due to the provisioning of multiple containers taking too long, while the agent > will not wait. For large images, this method is simply not performant. > The issue did not occur when only one task was scheduled. > Increasing the {{executor_registration_timeout}} parameter seemed to help a > bit, as it allowed scheduling at least 2 tasks at the same time, but it still > fails with more (5 in this case). > h4. 
Complete logs > (with GLOG_v=1) > {noformat} > Aug 9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.800375 > 3738 slave.cpp:198] Agent started on 1)@172.31.23.17:5051 > Aug 9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.800403 > 3738 slave.cpp:199] Flags at startup: > --appc_simple_discovery_uri_prefix="http://; > --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" > --authenticate_http_readwrite="false" --authenticatee="crammd5" > --authentication_backoff_factor="1secs" --authorizer="local" > --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" > --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" > --cgroups_root="mesos" --container_disk_watch_interval="15secs" > --containerizers="mesos,docker" --default_role="*" > --disk_watch_interval="1mins" --docker="docker" --docker_config="XXX" > --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; > --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" > --docker_stop_timeout="0ns" --docker_store_dir="/mnt/t" --docker_volume_checkp > Aug 9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: > oint_dir="/var/run/mesos/isolators/docker/volume" > --enforce_container_disk_quota="false" > --executor_registration_timeout="1mins" > --executor_shutdown_grace_period="5secs" > --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" > --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" > --hadoop_home="" --help="false" > --hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" > --hostname_lookup="true" --http_authenticators="basic" > --http_command_executor="false" --image_providers="docker" > --image_provisioner_backend="copy" --initialize_driver_logging="true" >
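For reference, a quick way to tell whether the agent host can use the {{overlay}} image provisioner backend instead of {{copy}} is to check whether the kernel exposes overlayfs. The sketch below is a minimal, hedged example and not part of Mesos itself; it assumes a Linux host, that the kernel module is named {{overlay}} (as on stock Ubuntu 16.04 kernels), and root privileges for the modprobe fallback.
{noformat}
#!/usr/bin/env python
# Minimal sketch: check whether this host's kernel exposes overlayfs,
# which the Mesos 'overlay' image provisioner backend requires.
# Assumptions: Linux host; module name 'overlay'; root for modprobe.
import subprocess


def overlayfs_available():
    """Return True if 'overlay' is listed as a supported filesystem."""
    with open('/proc/filesystems') as f:
        if any(line.split()[-1] == 'overlay' for line in f if line.strip()):
            return True
    # Not registered yet; try loading the module (name assumed to be 'overlay').
    try:
        subprocess.check_call(['modprobe', 'overlay'])
    except (OSError, subprocess.CalledProcessError):
        return False
    with open('/proc/filesystems') as f:
        return any(line.split()[-1] == 'overlay' for line in f if line.strip())


if __name__ == '__main__':
    print('overlayfs available: %s' % overlayfs_available())
{noformat}
If this reports overlayfs as available, the agent can be restarted with --image_provisioner_backend=overlay, which avoids copying the full ~5GB rootfs once per container.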
[jira] [Created] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend
Michael Thomas created MESOS-6004: - Summary: Tasks fail when provisioning multiple containers with large docker images using copy backend Key: MESOS-6004 URL: https://issues.apache.org/jira/browse/MESOS-6004 Project: Mesos Issue Type: Bug Components: containerization, docker Affects Versions: 1.0.0 Environment: h4. Agent Platform - Ubuntu 16.04 - AWS g2.2xlarge instance - Nvidia support enabled h4. Agent Configuration {noformat} --containerizers=mesos,docker --docker_config= --docker_store_dir=/mnt/mesos/store/docker --executor_registration_timeout=3mins --hostname= --image_providers=docker --image_provisioner_backend=copy --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia --switch_user=false --work_dir=/mnt/mesos {noformat} h4. Framework - custom framework written in Python - using the unified containerizer with docker images h4. Test Setup * 1 master * 1 agent * 5 tasks scheduled at the same time: ** resources: cpus: 0.1, mem: 128 ** command: `echo test` ** docker image: custom docker image, based on nvidia/cuda, ~5GB ** the same docker image was used for all tasks and had already been pulled. Reporter: Michael Thomas When scheduling more than one task on the same agent, all tasks fail, as containers seem to be destroyed during provisioning. Specifically, the errors in the agent logs are: {noformat} E0808 15:53:09.691315 30996 slave.cpp:3976] Container 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being destroyed during provisioning {noformat} and {noformat} I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not register within 3mins {noformat} As the default provisioning method `copy` is being used, I assume this is due to the provisioning of multiple containers taking too long, while the agent will not wait. For large images, this method is simply not performant. The issue did not occur when only one task was scheduled. Increasing the `executor_registration_timeout` parameter seemed to help a bit, as it allowed scheduling at least 2 tasks at the same time, but it still fails with more (5 in this case). h4. 
Complete logs (with GLOG_v=0, as the GLOG_v=1 output was too long) {noformat} Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 30961 main.cpp:434] Starting Mesos agent Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 30961 slave.cpp:198] Agent started on 1)@172.31.23.17:5051 Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 30961 slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://; --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos,docker" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_config="XXX" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/mnt/mesos/store/docker" --do Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: cker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="3mins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_providers="docker" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://172.31.19.240:2181/mesos" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recov Aug 8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: er="reconnect" --recovery_timeout="15mins" --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true"
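For context, below is a minimal sketch (not the reporter's actual framework code) of how one of the five test tasks could be described to the unified (mesos) containerizer from a Python framework. It assumes the {{mesos.interface}} protobuf bindings shipped with Mesos 1.0; the helper name build_task, the task id, and the image name are illustrative placeholders.
{noformat}
# Hedged sketch of one of the five identical test tasks, assuming the
# mesos.interface Python bindings (mesos_pb2) from Mesos 1.0.
from mesos.interface import mesos_pb2


def build_task(offer, task_id, image_name):
    task = mesos_pb2.TaskInfo()
    task.task_id.value = task_id
    task.slave_id.value = offer.slave_id.value
    task.name = 'copy-backend-test-%s' % task_id  # placeholder name

    # resources: cpus: 0.1, mem: 128 (as in the test setup)
    cpus = task.resources.add()
    cpus.name = 'cpus'
    cpus.type = mesos_pb2.Value.SCALAR
    cpus.scalar.value = 0.1
    mem = task.resources.add()
    mem.name = 'mem'
    mem.type = mesos_pb2.Value.SCALAR
    mem.scalar.value = 128

    # command: `echo test`, run by the default command executor
    task.command.shell = True
    task.command.value = 'echo test'

    # unified containerizer with a docker image; the rootfs is provisioned
    # by the configured image_provisioner_backend (copy in this report)
    task.container.type = mesos_pb2.ContainerInfo.MESOS
    task.container.mesos.image.type = mesos_pb2.Image.DOCKER
    task.container.mesos.image.docker.name = image_name
    return task
{noformat}
With the {{copy}} backend, each of the five containers built this way gets its own full copy of the ~5GB image rootfs during provisioning, which is consistent with the 3-minute {{executor_registration_timeout}} being exceeded once several such tasks land on the agent at the same time.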