[jira] [Commented] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend

2016-08-09 Thread Michael Thomas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15413374#comment-15413374
 ] 

Michael Thomas commented on MESOS-6004:
---


1. I don't think it's the downloading, as the image was already cached locally. 
The {{overlay}} image_provisioner_backend does not work for me, as I am missing 
the kernel module for it.
2. I updated the logs with GLOG_v=1.

Also, I tried with a smaller image (nvidia/cuda, 1.5 GB, 21 layers) and this works 
fine and seems reasonably fast.
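
As an aside, a minimal sketch of how one could verify that the agent host's kernel 
actually exposes the overlay filesystem before switching to 
--image_provisioner_backend=overlay (this check is illustrative and not part of the 
original report; the function name is a placeholder):

{noformat}
# Check whether the running kernel lists "overlay" among its supported
# filesystems; if it does not, the overlay provisioner backend cannot work.
def overlay_supported(proc_filesystems="/proc/filesystems"):
    with open(proc_filesystems) as f:
        # Each line looks like "nodev\toverlay" or "\text4"; the last
        # whitespace-separated token is the filesystem name.
        names = [line.split()[-1] for line in f if line.strip()]
    return "overlay" in names

if __name__ == "__main__":
    print("overlay filesystem available:", overlay_supported())
{noformat}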

>  Tasks fail when provisioning multiple containers with large docker images 
> using copy backend
> -
>
> Key: MESOS-6004
> URL: https://issues.apache.org/jira/browse/MESOS-6004
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.28.2, 1.0.0
> Environment: h4. Agent Platform
> - Ubuntu 16.04
> - AWS g2.x2large instance
> - Nvidia support enabled
> h4. Agent Configuration
> {noformat}
> --containerizers=mesos,docker
> --docker_config=
> --docker_store_dir=/mnt/mesos/store/docker
> --executor_registration_timeout=3mins
> --hostname=
> --image_providers=docker
> --image_provisioner_backend=copy
> --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia
> --switch_user=false
> --work_dir=/mnt/mesos
> {noformat}
> h4. Framework
> - custom framework written in Python
> - uses the unified containerizer with docker images
> h4. Test Setup
> * 1 master
> * 1 agent
> * 5 tasks scheduled at the same time:
> ** resources: cpus: 0.1, mem: 128
> ** command: `echo test`
> ** docker image: custom docker image, based on nvidia/cuda, ~5 GB
> ** the same docker image was used for all tasks and was already pulled.
>Reporter: Michael Thomas
>  Labels: containerizer, docker, performance
>
> When scheduling more than one task on the same agent, all tasks fail, as 
> containers seem to be destroyed during provisioning.
> Specifically, the errors on the agent logs are:
> {noformat}
>  E0808 15:53:09.691315 30996 slave.cpp:3976] Container 
> 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework 
> c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being 
> destroyed during provisioning
> {noformat}
> and 
> {noformat}
> I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of 
> framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not 
> register within 3mins
> {noformat}
> As the default provisioning method {{copy}} is being used, I assume this is 
> due to the provisioning of multiple containers taking too long, so the agent 
> does not wait. For large images, this method is simply not performant.
> The issue did not occur when only one task was scheduled.
> Increasing the {{executor_registration_timeout}} parameter seemed to help a 
> bit, as it allowed scheduling at least 2 tasks at the same time, but it still 
> fails with more (5 in this case).
> h4. Complete logs
> (with GLOG_v=1)
> {noformat}
> Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.800375  
> 3738 slave.cpp:198] Agent started on 1)@172.31.23.17:5051
> Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.800403  
> 3738 slave.cpp:199] Flags at startup: 
> --appc_simple_discovery_uri_prefix="http://; 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos,docker" --default_role="*" 
> --disk_watch_interval="1mins" --docker="docker" --docker_config="XXX" 
> --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; 
> --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" 
> --docker_stop_timeout="0ns" --docker_store_dir="/mnt/t" --docker_volume_checkp
> Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: 
> oint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_registration_timeout="1mins" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" 
> --hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_command_executor="false" --image_providers="docker" 
> --image_provisioner_backend="copy" --initialize_driver_logging="true" 
> 

[jira] [Updated] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend

2016-08-09 Thread Michael Thomas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Thomas updated MESOS-6004:
--
Description: 
When scheduling more than one task on the same agent, all tasks fail, as 
containers seem to be destroyed during provisioning.

Specifically, the errors on the agent logs are:

{noformat}
 E0808 15:53:09.691315 30996 slave.cpp:3976] Container 
'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework 
c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being 
destroyed during provisioning
{noformat}

and 

{noformat}
I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of 
framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not 
register within 3mins
{noformat}

As the default provisioning method {{copy}} is being used, I assume this is due 
to the provisioning of multiple containers taking too long, so the agent does 
not wait. For large images, this method is simply not performant.

The issue did not occur when only one task was scheduled.

Increasing the {{executor_registration_timeout}} parameter seemed to help a bit, 
as it allowed scheduling at least 2 tasks at the same time, but it still fails 
with more (5 in this case).
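
For a rough sense of scale, the back-of-envelope sketch below estimates why five 
concurrent copies of a ~5 GB image can blow past even a 3-minute registration 
timeout. The ~100 MB/s effective disk throughput is an assumed figure, not a 
measurement from this agent:

{noformat}
# Rough estimate: copying a ~5 GB image once at ~100 MB/s effective disk
# throughput, with five containers provisioning concurrently on the same disk.
IMAGE_MB = 5 * 1024       # ~5 GB image from the test setup
THROUGHPUT_MB_S = 100.0   # assumed effective sequential copy throughput
TASKS = 5                 # tasks scheduled at the same time

single_copy_s = IMAGE_MB / THROUGHPUT_MB_S   # ~51 s for one container alone
contended_s = single_copy_s * TASKS          # ~256 s if the copies share the disk

print("one copy: ~%ds, five concurrent copies: ~%ds" % (single_copy_s, contended_s))
print("exceeds executor_registration_timeout of 3mins (180s):", contended_s > 180)
{noformat}

Under these assumptions the last container finishes provisioning well after the 
3-minute timeout, which matches the observed "did not register within 3mins" 
terminations.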



h4. Complete logs

(with GLOG_v=1)

{noformat}
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.800375  3738 
slave.cpp:198] Agent started on 1)@172.31.23.17:5051
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.800403  3738 
slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://; 
--appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers="mesos,docker" --default_role="*" 
--disk_watch_interval="1mins" --docker="docker" --docker_config="XXX" 
--docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; 
--docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" 
--docker_stop_timeout="0ns" --docker_store_dir="/mnt/t" --docker_volume_checkp
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: 
oint_dir="/var/run/mesos/isolators/docker/volume" 
--enforce_container_disk_quota="false" --executor_registration_timeout="1mins" 
--executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" 
--fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
--gc_disk_headroom="0.1" --hadoop_home="" --help="false" 
--hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_command_executor="false" --image_providers="docker" 
--image_provisioner_backend="copy" --initialize_driver_logging="true" 
--isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia" 
--launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" 
--logging_level="INFO" --master="zk://172.31.19.240:2181/mesos" 
--oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
--quiet="false" --recover="reconnect" --re
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: covery_timeout="15mins" 
--registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" 
--sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="false" 
--systemd_enable_support="true" 
--systemd_runtime_directory="/run/systemd/system" --version="false" 
--work_dir="/mnt/mesos"
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: 2016-08-09 
10:11:41,800:3738(0x7f71172a1700):ZOO_INFO@check_events@1728: initiated 
connection to server [172.31.19.240:2181]
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.801007  3738 
slave.cpp:519] Agent resources: gpus(*):1; cpus(*):8; mem(*):14014; 
disk(*):60257; ports(*):[31000-32000]
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.801059  3738 
slave.cpp:527] Agent attributes: [  ]
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.801077  3738 
slave.cpp:532] Agent hostname: 
ec2-52-59-113-0.eu-central-1.compute.amazonaws.com
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.802232  3772 
state.cpp:57] Recovering state from '/mnt/mesos/meta'
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.803539  3774 
status_update_manager.cpp:200] Recovering status update manager
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.803745  3772 
containerizer.cpp:522] Recovering containerizer
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.803788  3770 
docker.cpp:775] 

[jira] [Created] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend

2016-08-08 Thread Michael Thomas (JIRA)
Michael Thomas created MESOS-6004:
-

 Summary:  Tasks fail when provisioning multiple containers with 
large docker images using copy backend
 Key: MESOS-6004
 URL: https://issues.apache.org/jira/browse/MESOS-6004
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Affects Versions: 1.0.0
 Environment: h4. Agent Platform

- Ubuntu 16.04
- AWS g2.x2large instance
- Nvidia support enabled

h4. Agent Configuration

{noformat}
--containerizers=mesos,docker
--docker_config=
--docker_store_dir=/mnt/mesos/store/docker
--executor_registration_timeout=3mins
--hostname=
--image_providers=docker
--image_provisioner_backend=copy
--isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia
--switch_user=false
--work_dir=/mnt/mesos
{noformat}

h4. Framework

- custom framework written in Python
- uses the unified containerizer with docker images

h4. Test Setup

* 1 master
* 1 agent
* 5 tasks scheduled at the same time:
** resources: cpus: 0.1, mem: 128
** command: `echo test`
** docker image: custom docker image, based on nvidia/cuda, ~5 GB
** the same docker image was used for all tasks and was already pulled (a minimal 
sketch of one such task follows the list).
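
For illustration, a minimal sketch of how one of these five tasks could be built. 
This assumes the custom Python framework uses the mesos.interface protobuf 
bindings; the function name and task naming are placeholders, not taken from the 
actual framework:

{noformat}
from mesos.interface import mesos_pb2

def make_task(task_id, agent_id, image_name):
    """Build a TaskInfo for the unified (Mesos) containerizer with a Docker image."""
    task = mesos_pb2.TaskInfo()
    task.task_id.value = task_id
    task.slave_id.value = agent_id
    task.name = "copy-backend-test-" + task_id
    task.command.value = "echo test"

    # resources: cpus: 0.1, mem: 128
    cpus = task.resources.add()
    cpus.name = "cpus"
    cpus.type = mesos_pb2.Value.SCALAR
    cpus.scalar.value = 0.1

    mem = task.resources.add()
    mem.name = "mem"
    mem.type = mesos_pb2.Value.SCALAR
    mem.scalar.value = 128

    # Unified containerizer: a ContainerInfo of type MESOS provisioning a Docker
    # image, which is what triggers image provisioning (copy backend) on the agent.
    task.container.type = mesos_pb2.ContainerInfo.MESOS
    task.container.mesos.image.type = mesos_pb2.Image.DOCKER
    task.container.mesos.image.docker.name = image_name
    return task
{noformat}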

Reporter: Michael Thomas


When scheduling more than one task on the same agent, all tasks fail, as 
containers seem to be destroyed during provisioning.

Specifically, the errors on the agent logs are:

{noformat}
 E0808 15:53:09.691315 30996 slave.cpp:3976] Container 
'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework 
c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being 
destroyed during provisioning
{noformat}

and 

{noformat}
I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of 
framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not 
register within 3mins
{noformat}

As the default provisioning method `copy` is being used, I assume this is due 
to the provisioning of multiple containers taking too long, so the agent does 
not wait. For large images, this method is simply not performant.

The issue did not occur when only one task was scheduled.

Increasing the `executor_registration_timeout` parameter seemed to help a bit, 
as it allowed scheduling at least 2 tasks at the same time, but it still fails 
with more (5 in this case).



h4. Complete logs

(with GLOG_v=0, as with 1 it was too long)

{noformat}
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 30961 
main.cpp:434] Starting Mesos agent
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 30961 
slave.cpp:198] Agent started on 1)@172.31.23.17:5051
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 30961 
slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://; 
--appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers="mesos,docker" --default_role="*" 
--disk_watch_interval="1mins" --docker="docker" 
--docker_config="{"auths":{"https:\/\/index.docker.io\/v1\/":{"auth":"dGVycmFsb3VwZTpUYWxFWUFOSXR5","email":"sebastian.ge...@terraloupe.com"}}}"
 --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; 
--docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" 
--docker_stop_timeout="0ns" --docker_store_dir="/mnt/mesos/store/docker" --do
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: 
cker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
--enforce_container_disk_quota="false" --executor_registration_timeout="3mins" 
--executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" 
--fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
--gc_disk_headroom="0.1" --hadoop_home="" --help="false" 
--hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_command_executor="false" --image_providers="docker" 
--image_provisioner_backend="copy" --initialize_driver_logging="true" 
--isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia" 
--launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" 
--logging_level="INFO" --master="zk://172.31.19.240:2181/mesos" 
--oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
--quiet="false" --recov
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: er="reconnect" 
--recovery_timeout="15mins" --registration_backoff_factor="1secs" 
--revocable_cpu_low_priority="true"