[jira] [Updated] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend

2016-08-09 Thread Michael Thomas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Thomas updated MESOS-6004:
--
Description: 
When scheduling more than one task on the same agent, all tasks fail, as the 
containers seem to be destroyed during provisioning.

Specifically, the errors on the agent logs are:

{noformat}
 E0808 15:53:09.691315 30996 slave.cpp:3976] Container 
'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework 
c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being 
destroyed during provisioning
{noformat}

and 

{noformat}
I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of 
framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not 
register within 3mins
{noformat}

As the default provisioning method {{copy}} is being used, I assume this is 
because provisioning multiple containers takes too long and the agent does not 
wait. For large images, this method is simply not performant.

The issue did not occur when only one task was scheduled.

Increasing the {{executor_registration_timeout}} parameter seemed to help a 
bit, as it allowed scheduling at least 2 tasks at the same time, but it still 
fails with more (5 in this case).
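
As a possible mitigation (not verified here), the relevant agent flags could 
point at an overlayfs-based provisioner backend and a longer registration 
timeout. The values below are illustrative assumptions only, and assume the 
{{overlay}} backend is available on this Mesos version and kernel:

{noformat}
# illustrative values only, not a confirmed fix
--image_provisioner_backend=overlay     (avoids copying the full image per container)
--executor_registration_timeout=10mins  (gives slow provisioning more headroom)
{noformat}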



h4. Complete logs

(with GLOG_v=1)

{noformat}
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.800375  3738 
slave.cpp:198] Agent started on 1)@172.31.23.17:5051
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.800403  3738 
slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://; 
--appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers="mesos,docker" --default_role="*" 
--disk_watch_interval="1mins" --docker="docker" --docker_config="XXX" 
--docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; 
--docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" 
--docker_stop_timeout="0ns" --docker_store_dir="/mnt/t" --docker_volume_checkp
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: 
oint_dir="/var/run/mesos/isolators/docker/volume" 
--enforce_container_disk_quota="false" --executor_registration_timeout="1mins" 
--executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" 
--fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
--gc_disk_headroom="0.1" --hadoop_home="" --help="false" 
--hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_command_executor="false" --image_providers="docker" 
--image_provisioner_backend="copy" --initialize_driver_logging="true" 
--isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia" 
--launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" 
--logging_level="INFO" --master="zk://172.31.19.240:2181/mesos" 
--oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
--quiet="false" --recover="reconnect" --re
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: covery_timeout="15mins" 
--registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" 
--sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="false" 
--systemd_enable_support="true" 
--systemd_runtime_directory="/run/systemd/system" --version="false" 
--work_dir="/mnt/mesos"
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: 2016-08-09 
10:11:41,800:3738(0x7f71172a1700):ZOO_INFO@check_events@1728: initiated 
connection to server [172.31.19.240:2181]
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.801007  3738 
slave.cpp:519] Agent resources: gpus(*):1; cpus(*):8; mem(*):14014; 
disk(*):60257; ports(*):[31000-32000]
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.801059  3738 
slave.cpp:527] Agent attributes: [  ]
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.801077  3738 
slave.cpp:532] Agent hostname: 
ec2-52-59-113-0.eu-central-1.compute.amazonaws.com
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.802232  3772 
state.cpp:57] Recovering state from '/mnt/mesos/meta'
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.803539  3774 
status_update_manager.cpp:200] Recovering status update manager
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.803745  3772 
containerizer.cpp:522] Recovering containerizer
Aug  9 10:11:41 ip-172-31-23-17 mesos-slave[3738]: I0809 10:11:41.803788  3770 
docker.cpp:775] 

[jira] [Updated] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend

2016-08-09 Thread Michael Thomas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Thomas updated MESOS-6004:
--
Description: 
When scheduling more than one task on the same agent, all tasks fail, as the 
containers seem to be destroyed during provisioning.

Specifically, the errors on the agent logs are:

{noformat}
 E0808 15:53:09.691315 30996 slave.cpp:3976] Container 
'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework 
c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being 
destroyed during provisioning
{noformat}

and 

{noformat}
I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of 
framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not 
register within 3mins
{noformat}

As the default provisioning method {{copy}} is being used, I assume this is 
because provisioning multiple containers takes too long and the agent does not 
wait. For large images, this method is simply not performant.

The issue did not occur when only one task was scheduled.

Increasing the {{executor_registration_timeout}} parameter seemed to help a 
bit, as it allowed scheduling at least 2 tasks at the same time, but it still 
fails with more (5 in this case).



h4. Complete logs

(with GLOG_v=0, as the GLOG_v=1 output was too long)

{noformat}
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 30961 
main.cpp:434] Starting Mesos agent
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 30961 
slave.cpp:198] Agent started on 1)@172.31.23.17:5051
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 30961 
slave.cpp:199] Flags at startup: --appc_simple_discovery_uri_prefix="http://; 
--appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
--authenticate_http_readwrite="false" --authenticatee="crammd5" 
--authentication_backoff_factor="1secs" --authorizer="local" 
--cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
--cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
--cgroups_root="mesos" --container_disk_watch_interval="15secs" 
--containerizers="mesos,docker" --default_role="*" 
--disk_watch_interval="1mins" --docker="docker" --docker_config="XXX" 
--docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io; 
--docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" 
--docker_stop_timeout="0ns" --docker_store_dir="/mnt/mesos/store/docker" --do
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: 
cker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
--enforce_container_disk_quota="false" --executor_registration_timeout="3mins" 
--executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" 
--fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" 
--gc_disk_headroom="0.1" --hadoop_home="" --help="false" 
--hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_command_executor="false" --image_providers="docker" 
--image_provisioner_backend="copy" --initialize_driver_logging="true" 
--isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia" 
--launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" 
--logging_level="INFO" --master="zk://172.31.19.240:2181/mesos" 
--oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
--perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
--quiet="false" --recov
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: er="reconnect" 
--recovery_timeout="15mins" --registration_backoff_factor="1secs" 
--revocable_cpu_low_priority="true" --sandbox_directory="/mnt/mesos/sandbox" 
--strict="true" --switch_user="false" --systemd_enable_support="true" 
--systemd_runtime_directory="/run/systemd/system" --version="false" 
--work_dir="/mnt/mesos"
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.662147 30961 
slave.cpp:519] Agent resources: gpus(*):1; cpus(*):8; mem(*):14014; 
disk(*):60257; ports(*):[31000-32000]
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.662211 30961 
slave.cpp:527] Agent attributes: [  ]
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.662230 30961 
slave.cpp:532] Agent hostname: 
ec2-52-59-113-0.eu-central-1.compute.amazonaws.com
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.663354 31000 
state.cpp:57] Recovering state from '/mnt/mesos/meta'
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.663918 30995 
status_update_manager.cpp:200] Recovering status update manager
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.664131 30996 
containerizer.cpp:522] Recovering containerizer
Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.664136 31000 
docker.cpp:775] Recovering 

[jira] [Updated] (MESOS-6004) Tasks fail when provisioning multiple containers with large docker images using copy backend

2016-08-08 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-6004:

Affects Version/s: 0.28.2

>  Tasks fail when provisioning multiple containers with large docker images 
> using copy backend
> -
>
> Key: MESOS-6004
> URL: https://issues.apache.org/jira/browse/MESOS-6004
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.28.2, 1.0.0
> Environment: h4. Agent Platform
> - Ubuntu 16.04
> - AWS g2.x2large instance
> - Nvidia support enabled
> h4. Agent Configuration
> {noformat}
> --containerizers=mesos,docker
> --docker_config=
> --docker_store_dir=/mnt/mesos/store/docker
> --executor_registration_timeout=3mins
> --hostname=
> --image_providers=docker
> --image_provisioner_backend=copy
> --isolation=filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia
> --switch_user=false
> --work_dir=/mnt/mesos
> {noformat}
> h4. Framework
> - custom framework written in python
> - using unified containerizer with docker images
> h4. Test Setup
> * 1 master
> * 1 agent
> * 5 tasks scheduled at the same time:
> ** resources: cpus: 0.1, mem: 128
> ** command: `echo test`
> ** docker image: custom docker image, based on nvidia/cuda ~5gb
> ** the same docker image was used for all tasks and was already pulled.
>Reporter: Michael Thomas
>  Labels: containerizer, docker, performance
>
> When scheduling more than one task on the same agent, all tasks fail, as the 
> containers seem to be destroyed during provisioning.
> Specifically, the errors on the agent logs are:
> {noformat}
>  E0808 15:53:09.691315 30996 slave.cpp:3976] Container 
> 'eb20f642-bb90-4293-8eec-6f1576ccaeb1' for executor '3' of framework 
> c9852a23-bc07-422d-8d69-23c167a1924d-0001 failed to start: Container is being 
> destroyed during provisioning
> {noformat}
> and 
> {noformat}
> I0808 15:52:32.510210 30999 slave.cpp:4539] Terminating executor ''2' of 
> framework c9852a23-bc07-422d-8d69-23c167a1924d-0001' because it did not 
> register within 3mins
> {noformat}
> As the default provisioning method {{copy}} is being used, I assume this is 
> because provisioning multiple containers takes too long and the agent does 
> not wait. For large images, this method is simply not performant.
> The issue did not occur when only one task was scheduled.
> Increasing the {{executor_registration_timeout}} parameter seemed to help a 
> bit, as it allowed scheduling at least 2 tasks at the same time, but it still 
> fails with more (5 in this case)
> h4. Complete logs
> (with GLOG_v=0, as the GLOG_v=1 output was too long)
> {noformat}
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661067 
> 30961 main.cpp:434] Starting Mesos agent
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661551 
> 30961 slave.cpp:198] Agent started on 1)@172.31.23.17:5051
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: I0808 15:48:32.661578 
> 30961 slave.cpp:199] Flags at startup: 
> --appc_simple_discovery_uri_prefix="http://; 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos,docker" --default_role="*" 
> --disk_watch_interval="1mins" --docker="docker" 
> --docker_config="{"auths":{"https:\/\/index.docker.io\/v1\/":{"auth":"dGVycmFsb3VwZTpUYWxFWUFOSXR5","email":"sebastian.ge...@terraloupe.com"}}}"
>  --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io; --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/mnt/mesos/store/docker" --do
> Aug  8 15:48:32 ip-172-31-23-17 mesos-slave[30961]: 
> cker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_registration_timeout="3mins" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" 
> --hostname="ec2-52-59-113-0.eu-central-1.compute.amazonaws.com" 
> --hostname_lookup="true" --http_authenticators="basic" 
> --http_command_executor="false" --image_providers="docker" 
> --image_provisioner_backend="copy" --initialize_driver_logging="true" 
> --isolation="filesystem/linux,docker/runtime,cgroups/devices,gpu/nvidia" 
> 
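
For context on what "using the unified containerizer with docker images" means 
in the reporter's Python framework (quoted above): the five tasks each carry a 
ContainerInfo of type MESOS that references a Docker image, so the agent (not 
the Docker daemon) provisions the rootfs, which is where the {{copy}} backend 
does its work. A minimal, hypothetical sketch using the old mesos.interface 
protobuf bindings follows; the actual framework code is not part of this 
thread, and identifiers, helper names, and the image name are illustrative:

{noformat}
from mesos.interface import mesos_pb2

def make_task(offer, task_id):
    """Build a TaskInfo that runs `echo test` inside a Docker image
    provisioned by the unified (Mesos) containerizer."""
    task = mesos_pb2.TaskInfo()
    task.task_id.value = str(task_id)
    task.slave_id.value = offer.slave_id.value
    task.name = "provision-test-%s" % task_id
    task.command.value = "echo test"

    # Resources as in the test setup above: cpus: 0.1, mem: 128.
    cpus = task.resources.add()
    cpus.name = "cpus"
    cpus.type = mesos_pb2.Value.SCALAR
    cpus.scalar.value = 0.1

    mem = task.resources.add()
    mem.name = "mem"
    mem.type = mesos_pb2.Value.SCALAR
    mem.scalar.value = 128

    # Unified containerizer: ContainerInfo type MESOS with a Docker image,
    # so the agent provisions the image rootfs via the configured backend.
    task.container.type = mesos_pb2.ContainerInfo.MESOS
    task.container.mesos.image.type = mesos_pb2.Image.DOCKER
    task.container.mesos.image.docker.name = "nvidia/cuda"  # placeholder image
    return task
{noformat}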
