Hello,
I just tried to enable CUDA support, but once it is enabled the slave refuses to start anything (Marathon jobs stay stuck in the deploying state). If I change the isolation setting from "cgroups/cpu,cgroups/mem,cgroups/devices,gpu/nvidia" to "cgroups/cpu,cgroups/mem,cgroups/devices", jobs start again. Of course, I couldn't find anything useful in the log file (attached). Could someone have a look and let me know if something is broken or badly configured?
Thanks in advance.
Best regards,
Adam.
Jan 17 11:53:55 zelda rsyslogd: [origin software="rsyslogd" swVersion="8.4.2" x-pid="22044" x-info="http://www.rsyslog.com"] start
Jan 17 11:53:55 zelda systemd[1]: Starting System Logging Service...
Jan 17 11:53:55 zelda systemd[1]: Started System Logging Service.
Jan 17 11:53:59 zelda systemd[1]: Stopping Mesos Slave...
Jan 17 11:53:59 zelda mesos-slave[21762]: W0117 11:53:53.416450 21762 logging.cpp:91] RAW: Received signal SIGTERM from process 1 of user 0; exiting
Jan 17 11:53:59 zelda systemd[1]: Starting Mesos Slave...
Jan 17 11:53:59 zelda systemd[1]: Started Mesos Slave.
Jan 17 11:53:59 zelda mesos-slave[22056]: WARNING: Logging before InitGoogleLogging() is written to STDERR
Jan 17 11:53:59 zelda mesos-slave[22056]: I0117 11:53:59.801373 22056 main.cpp:243] Build: 2016-11-16 01:34:46 by admin
Jan 17 11:53:59 zelda mesos-slave[22056]: I0117 11:53:59.801443 22056 main.cpp:244] Version: 1.1.0
Jan 17 11:53:59 zelda mesos-slave[22056]: I0117 11:53:59.801447 22056 main.cpp:247] Git tag: 1.1.0
Jan 17 11:53:59 zelda mesos-slave[22056]: I0117 11:53:59.801451 22056 main.cpp:251] Git SHA: a44b077ea0df54b77f05550979e1e97f39b15873
Jan 17 11:53:59 zelda mesos-slave[22056]: I0117 11:53:59.803274 22056 logging.cpp:194] INFO level logging started!
Jan 17 11:53:59 zelda mesos-slave[22056]: I0117 11:53:59.807564 22056 systemd.cpp:238] systemd version `215` detected
Jan 17 11:53:59 zelda mesos-slave[22056]: W0117 11:53:59.807608 22056 systemd.cpp:246] Required functionality `Delegate` was introduced in Version `218`. Your system may not function properly; however since some distributions have patched systemd packages, your system may still be functional. This is why we keep running. See MESOS-3352 for more information
Jan 17 11:53:59 zelda mesos-slave[22056]: I0117 11:53:59.807770 22056 main.cpp:342] Inializing systemd state
Jan 17 11:53:59 zelda systemd[1]: Created slice Mesos Executors Slice.
Jan 17 11:53:59 zelda mesos-slave[22056]: I0117 11:53:59.813395 22056 systemd.cpp:326] Started systemd slice `mesos_executors.slice`
Jan 17 11:53:59 zelda kernel: [ 2272.436156] ACPI Warning: \_SB.PCI1.QR2A.HHHL._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
Jan 17 11:53:59 zelda kernel: [ 2272.441593] ACPI Warning: \_SB.PCI1.QR2A.HHHL._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
Jan 17 11:53:59 zelda kernel: [ 2272.446951] ACPI Warning: \_SB.PCI1.QR2A.HHHL._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
Jan 17 11:53:59 zelda kernel: [ 2272.451988] ACPI Warning: \_SB.PCI1.QR2A.HHHL._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
Jan 17 11:53:59 zelda kernel: [ 2272.456946] ACPI Warning: \_SB.PCI1.QR2A.HHHL._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
Jan 17 11:53:59 zelda kernel: [ 2272.461619] ACPI Warning: \_SB.PCI1.QR2A.HHHL._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
Jan 17 11:53:59 zelda kernel: [ 2272.466296] ACPI Warning: \_SB.PCI1.QR2A.HHHL._DSM: Argument #4 type mismatch - Found [Buffer], ACPI requires [Package] (20160422/nsarguments-95)
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.628628 22056 containerizer.cpp:200] Using isolation: cgroups/cpu,cgroups/mem,cgroups/devices,gpu/nvidia,filesystem/posix,network/cni
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.635181 22056 linux_launcher.cpp:150] Using /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
Jan 17 11:54:00 zelda mesos-slave[22056]: 2017-01-17 11:54:00,663:22056(0x7fc456435700):ZOO_INFO@log_env@726: Client environment:zookeeper.version=zookeeper C client 3.4.8
Jan 17 11:54:00 zelda mesos-slave[22056]: 2017-01-17 11:54:00,663:22056(0x7fc456435700):ZOO_INFO@log_env@730: Client environment:host.name=zelda.service.domain.com
Jan 17 11:54:00 zelda mesos-slave[22056]: 2017-01-17 11:54:00,663:22056(0x7fc456435700):ZOO_INFO@log_env@737: Client environment:os.name=Linux
Jan 17 11:54:00 zelda mesos-slave[22056]: 2017-01-17 11:54:00,663:22056(0x7fc456435700):ZOO_INFO@log_env@738: Client environment:os.arch=4.8.0-0.bpo.2-amd64
Jan 17 11:54:00 zelda mesos-slave[22056]: 2017-01-17 11:54:00,663:22056(0x7fc456435700):ZOO_INFO@log_env@739: Client environment:os.version=#1 SMP Debian 4.8.11-1~bpo8+1 (2016-12-14)
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.664551 22056 slave.cpp:208] Mesos agent started on (1)@10.99.50.3:5051
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.664583 22056 slave.cpp:209] Flags at startup: --appc_simple_discovery_uri_prefix="http://" --appc_store_dir="/tmp/mesos/store/appc" --attributes="type:physical;location:ebrc;" --authenticate_http_readonly="false" --authenticate_http_readwrite="false" --authenticatee="crammd5" --authentication_backoff_factor="1secs" --authorizer="local" --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="true" --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="true" --cgroups_root="mesos" --container_disk_watch_interval="15secs" --containerizers="mesos,docker" --default_role="*" --disk_watch_interval="1mins" --docker="docker" --docker_kill_orphans="true" --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" --docker_store_dir="/tmp/mesos/store/docker" --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" --enforce_container_disk_quota="false" --executor_registration_timeout="5m
Jan 17 11:54:00 zelda mesos-slave[22056]: ins" --executor_shutdown_grace_period="5secs" --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" --hadoop_home="" --help="false" --hostname="zelda.service.domain.com" --hostname_lookup="true" --http_authenticators="basic" --http_command_executor="false" --image_provisioner_backend="copy" --initialize_driver_logging="true" --isolation="cgroups/cpu,cgroups/mem,cgroups/devices,gpu/nvidia" --launcher="linux" --launcher_dir="/usr/libexec/mesos" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --master="zk://mario.service.domain.com:2181,luigi.service.domain.com:2181,zelda.service.domain.com:2181,bowser.service.domain.com:2181,toad.service.domain.com:2181/mesos" --max_completed_executors_per_framework="150" --oversubscribed_resources_interval="15secs" --perf_duration="10secs" --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" --quiet="false" --recover="reconnect" --recovery_timeout="15min
Jan 17 11:54:00 zelda mesos-slave[22056]: s" --registration_backoff_factor="1secs" --resources="ports:[31000-32000];cpus:28;mem:102400" --revocable_cpu_low_priority="true" --runtime_dir="/var/run/mesos" --sandbox_directory="/mnt/mesos/sandbox" --strict="true" --switch_user="true" --systemd_enable_support="true" --systemd_runtime_directory="/run/systemd/system" --version="false" --work_dir="/var/lib/mesos"
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.665395 22056 slave.cpp:533] Agent resources: gpus(*):1; ports(*):[31000-32000]; cpus(*):28; mem(*):102400; disk(*):948088
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.665458 22056 slave.cpp:541] Agent attributes: [ type=physical, location=ebrc ]
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.665472 22056 slave.cpp:546] Agent hostname: zelda.service.domain.com
Jan 17 11:54:00 zelda mesos-slave[22056]: 2017-01-17 11:54:00,667:22056(0x7fc456435700):ZOO_INFO@log_env@747: Client environment:user.name=(null)
Jan 17 11:54:00 zelda mesos-slave[22056]: 2017-01-17 11:54:00,667:22056(0x7fc456435700):ZOO_INFO@log_env@755: Client environment:user.home=/root
Jan 17 11:54:00 zelda mesos-slave[22056]: 2017-01-17 11:54:00,667:22056(0x7fc456435700):ZOO_INFO@log_env@767: Client environment:user.dir=/
Jan 17 11:54:00 zelda mesos-slave[22056]: 2017-01-17 11:54:00,667:22056(0x7fc456435700):ZOO_INFO@zookeeper_init@800: Initiating client connection, host=mario.service.domain.com:2181,luigi.service.domain.com:2181,zelda.service.domain.com:2181,bowser.service.domain.com:2181,toad.service.domain.com:2181 sessionTimeout=10000 watcher=0x7fc46695e880 sessionId=0 sessionPasswd=<null> context=0x7fc3fc000930 flags=0
Jan 17 11:54:00 zelda mesos-slave[22056]: 2017-01-17 11:54:00,669:22056(0x7fc44daa8700):ZOO_INFO@check_events@1728: initiated connection to server [10.99.50.1:2181]
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.669729 22081 state.cpp:57] Recovering state from '/var/lib/mesos/meta'
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.670692 22108 status_update_manager.cpp:203] Recovering status update manager
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.670945 22105 docker.cpp:764] Recovering Docker containers
Jan 17 11:54:00 zelda mesos-slave[22056]: 2017-01-17 11:54:00,671:22056(0x7fc44daa8700):ZOO_INFO@check_events@1775: session establishment complete on server [10.99.50.1:2181], sessionId=0x1584385911a0074, negotiated timeout=10000
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.672044 22081 containerizer.cpp:555] Recovering containerizer
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.672631 22098 group.cpp:340] Group process (zookeeper-group(1)@10.99.50.3:5051) connected to ZooKeeper
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.672693 22098 group.cpp:828] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.672719 22098 group.cpp:418] Trying to create path '/mesos' in ZooKeeper
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.673810 22108 detector.cpp:152] Detected a new leader: (id='280')
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.673925 22087 group.cpp:697] Trying to get '/mesos/json.info_0000000280' in ZooKeeper
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.674769 22083 zookeeper.cpp:259] A new leading master ([email protected]:5050) is detected
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.676254 22085 provisioner.cpp:253] Provisioner recovery complete
Jan 17 11:54:00 zelda docker[1649]: time="2017-01-17T11:54:00+01:00" level=info msg="GET /v1.18/containers/json?all=1"
Jan 17 11:54:00 zelda docker[1649]: time="2017-01-17T11:54:00+01:00" level=info msg="+job containers()"
Jan 17 11:54:00 zelda docker[1649]: time="2017-01-17T11:54:00+01:00" level=info msg="-job containers() = OK (0)"
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.729894 22106 slave.cpp:5281] Finished recovery
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730274 22106 slave.cpp:5314] Garbage collecting old agent bc7d3ba7-dd73-4ef6-a2fe-0000aa265a34-S20
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730365 22106 slave.cpp:5314] Garbage collecting old agent 12c5c6aa-eb9a-456b-be15-3dc1ea56691e-S0
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730372 22102 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/bc7d3ba7-dd73-4ef6-a2fe-0000aa265a34-S20' for gc 6.99999154731556days in the future
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730408 22106 slave.cpp:5314] Garbage collecting old agent 12c5c6aa-eb9a-456b-be15-3dc1ea56691e-S1
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730463 22102 gc.cpp:55] Scheduling '/var/lib/mesos/meta/slaves/bc7d3ba7-dd73-4ef6-a2fe-0000aa265a34-S20' for gc 6.99999154685037days in the future
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730473 22106 slave.cpp:5314] Garbage collecting old agent 12c5c6aa-eb9a-456b-be15-3dc1ea56691e-S2
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730499 22102 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/12c5c6aa-eb9a-456b-be15-3dc1ea56691e-S0' for gc 6.99999154647111days in the future
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730521 22102 gc.cpp:55] Scheduling '/var/lib/mesos/meta/slaves/12c5c6aa-eb9a-456b-be15-3dc1ea56691e-S0' for gc 6.99999154627259days in the future
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730532 22106 slave.cpp:5314] Garbage collecting old agent 12c5c6aa-eb9a-456b-be15-3dc1ea56691e-S3
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730540 22102 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/12c5c6aa-eb9a-456b-be15-3dc1ea56691e-S1' for gc 6.99999154586667days in the future
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730559 22102 gc.cpp:55] Scheduling '/var/lib/mesos/meta/slaves/12c5c6aa-eb9a-456b-be15-3dc1ea56691e-S1' for gc 6.9999915455437days in the future
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730571 22106 slave.cpp:5314] Garbage collecting old agent ca5ab23a-0662-4118-a97b-39a84ec4d9ac-S0
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730577 22102 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/12c5c6aa-eb9a-456b-be15-3dc1ea56691e-S2' for gc 6.99999154508148days in the future
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730597 22102 gc.cpp:55] Scheduling '/var/lib/mesos/meta/slaves/12c5c6aa-eb9a-456b-be15-3dc1ea56691e-S2' for gc 6.99999154485037days in the future
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730613 22102 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/12c5c6aa-eb9a-456b-be15-3dc1ea56691e-S3' for gc 6.9999915445837days in the future
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730629 22102 gc.cpp:55] Scheduling '/var/lib/mesos/meta/slaves/12c5c6aa-eb9a-456b-be15-3dc1ea56691e-S3' for gc 6.99999154440889days in the future
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730643 22102 gc.cpp:55] Scheduling '/var/lib/mesos/slaves/ca5ab23a-0662-4118-a97b-39a84ec4d9ac-S0' for gc 6.99999154417778days in the future
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730659 22102 gc.cpp:55] Scheduling '/var/lib/mesos/meta/slaves/ca5ab23a-0662-4118-a97b-39a84ec4d9ac-S0' for gc 6.99999154399407days in the future
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730721 22085 status_update_manager.cpp:177] Pausing sending status updates
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730732 22106 slave.cpp:915] New master detected at [email protected]:5050
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730762 22106 slave.cpp:936] No credentials provided. Attempting to register without authentication
Jan 17 11:54:00 zelda mesos-slave[22056]: I0117 11:54:00.730775 22106 slave.cpp:947] Detecting new master
Jan 17 11:54:01 zelda mesos-slave[22056]: I0117 11:54:01.350430 22095 slave.cpp:1217] Re-registered with master [email protected]:5050
Jan 17 11:54:01 zelda mesos-slave[22056]: I0117 11:54:01.350534 22095 slave.cpp:1253] Forwarding total oversubscribed resources {}
Jan 17 11:54:01 zelda mesos-slave[22056]: I0117 11:54:01.350538 22106 status_update_manager.cpp:184] Resuming sending status updates
Jan 17 11:54:06 zelda snmpd[1736]: Connection from UDP: [10.99.40.10]:52855->[10.99.50.3]:161
Jan 17 11:54:28 zelda snmpd[1736]: Connection from UDP: [10.99.40.10]:36156->[10.99.50.3]:161
Jan 17 11:54:28 zelda snmpd[1736]: Connection from UDP: [10.99.40.10]:36156->[10.99.50.3]:161
Jan 17 11:54:57 zelda systemd-timesyncd[1301]: interval/delta/delay/jitter/drift 2048s/-0.014s/0.002s/0.032s/+39ppm
Jan 17 11:55:00 zelda mesos-slave[22056]: I0117 11:55:00.665974 22103 slave.cpp:5044] Current disk usage 19.47%. Max allowed age: 4.937068213728113days
Jan 17 11:55:02 zelda mesos-slave[22056]: I0117 11:55:02.766595 22109 http.cpp:277] HTTP GET for /slave(1)/state from 192.168.178.52:53758 with User-Agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36'
Jan 17 11:55:09 zelda mesos-slave[22056]: I0117 11:55:09.377148 22085 http.cpp:277] HTTP GET for /slave(1)/state from 192.168.178.52:53758 with User-Agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36'
Jan 17 11:55:19 zelda mesos-slave[22056]: I0117 11:55:19.406291 22080 http.cpp:277] HTTP GET for /slave(1)/state from 192.168.178.52:53759 with User-Agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.75 Safari/537.36'
Jan 17 11:56:00 zelda mesos-slave[22056]: I0117 11:56:00.666875 22085 slave.cpp:5044] Current disk usage 19.47%. Max allowed age: 4.937067267089572days
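
For anyone skimming the log, the two entries that matter for the GPU question are "Using isolation:" and "Agent resources:". The following is only a rough sketch for pulling them out of a saved copy of the log. The helper names `isolators` and `resources` are made up for illustration, the sample text is copied from the log with the syslog prefix trimmed, and the regular expressions assume one log entry per line.

```python
import re

# Two entries copied from the log above (syslog prefix trimmed).
SAMPLE = """\
I0117 11:54:00.628628 22056 containerizer.cpp:200] Using isolation: cgroups/cpu,cgroups/mem,cgroups/devices,gpu/nvidia,filesystem/posix,network/cni
I0117 11:54:00.665395 22056 slave.cpp:533] Agent resources: gpus(*):1; ports(*):[31000-32000]; cpus(*):28; mem(*):102400; disk(*):948088
"""


def isolators(log_text):
    """Return the isolator names from the first 'Using isolation:' entry."""
    match = re.search(r"Using isolation: (\S+)", log_text)
    return match.group(1).split(",") if match else []


def resources(log_text):
    """Return {name: value} from the first 'Agent resources:' entry."""
    match = re.search(r"Agent resources: (.+)", log_text)
    if not match:
        return {}
    parsed = {}
    for item in match.group(1).split(";"):
        item = item.strip()
        if not item:
            continue
        # "gpus(*):1" -> name "gpus(*)", value "1"; drop the "(role)" suffix.
        name, _, value = item.partition(":")
        parsed[name.split("(")[0]] = value
    return parsed


if __name__ == "__main__":
    print("gpu/nvidia loaded:", "gpu/nvidia" in isolators(SAMPLE))
    print("gpus reported:", resources(SAMPLE).get("gpus"))
```

On this log both checks come back positive: the gpu/nvidia isolator is listed and the agent reports one GPU, so the agent side appears to start normally.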

