Hi Miguel,

I believe you are running into a current limitation of the observer: the executor checkpoints its process PIDs to disk, and the observer reads them back. The observer then tries to gather process statistics for each of those PIDs. If the processes live within a PID namespace, the checkpointed PIDs are namespace-local, so the observer cannot find the correct processes on the host.

A quick workaround would be to ensure your tasks are not running within a PID namespace. For the Mesos containerizer, that means you have to stop using the `namespaces/pid` isolator. If you are using the Docker containerizer, you will have to tweak the corresponding Docker option. The correct fix would be for the observer to enter the task's PID namespace before reading the stats.
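For illustration, here is a minimal sketch of the failure mode, assuming the observer samples roughly like this via psutil (cf. process_collector_psutil.py in your log; this is not the actual observer code). The pid=63 in your log is most likely namespace-local, so the same lookup performed from the host either fails or hits an unrelated process:

    # Minimal sketch of the failure mode; not the actual observer code.
    import psutil

    def sample_cpu(checkpointed_pid):
        try:
            # On the host, a namespace-local PID either does not exist
            # or belongs to an unrelated process.
            return psutil.Process(checkpointed_pid).cpu_percent(interval=1.0)
        except psutil.NoSuchProcess:
            return 0.0  # plausibly what surfaces as "0.0" in the observer UI

    print(sample_cpu(63))  # pid taken from your log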
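A hypothetical sketch of that fix (none of this exists in the observer today): resolve a host-visible PID for the container first, e.g. via `docker inspect --format '{{.State.Pid}}'`, and then read the stats from inside the container's namespaces, for example with nsenter:

    # Hypothetical sketch only: container_init_pid is the host-visible PID
    # of the container's init process, obtained out of band
    # (e.g. via `docker inspect`).
    import subprocess

    def sample_in_namespace(container_init_pid, checkpointed_pid):
        # nsenter joins the container's PID and mount namespaces, so the
        # checkpointed (namespace-local) PID resolves correctly there.
        out = subprocess.check_output([
            "nsenter", "--target", str(container_init_pid), "--pid", "--mount",
            "ps", "-o", "pcpu=,rss=", "-p", str(checkpointed_pid),
        ])
        pcpu, rss = out.split()
        return float(pcpu), int(rss)  # %CPU and RSS in KiB

(nsenter needs root privileges on the agent.)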
Best regards,
Stephan

From: Miguel Moreno <miguel.mor...@piksel.com>
Reply-To: "user@aurora.apache.org" <user@aurora.apache.org>
Date: Tuesday, 13. February 2018 at 08:51
To: "user@aurora.apache.org" <user@aurora.apache.org>
Subject: Thermos Observer is showing 0.0 For CPU and Memory

Hi everybody. We are running Aurora on AWS machines with Mesos agents. The jobs are working fine with no errors in the log, but when looking at the resource usage on the Thermos Observer endpoint (hostname:1338) it always shows 0.0 for CPU and memory.

Scrubbed __main__.log:

Log file created at: 2018/02/07 09:45:08
Running on machine: -
[DIWEF]mmdd hh:mm:ss.uuuuuu pid file:line] msg
Command line: /srv/apps/mesos/var/sandbox/thermos_executor.pex --announcer-ensemble (zookeepers) --mesos-containerizer-path /usr/libexec/mesos/mesos-containerizer --runner-logger-destination file --runner-logger-mode rotate --runner-rotate-log-backups 10 --runner-rotate-log-size-mb 5 --sandbox-mount-point /srv/apps/mesos/var/sandbox
I0207 09:45:08.067967 12 executor_base.py:45] Executor [None]: registered() called with:
I0207 09:45:08.068506 12 executor_base.py:45] Executor [None]: ExecutorInfo: executor_id { value: "executor_id" } resources { name: "cpus" type: SCALAR scalar { value: 0.25 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 128.0 } role: "*" } command { uris { value: "/usr/share/aurora/bin/thermos_executor.pex" executable: true } uris { value: "/etc/mesos-files/.docker/config.json" executable: true } uris { value: "/usr/share/aurora/bin/thermos_executor.pex" executable: true } value: "${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble (zookeepers) --mesos-containerizer-path /usr/libexec/mesos/mesos-containerizer --runner-logger-destination file --runner-logger-mode rotate --runner-rotate-log-backups 10 --runner-rotate-log-size-mb 5 --sandbox-mount-point /srv/apps/mesos/var/sandbox" } framework_id { value: "6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000" } name: "AuroraExecutor" source: "source" container { type: DOCKER docker { image: "docker" } } labels { labels { key: "source" value: "source" } }
I0207 09:45:08.068696 12 executor_base.py:45] Executor [None]: FrameworkInfo: user: "root" name: "Aurora" id { value: "6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000" } failover_timeout: 1814400.0 checkpoint: true role: "batch" hostname: "host" principal: "aurora" capabilities { type: REVOCABLE_RESOURCES } capabilities { type: GPU_RESOURCES }
I0207 09:45:08.069024 12 executor_base.py:45] Executor [None]: SlaveInfo: hostname: "hostname" resources { name: "cpus" type: SCALAR scalar { value: 7.5 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 14526.83 } role: "*" } resources { name: "ports" type: RANGES ranges { range { begin: 20000 end: 30000 } } role: "*" } resources { name: "disk" type: SCALAR scalar { value: 95541.0 } role: "*" } attributes { name: "az" type: TEXT text { value: "eu-west-1b" } } attributes { name: "stage" type: TEXT text { value: "sandbox" } } id { value: "ce246e68-974b-4af5-9f23-39ab6f4be186-S179" } checkpoint: true port: 5051
I0207 09:45:08.070421 12 executor_base.py:45] Executor [None]: TaskInfo: name: "name" task_id { value: "task_id" } slave_id { value: "ce246e68-974b-4af5-9f23-39ab6f4be186-S179" } resources { name: "cpus" type: SCALAR scalar { value: 1.0 } role: "*" } resources { name: "disk" type: SCALAR scalar { value: 2048.0 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 1024.0 } role: "*" } executor { executor_id { value: "executor_id" } resources { name: "cpus" type: SCALAR scalar { value: 0.25 } role: "*" } resources { name: "mem" type: SCALAR scalar { value: 128.0 } role: "*" } command { uris { value: "/usr/share/aurora/bin/thermos_executor.pex" executable: true } uris { value: "/etc/mesos-files/.docker/config.json" executable: true } uris { value: "/usr/share/aurora/bin/thermos_executor.pex" executable: true } value: "${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble (zookeepers) --mesos-containerizer-path /usr/libexec/mesos/mesos-containerizer --runner-logger-destination file --runner-logger-mode rotate --runner-rotate-log-backups 10 --runner-rotate-log-size-mb 5 --sandbox-mount-point /srv/apps/mesos/var/sandbox" } framework_id { value: "6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000" } name: "AuroraExecutor" source: "source" container { type: DOCKER docker { image: "docker" } } labels { labels { key: "source" value: "source" } } } data: "data" labels { labels { key: "org.apache.aurora.tier" value: "preemptible" } } discovery { visibility: CLUSTER name: "name" environment: "devel" location: "aurora-eu-west-1" }
I0207 09:45:08.070538 12 executor_base.py:45] Executor [None]: launchTask got task: task
12 executor_base.py:45] Executor [ce246e68-974b-4af5-9f23-39ab6f4be186-S179]: Assigned task: AssignedTask(task=TaskConfig(isService=False, contactEmail=None, taskLinks={}, tier='preemptible', mesosFetcherUris=set([]), executorConfig=ExecutorConfig(data='{"environment": "devel", "health_check_config": {"health_checker": {"http": {"expected_response_code": 0, "endpoint": "/health", "expected_response": "ok"}}, "min_consecutive_successes": 1, "initial_interval_secs": 15.0, "max_consecutive_failures": 0, "timeout_secs": 1.0, "interval_secs": 10.0}, "container": {"image": "docker", "parameters": []}, "name": "name", "service": false, "max_task_failures": 1, "cron_collision_policy": "KILL_EXISTING", "enable_hooks": false, "cluster": "devcluster", "task": {"processes": [{processes, requestedPorts=set([]), maxTaskFailures=1, priority=1, ramMb=1024, job=JobKey(environment='devel', role='role', name='name'), production=False, diskMb=2048, resources=set([Resource(ramMb=None, numGpus=None, namedPort=None, diskMb=2048, numCpus=None), Resource(ramMb=None, numGpus=None, namedPort=None, diskMb=None, numCpus=1.0), Resource(ramMb=1024, numGpus=None, namedPort=None, diskMb=None, numCpus=None)]), owner=Identity(user='user'), container=Container(docker=DockerContainer(image='docker', parameters=[]), mesos=None), metadata=set([]), numCpus=1.0, constraints=set([Constraint(name='stage', constraint=TaskConstraint(limit=None, value=ValueConstraint(negated=False, values=set(['sandbox']))))])), taskId='t_id', instanceId=0, assignedPorts={}, slaveHost='host', slaveId='id')
I0207 09:45:08.071496 12 executor_base.py:45] Executor [ce246e68-974b-4af5-9f23-39ab6f4be186-S179]: Updating task_id => STARTING
I0207 09:45:08.071634 12 executor_base.py:45] Executor [ce246e68-974b-4af5-9f23-39ab6f4be186-S179]: Reason: Initializing sandbox.
D0207 09:45:08.072333 12 sandbox.py:129] DirectorySandbox: mkdir /srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox
D0207 09:45:09.482642 12 ckpt.py:379] Running state machine for process=setup_environment/seq=0
D0207 09:45:09.482762 12 ckpt.py:379] Running state machine for process=setup_environment/seq=1
D0207 09:45:09.482841 12 recordio.py:137] /srv/apps/mesos/var/sandbox/checkpoints/checkpoints/task_id/runner has no data (current offset = 981)
D0207 09:45:09.482932 12 resource.py:136] Initialising resource collection for task task_id
D0207 09:45:09.483011 12 resource.py:147] Initialising ResourceHistory of length 180
D0207 09:45:09.620017 12 resource.py:187] Commencing resource monitoring for task "task_id"
I0207 09:45:09.620552 12 aurora_executor.py:188] Send TASK_RUNNING status update. status: StatusResult('No health-check defined, task is assumed healthy.', status='TASK_RUNNING')
I0207 09:45:09.620950 12 executor_base.py:45] Executor [ce246e68-974b-4af5-9f23-39ab6f4be186-S179]: Updating task_id => RUNNING
I0207 09:45:09.621148 12 executor_base.py:45] Executor [ce246e68-974b-4af5-9f23-39ab6f4be186-S179]: Reason: No health-check defined, task is assumed healthy.
D0207 09:45:09.622956 12 disk.py:44] DiskCollectorThread: finished collection of /srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox in 0.8ms
D0207 09:45:29.654671 12 ckpt.py:379] Running state machine for process=setup_environment/seq=2
D0207 09:45:29.654998 12 ckpt.py:379] Running state machine for process=setup_environment/seq=3
D0207 09:45:29.655204 12 ckpt.py:379] Running state machine for process=s3_download/seq=0
D0207 09:45:29.655424 12 ckpt.py:379] Running state machine for process=s3_download/seq=1
D0207 09:45:29.655628 12 ckpt.py:379] Running state machine for process=s3_download/seq=2
D0207 09:45:29.655834 12 ckpt.py:379] Running state machine for process=s3_download/seq=3
D0207 09:45:29.656034 12 ckpt.py:379] Running state machine for process=single_h265_transcode_file/seq=0
D0207 09:45:29.656261 12 ckpt.py:379] Running state machine for process=single_h265_transcode_file/seq=1
D0207 09:45:29.656470 12 ckpt.py:379] Running state machine for process=single_h265_transcode_file/seq=2
D0207 09:45:29.656608 12 recordio.py:137] /srv/apps/mesos/var/sandbox/checkpoints/checkpoints/task_id/runner has no data (current offset = 1569)
D0207 09:45:49.657543 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24659524141
D0207 09:46:09.659521 12 disk.py:44] DiskCollectorThread: finished collection of /srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox in 1.3ms
D0207 09:46:09.665703 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24998856702
D0207 09:46:29.758064 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24475260671
D0207 09:46:49.759083 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24843625817
D0207 09:47:09.756055 12 disk.py:44] DiskCollectorThread: finished collection of /srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox in 1.3ms
D0207 09:47:09.759665 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24596081776
D0207 09:47:29.761161 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24940960757
D0207 09:47:49.761781 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24846117073
D0207 09:48:09.855884 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24563907521
D0207 09:48:09.862927 12 disk.py:44] DiskCollectorThread: finished collection of /srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox in 0.9ms
D0207 09:48:29.957253 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24767839681
D0207 09:48:49.960608 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24828815758
W0207 09:45:08.216031 12 monitor.py:92] Could not read from checkpoint /srv/apps/mesos/var/sandbox/checkpoints/checkpoints/task_id/runner
I0207 09:45:08.216214 12 thermos_task_runner.py:278] Forking off runner with cmdline: /usr/bin/python2.7 /srv/apps/mesos/var/sandbox/thermos_runner.pex --process_logger_mode=rotate --task_id=task_id --log_to_disk=DEBUG --rotate_log_size_mb=5 --hostname=host --rotate_log_backups=10 --thermos_json=/srv/apps/mesos/var/sandbox/task.json --sandbox=/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox --log_dir=/srv/apps/mesos/var/sandbox --checkpoint_root=/srv/apps/mesos/var/sandbox/checkpoints --container_sandbox=/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox --process_logger_destination=file
D0207 09:45:08.218667 12 thermos_task_runner.py:291] Waiting for task to start.
D0207 09:45:08.218888 12 thermos_task_runner.py:300] - sleeping...
D0207 09:45:08.719671 12 thermos_task_runner.py:300] - sleeping...
D0207 09:45:09.232539 12 aurora_executor.py:159] Task started.
W0207 09:45:09.355899 12 health_checker.py:389] No health-checks defined, will use a no-op health-checker.
D0207 09:45:09.482341 12 ckpt.py:348] Flipping task state from (undefined) to ACTIVE
D0207 09:45:09.482521 12 ckpt.py:328] Initializing TaskRunner header to RunnerHeader(launch_time_ms=1517996708593, uid=0, task_id='task_id', hostname='host', sandbox='/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox', log_dir='/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox/.logs', ports={}, user='root')

From what I can see in:

D0207 09:47:29.761161 12 process_collector_psutil.py:93] Calculated rate for pid=63 and children: 1.24940960757

it is actually calculating the CPU usage, but it is not being shown.
The AWS agents run Ubuntu 14.04.5 LTS, and the Docker containers in which the jobs run use Debian 9.

Thank you.
Miguel.