Hi Miguel,

I believe you are running into a current limitation of the observer: the 
executor checkpoints its process PIDs to disk and the observer reads them. The 
observer then tries to collect process statistics for each PID. If those 
processes live within a PID namespace, the observer will not be able to see the 
correct processes from the host.
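Roughly what happens, in psutil terms (a minimal sketch; the PID 63 below is 
just the placeholder value from your scrubbed log, and the exact failure mode 
depends on what that number maps to on the host):

    import psutil

    # PID recorded by the Thermos runner *inside* the container's PID
    # namespace (hypothetical value, mirroring the "pid=63" lines below).
    checkpointed_pid = 63

    try:
        proc = psutil.Process(checkpointed_pid)
        # Looked up from the host's PID namespace, this number usually
        # belongs to an unrelated process (or to nothing at all), so the
        # CPU/memory figures do not correspond to the task.
        print(proc.name(), proc.cpu_percent(interval=1.0), proc.memory_info().rss)
    except psutil.NoSuchProcess:
        print("PID %d not visible from the host namespace" % checkpointed_pid)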

A quick workaround would be to ensure your tasks are not running within a PID 
namespace. For the Mesos containerizer, this means not using the 
`namespaces/pid` isolator. If you are using the Docker containerizer, you will 
have to tweak the corresponding Docker option.
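Concretely (please double-check against your versions): on the agent side this 
means starting mesos-agent with an `--isolation` list that does not include 
`namespaces/pid`. For the Docker containerizer, something along these lines in 
the job's container definition should make the container share the host PID 
namespace (the equivalent of `docker run --pid=host`); this is an untested 
sketch, not a recommendation I have verified in your setup:

    # Hypothetical Aurora DSL snippet -- adjust the image name and check
    # that your Docker and Aurora versions honour the "pid" parameter.
    container = Docker(
      image = 'your-image',
      parameters = [
        # Equivalent of `docker run --pid=host`: the container shares the
        # host's PID namespace, so the observer can find the task's PIDs.
        Parameter(name = 'pid', value = 'host'),
      ]
    )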

The correct fix would be for the observer to enter the task's PID namespace 
before reading the stats.
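Just to sketch what "entering the namespace" would look like (this only 
illustrates the idea, it is not the actual observer change, and it needs root):

    import ctypes
    import os

    libc = ctypes.CDLL("libc.so.6", use_errno=True)

    def print_stat_from_inside(container_init_pid, task_pid):
        # container_init_pid: host-side PID of a process inside the
        # container (e.g. its init); task_pid: the PID checkpointed by
        # the runner, which is only valid inside the namespace.
        #
        # Open both namespace handles before switching anything, since
        # these /proc paths are only valid in the host's namespaces.
        pid_fd = os.open("/proc/%d/ns/pid" % container_init_pid, os.O_RDONLY)
        mnt_fd = os.open("/proc/%d/ns/mnt" % container_init_pid, os.O_RDONLY)
        try:
            for fd in (pid_fd, mnt_fd):
                if libc.setns(fd, 0) != 0:
                    raise OSError(ctypes.get_errno(), "setns failed")
        finally:
            os.close(pid_fd)
            os.close(mnt_fd)

        # Joining a PID namespace only affects children created afterwards,
        # so the actual read has to happen in a forked child.
        child = os.fork()
        if child == 0:
            with open("/proc/%d/stat" % task_pid) as f:
                print(f.read())
            os._exit(0)
        os.waitpid(child, 0)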

Best regards,
Stephan


From: Miguel Moreno <miguel.mor...@piksel.com>
Reply-To: "user@aurora.apache.org" <user@aurora.apache.org>
Date: Tuesday, 13. February 2018 at 08:51
To: "user@aurora.apache.org" <user@aurora.apache.org>
Subject: Thermos Observer is showing 0.0 For CPU and Memory


Hi everybody.

We are running Aurora on AWS machines with Mesos agents.

The jobs are working fine and there are no errors in the logs, but when looking 
at the resource usage in the Thermos Observer endpoint (hostname:1338), it 
always shows 0.0 for CPU and memory.

Scrubbed __main__.log:

Log file created at: 2018/02/07 09:45:08
Running on machine: -
[DIWEF]mmdd hh:mm:ss.uuuuuu pid file:line] msg
Command line: /srv/apps/mesos/var/sandbox/thermos_executor.pex 
--announcer-ensemble (zookeepers) --mesos-containerizer-path 
/usr/libexec/mesos/mesos-containerizer --runner-logger-destination file 
--runner-logger-mode rotate --runner-rotate-log-backups 10 
--runner-rotate-log-size-mb 5 --sandbox-mount-point /srv/apps/mesos/var/sandbox
I0207 09:45:08.067967 12 executor_base.py:45] Executor [None]: registered() 
called with:
I0207 09:45:08.068506 12 executor_base.py:45] Executor [None]:    ExecutorInfo: 
 executor_id {
  value: "executor_id"
}
resources {
  name: "cpus"
  type: SCALAR
  scalar {
    value: 0.25
  }
  role: "*"
}
resources {
  name: "mem"
  type: SCALAR
  scalar {
    value: 128.0
  }
  role: "*"
}
command {
  uris {
    value: "/usr/share/aurora/bin/thermos_executor.pex"
    executable: true
  }
  uris {
    value: "/etc/mesos-files/.docker/config.json"
    executable: true
  }
  uris {
    value: "/usr/share/aurora/bin/thermos_executor.pex"
    executable: true
  }
  value: "${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble 
(zookeepers) --mesos-containerizer-path /usr/libexec/mesos/mesos-containerizer 
--runner-logger-destination file --runner-logger-mode rotate 
--runner-rotate-log-backups 10 --runner-rotate-log-size-mb 5 
--sandbox-mount-point /srv/apps/mesos/var/sandbox"
}
framework_id {
  value: "6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000"
}
name: "AuroraExecutor"
source: "source"
container {
  type: DOCKER
  docker {
    image: "docker"
  }
}
labels {
  labels {
    key: "source"
    value: "source"
  }
}
I0207 09:45:08.068696 12 executor_base.py:45] Executor [None]:    
FrameworkInfo: user: "root"
name: "Aurora"
id {
  value: "6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000"
}
failover_timeout: 1814400.0
checkpoint: true
role: "batch"
hostname: "host"
principal: "aurora"
capabilities {
  type: REVOCABLE_RESOURCES
}
capabilities {
  type: GPU_RESOURCES
}
I0207 09:45:08.069024 12 executor_base.py:45] Executor [None]:    SlaveInfo:    
 hostname: "hostnamel"
resources {
  name: "cpus"
  type: SCALAR
  scalar {
    value: 7.5
  }
  role: "*"
}
resources {
  name: "mem"
  type: SCALAR
  scalar {
    value: 14526.83
  }
  role: "*"
}
resources {
  name: "ports"
  type: RANGES
  ranges {
    range {
      begin: 20000
      end: 30000
    }
  }
  role: "*"
}
resources {
  name: "disk"
  type: SCALAR
  scalar {
    value: 95541.0
  }
  role: "*"
}
attributes {
  name: "az"
  type: TEXT
  text {
    value: "eu-west-1b"
  }
}
attributes {
  name: "stage"
  type: TEXT
  text {
    value: "sandbox"
  }
}
id {
  value: "ce246e68-974b-4af5-9f23-39ab6f4be186-S179"
}
checkpoint: true
port: 5051
I0207 09:45:08.070421 12 executor_base.py:45] Executor [None]: TaskInfo: name: 
"name"
task_id {
  value: "task_id"
}
slave_id {
  value: "ce246e68-974b-4af5-9f23-39ab6f4be186-S179"
}
resources {
  name: "cpus"
  type: SCALAR
  scalar {
    value: 1.0
  }
  role: "*"
}
resources {
  name: "disk"
  type: SCALAR
  scalar {
    value: 2048.0
  }
  role: "*"
}
resources {
  name: "mem"
  type: SCALAR
  scalar {
    value: 1024.0
  }
  role: "*"
}
executor {
  executor_id {
    value: "executor_id"
  }
  resources {
    name: "cpus"
    type: SCALAR
    scalar {
 value: 0.25
    }
    role: "*"
  }
  resources {
    name: "mem"
    type: SCALAR
    scalar {
      value: 128.0
    }
    role: "*"
  }
  command {
    uris {
      value: "/usr/share/aurora/bin/thermos_executor.pex"
      executable: true
    }
    uris {
      value: "/etc/mesos-files/.docker/config.json"
      executable: true
    }
    uris {
      value: "/usr/share/aurora/bin/thermos_executor.pex"
      executable: true
    }
    value: "${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble 
(zookeepers) --mesos-containerizer-path /usr/libexec/mesos/mesos-containerizer 
--runner-logger-destination file --runner-logger-mode rotate 
--runner-rotate-log-backups 10 --runner-rotate-log-size-mb 5 
--sandbox-mount-point /srv/apps/mesos/var/sandbox"
  }
  framework_id {
    value: "6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000"
  }
  name: "AuroraExecutor"
  source: "source"
  container {
    type: DOCKER
    docker {
      image: "docker"
    }
  }
  labels {
    labels {
      key: "source"
      value: "source"
    }
  }
}
data: "data"
labels {
  labels {
    key: "org.apache.aurora.tier"
    value: "preemptible"
  }
}
discovery {
  visibility: CLUSTER
  name: "name"
  environment: "devel"
  location: "aurora-eu-west-1"
}
I0207 09:45:08.070538 12 executor_base.py:45] Executor [None]: launchTask got 
task: task 12 executor_base.py:45] Executor 
[ce246e68-974b-4af5-9f23-39ab6f4be186-S179]: Assigned task: 
AssignedTask(task=TaskConfig(isService=False, contactEmail=None, taskLinks={}, 
tier='preemptible', mesosFetcherUris=set([]), 
executorConfig=ExecutorConfig(data='{"environment": "devel", 
"health_check_config": {"health_checker": {"http": {"expected_response_code": 
0, "endpoint": "/health", "expected_response": "ok"}}, 
"min_consecutive_successes": 1, "initial_interval_secs": 15.0, 
"max_consecutive_failures": 0, "timeout_secs": 1.0, "interval_secs": 10.0}, 
"container": {"image": "docker", "parameters": []}, "name": "name", "service": 
false, "max_task_failures": 1, "cron_collision_policy": "KILL_EXISTING", 
"enable_hooks": false, "cluster": "devcluster", "task": {"processes": 
[{processes, requestedPorts=set([]), maxTaskFailures=1, priority=1, ramMb=1024, 
job=JobKey(environment='devel', role='role', name='name'), production=False, 
diskMb=2048, resources=set([Resource(ramMb=None, numGpus=None, namedPort=None, 
diskMb=2048, numCpus=None), Resource(ramMb=None, numGpus=None, namedPort=None, 
diskMb=None, numCpus=1.0), Resource(ramMb=1024, numGpus=None, namedPort=None, 
diskMb=None, numCpus=None)]), owner=Identity(user='user'), 
container=Container(docker=DockerContainer(image='ddocker', parameters=[]), 
mesos=None), metadata=set([]), numCpus=1.0, 
constraints=set([Constraint(name='stage', constraint=TaskConstraint(limit=None, 
value=ValueConstraint(negated=False, values=set(['sandbox']))))])), 
taskId='t_id', instanceId=0, assignedPorts={}, slaveHost='host', slaveId='id')
I0207 09:45:08.071496 12 executor_base.py:45] Executor 
[ce246e68-974b-4af5-9f23-39ab6f4be186-S179]: Updating task_id=> STARTING
I0207 09:45:08.071634 12 executor_base.py:45] Executor 
[ce246e68-974b-4af5-9f23-39ab6f4be186-S179]:    Reason: Initializing sandbox.
D0207 09:45:08.072333 12 sandbox.py:129] DirectorySandbox: mkdir 
/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox
D0207 09:45:09.482642 12 ckpt.py:379] Running state machine for 
process=setup_environment/seq=0
D0207 09:45:09.482762 12 ckpt.py:379] Running state machine for 
process=setup_environment/seq=1
D0207 09:45:09.482841 12 recordio.py:137] 
/srv/apps/mesos/var/sandbox/checkpoints/checkpoints/task_id/runner has no data 
(current offset = 981)
D0207 09:45:09.482932 12 resource.py:136] Initialising resource collection for 
task task_id
D0207 09:45:09.483011 12 resource.py:147] Initialising ResourceHistory of 
length 180
D0207 09:45:09.620017 12 resource.py:187] Commencing resource monitoring for 
task "task_id"
I0207 09:45:09.620552 12 aurora_executor.py:188] Send TASK_RUNNING status 
update. status: StatusResult('No health-check defined, task is assumed 
healthy.', status='TASK_RUNNING')
I0207 09:45:09.620950 12 executor_base.py:45] Executor 
[ce246e68-974b-4af5-9f23-39ab6f4be186-S179]: Updating task_id => RUNNING
I0207 09:45:09.621148 12 executor_base.py:45] Executor 
[ce246e68-974b-4af5-9f23-39ab6f4be186-S179]:    Reason: No health-check 
defined, task is assumed healthy.
D0207 09:45:09.622956 12 disk.py:44] DiskCollectorThread: finished collection 
of 
/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox
 in 0.8ms
D0207 09:45:29.654671 12 ckpt.py:379] Running state machine for 
process=setup_environment/seq=2
D0207 09:45:29.654998 12 ckpt.py:379] Running state machine for 
process=setup_environment/seq=3
D0207 09:45:29.655204 12 ckpt.py:379] Running state machine for 
process=s3_download/seq=0
D0207 09:45:29.655424 12 ckpt.py:379] Running state machine for 
process=s3_download/seq=1
D0207 09:45:29.655628 12 ckpt.py:379] Running state machine for 
process=s3_download/seq=2
D0207 09:45:29.655834 12 ckpt.py:379] Running state machine for 
process=s3_download/seq=3
D0207 09:45:29.656034 12 ckpt.py:379] Running state machine for 
process=single_h265_transcode_file/seq=0
D0207 09:45:29.656261 12 ckpt.py:379] Running state machine for 
process=single_h265_transcode_file/seq=1
D0207 09:45:29.656470 12 ckpt.py:379] Running state machine for 
process=single_h265_transcode_file/seq=2
D0207 09:45:29.656608 12 recordio.py:137] 
/srv/apps/mesos/var/sandbox/checkpoints/checkpoints/task_id/runner has no data 
(current offset = 1569)
D0207 09:45:49.657543 12 process_collector_psutil.py:93] Calculated rate for 
pid=63 and children: 1.24659524141
D0207 09:46:09.659521 12 disk.py:44] DiskCollectorThread: finished collection 
of 
/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox
 in 1.3ms
D0207 09:46:09.665703 12 process_collector_psutil.py:93] Calculated rate for 
pid=63 and children: 1.24998856702
D0207 09:46:29.758064 12 process_collector_psutil.py:93] Calculated rate for 
pid=63 and children: 1.24475260671
D0207 09:46:49.759083 12 process_collector_psutil.py:93] Calculated rate for 
pid=63 and children: 1.24843625817
D0207 09:47:09.756055 12 disk.py:44] DiskCollectorThread: finished collection 
of 
/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/ttask_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox
 in 1.3ms
D0207 09:47:09.759665 12 process_collector_psutil.py:93] Calculated rate for 
pid=63 and children: 1.24596081776
D0207 09:47:29.761161 12 process_collector_psutil.py:93] Calculated rate for 
pid=63 and children: 1.24940960757
D0207 09:47:49.761781 12 process_collector_psutil.py:93] Calculated rate for 
pid=63 and children: 1.24846117073
D0207 09:48:09.855884 12 process_collector_psutil.py:93] Calculated rate for 
pid=63 and children: 1.24563907521
D0207 09:48:09.862927 12 disk.py:44] DiskCollectorThread: finished collection 
of 
/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox
 in 0.9ms
D0207 09:48:29.957253 12 process_collector_psutil.py:93] Calculated rate for 
pid=63 and children: 1.24767839681
D0207 09:48:49.960608 12 process_collector_psutil.py:93] Calculated rate for 
pid=63 and children: 1.24828815758
W0207 09:45:08.216031 12 monitor.py:92] Could not read from checkpoint 
/srv/apps/mesos/var/sandbox/checkpoints/checkpoints/task_id/runner
I0207 09:45:08.216214 12 thermos_task_runner.py:278] Forking off runner with 
cmdline: /usr/bin/python2.7 /srv/apps/mesos/var/sandbox/thermos_runner.pex 
--process_logger_mode=rotate --task_id=task_id --log_to_disk=DEBUG 
--rotate_log_size_mb=5 --hostname=host --rotate_log_backups=10 
--thermos_json=/srv/apps/mesos/var/sandbox/task.json 
--sandbox=/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox
 --log_dir=/srv/apps/mesos/var/sandbox 
--checkpoint_root=/srv/apps/mesos/var/sandbox/checkpoints 
--container_sandbox=/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/ttask_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox
 --process_logger_destination=file
D0207 09:45:08.218667 12 thermos_task_runner.py:291] Waiting for task to start.
D0207 09:45:08.218888 12 thermos_task_runner.py:300]   - sleeping...
D0207 09:45:08.719671 12 thermos_task_runner.py:300]   - sleeping...
D0207 09:45:09.232539 12 aurora_executor.py:159] Task started.
W0207 09:45:09.355899 12 health_checker.py:389] No health-checks defined, will 
use a no-op health-checker.
D0207 09:45:09.482341 12 ckpt.py:348] Flipping task state from (undefined) to 
ACTIVE
D0207 09:45:09.482521 12 ckpt.py:328] Initializing TaskRunner header to 
RunnerHeader(launch_time_ms=1517996708593, uid=0, task_id='task_id', 
hostname='host', 
sandbox='/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox',
 
log_dir='/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/ttask_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox/.logs',
 ports={}, user='root')


From what I can see in "D0207 09:47:29.761161 12 process_collector_psutil.py:93] 
Calculated rate for pid=63 and children: 1.24940960757", it is actually 
calculating the CPU usage, but it is not showing it.

The AWS agents run Ubuntu 14.04.5 LTS, and the Docker containers where the jobs 
run use Debian 9.

Thank you.
Miguel.
