Hi everybody.
We are running Aurora on AWS machines with Mesos agents.
The jobs are running fine and there are no errors in the logs, but when I try to
look at the resource usage in the Thermos Observer endpoint (hostname:1338) it
always shows 0.0 for CPU and memory.
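As a sanity check (this is just a sketch of what I plan to run, not output from
the box above), I can fetch the observer page directly to see whether the 0.0
values are already in the raw response rather than being a browser/UI issue:

# Sketch only: pull the Thermos Observer page (hostname:1338 from above)
# and grep the raw HTML for the cpu/ram fields. Python 2.7 to match what
# the agents run (see the runner cmdline in the log below).
import urllib2

html = urllib2.urlopen('http://hostname:1338/').read()
for line in html.splitlines():
    if 'cpu' in line.lower() or 'ram' in line.lower():
        print(line)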
Scrubbed __main__.log:
Log file created at: 2018/02/07 09:45:08
Running on machine: -
[DIWEF]mmdd hh:mm:ss.uuuuuu pid file:line] msg
Command line: /srv/apps/mesos/var/sandbox/thermos_executor.pex
--announcer-ensemble (zookeepers) --mesos-containerizer-path
/usr/libexec/mesos/mesos-containerizer --runner-logger-destination file
--runner-logger-mode rotate --runner-rotate-log-backups 10
--runner-rotate-log-size-mb 5 --sandbox-mount-point /srv/apps/mesos/var/sandbox
I0207 09:45:08.067967 12 executor_base.py:45] Executor [None]: registered()
called with:
I0207 09:45:08.068506 12 executor_base.py:45] Executor [None]: ExecutorInfo:
executor_id {
value: "executor_id"
}
resources {
name: "cpus"
type: SCALAR
scalar {
value: 0.25
}
role: "*"
}
resources {
name: "mem"
type: SCALAR
scalar {
value: 128.0
}
role: "*"
}
command {
uris {
value: "/usr/share/aurora/bin/thermos_executor.pex"
executable: true
}
uris {
value: "/etc/mesos-files/.docker/config.json"
executable: true
}
uris {
value: "/usr/share/aurora/bin/thermos_executor.pex"
executable: true
}
value: "${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble
(zookeepers) --mesos-containerizer-path /usr/libexec/mesos/mesos-containerizer
--runner-logger-destination file --runner-logger-mode rotate
--runner-rotate-log-backups 10 --runner-rotate-log-size-mb 5
--sandbox-mount-point /srv/apps/mesos/var/sandbox"
}
framework_id {
value: "6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000"
}
name: "AuroraExecutor"
source: "source"
container {
type: DOCKER
docker {
image: "docker"
}
}
labels {
labels {
key: "source"
value: "source"
}
}
I0207 09:45:08.068696 12 executor_base.py:45] Executor [None]:
FrameworkInfo: user: "root"
name: "Aurora"
id {
value: "6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000"
}
failover_timeout: 1814400.0
checkpoint: true
role: "batch"
hostname: "host"
principal: "aurora"
capabilities {
type: REVOCABLE_RESOURCES
}
capabilities {
type: GPU_RESOURCES
}
I0207 09:45:08.069024 12 executor_base.py:45] Executor [None]: SlaveInfo:
hostname: "hostnamel"
resources {
name: "cpus"
type: SCALAR
scalar {
value: 7.5
}
role: "*"
}
resources {
name: "mem"
type: SCALAR
scalar {
value: 14526.83
}
role: "*"
}
resources {
name: "ports"
type: RANGES
ranges {
range {
begin: 20000
end: 30000
}
}
role: "*"
}
resources {
name: "disk"
type: SCALAR
scalar {
value: 95541.0
}
role: "*"
}
attributes {
name: "az"
type: TEXT
text {
value: "eu-west-1b"
}
}
attributes {
name: "stage"
type: TEXT
text {
value: "sandbox"
}
}
id {
value: "ce246e68-974b-4af5-9f23-39ab6f4be186-S179"
}
checkpoint: true
port: 5051
I0207 09:45:08.070421 12 executor_base.py:45] Executor [None]: TaskInfo: name:
"name"
task_id {
value: "task_id"
}
slave_id {
value: "ce246e68-974b-4af5-9f23-39ab6f4be186-S179"
}
resources {
name: "cpus"
type: SCALAR
scalar {
value: 1.0
}
role: "*"
}
resources {
name: "disk"
type: SCALAR
scalar {
value: 2048.0
}
role: "*"
}
resources {
name: "mem"
type: SCALAR
scalar {
value: 1024.0
}
role: "*"
}
executor {
executor_id {
value: "executor_id"
}
resources {
name: "cpus"
type: SCALAR
scalar {
value: 0.25
}
role: "*"
}
resources {
name: "mem"
type: SCALAR
scalar {
value: 128.0
}
role: "*"
}
command {
uris {
value: "/usr/share/aurora/bin/thermos_executor.pex"
executable: true
}
uris {
value: "/etc/mesos-files/.docker/config.json"
executable: true
}
uris {
value: "/usr/share/aurora/bin/thermos_executor.pex"
executable: true
}
value: "${MESOS_SANDBOX=.}/thermos_executor.pex --announcer-ensemble
(zookeepers) --mesos-containerizer-path /usr/libexec/mesos/mesos-containerizer
--runner-logger-destination file --runner-logger-mode rotate
--runner-rotate-log-backups 10 --runner-rotate-log-size-mb 5
--sandbox-mount-point /srv/apps/mesos/var/sandbox"
}
framework_id {
value: "6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000"
}
name: "AuroraExecutor"
source: "source"
container {
type: DOCKER
docker {
image: "docker"
}
}
labels {
labels {
key: "source"
value: "source"
}
}
}
data: "data"
labels {
labels {
key: "org.apache.aurora.tier"
value: "preemptible"
}
}
discovery {
visibility: CLUSTER
name: "name"
environment: "devel"
location: "aurora-eu-west-1"
}
I0207 09:45:08.070538 12 executor_base.py:45] Executor [None]: launchTask got
task: task
12 executor_base.py:45] Executor
[ce246e68-974b-4af5-9f23-39ab6f4be186-S179]: Assigned task:
AssignedTask(task=TaskConfig(isService=False, contactEmail=None, taskLinks={},
tier='preemptible', mesosFetcherUris=set([]),
executorConfig=ExecutorConfig(data='{"environment": "devel",
"health_check_config": {"health_checker": {"http": {"expected_response_code":
0, "endpoint": "/health", "expected_response": "ok"}},
"min_consecutive_successes": 1, "initial_interval_secs": 15.0,
"max_consecutive_failures": 0, "timeout_secs": 1.0, "interval_secs": 10.0},
"container": {"image": "docker", "parameters": []}, "name": "name", "service":
false, "max_task_failures": 1, "cron_collision_policy": "KILL_EXISTING",
"enable_hooks": false, "cluster": "devcluster", "task": {"processes":
[{processes, requestedPorts=set([]), maxTaskFailures=1, priority=1, ramMb=1024,
job=JobKey(environment='devel', role='role', name='name'), production=False,
diskMb=2048, resources=set([Resource(ramMb=None, numGpus=None, namedPort=None,
diskMb=2048, numCpus=None), Resource(ramMb=None, numGpus=None, namedPort=None,
diskMb=None, numCpus=1.0), Resource(ramMb=1024, numGpus=None, namedPort=None,
diskMb=None, numCpus=None)]), owner=Identity(user='user'),
container=Container(docker=DockerContainer(image='ddocker', parameters=[]),
mesos=None), metadata=set([]), numCpus=1.0,
constraints=set([Constraint(name='stage', constraint=TaskConstraint(limit=None,
value=ValueConstraint(negated=False, values=set(['sandbox']))))])),
taskId='t_id', instanceId=0, assignedPorts={}, slaveHost='host', slaveId='id')
I0207 09:45:08.071496 12 executor_base.py:45] Executor
[ce246e68-974b-4af5-9f23-39ab6f4be186-S179]: Updating task_id=> STARTING
I0207 09:45:08.071634 12 executor_base.py:45] Executor
[ce246e68-974b-4af5-9f23-39ab6f4be186-S179]: Reason: Initializing sandbox.
D0207 09:45:08.072333 12 sandbox.py:129] DirectorySandbox: mkdir
/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox
D0207 09:45:09.482642 12 ckpt.py:379] Running state machine for
process=setup_environment/seq=0
D0207 09:45:09.482762 12 ckpt.py:379] Running state machine for
process=setup_environment/seq=1
D0207 09:45:09.482841 12 recordio.py:137]
/srv/apps/mesos/var/sandbox/checkpoints/checkpoints/task_id/runner has no data
(current offset = 981)
D0207 09:45:09.482932 12 resource.py:136] Initialising resource collection for
task task_id
D0207 09:45:09.483011 12 resource.py:147] Initialising ResourceHistory of
length 180
D0207 09:45:09.620017 12 resource.py:187] Commencing resource monitoring for
task "task_id"
I0207 09:45:09.620552 12 aurora_executor.py:188] Send TASK_RUNNING status
update. status: StatusResult('No health-check defined, task is assumed
healthy.', status='TASK_RUNNING')
I0207 09:45:09.620950 12 executor_base.py:45] Executor
[ce246e68-974b-4af5-9f23-39ab6f4be186-S179]: Updating task_id => RUNNING
I0207 09:45:09.621148 12 executor_base.py:45] Executor
[ce246e68-974b-4af5-9f23-39ab6f4be186-S179]: Reason: No health-check
defined, task is assumed healthy.
D0207 09:45:09.622956 12 disk.py:44] DiskCollectorThread: finished collection
of
/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox
in 0.8ms
D0207 09:45:29.654671 12 ckpt.py:379] Running state machine for
process=setup_environment/seq=2
D0207 09:45:29.654998 12 ckpt.py:379] Running state machine for
process=setup_environment/seq=3
D0207 09:45:29.655204 12 ckpt.py:379] Running state machine for
process=s3_download/seq=0
D0207 09:45:29.655424 12 ckpt.py:379] Running state machine for
process=s3_download/seq=1
D0207 09:45:29.655628 12 ckpt.py:379] Running state machine for
process=s3_download/seq=2
D0207 09:45:29.655834 12 ckpt.py:379] Running state machine for
process=s3_download/seq=3
D0207 09:45:29.656034 12 ckpt.py:379] Running state machine for
process=single_h265_transcode_file/seq=0
D0207 09:45:29.656261 12 ckpt.py:379] Running state machine for
process=single_h265_transcode_file/seq=1
D0207 09:45:29.656470 12 ckpt.py:379] Running state machine for
process=single_h265_transcode_file/seq=2
D0207 09:45:29.656608 12 recordio.py:137]
/srv/apps/mesos/var/sandbox/checkpoints/checkpoints/task_id/runner has no data
(current offset = 1569)
D0207 09:45:49.657543 12 process_collector_psutil.py:93] Calculated rate for
pid=63 and children: 1.24659524141
D0207 09:46:09.659521 12 disk.py:44] DiskCollectorThread: finished collection
of
/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox
in 1.3ms
D0207 09:46:09.665703 12 process_collector_psutil.py:93] Calculated rate for
pid=63 and children: 1.24998856702
D0207 09:46:29.758064 12 process_collector_psutil.py:93] Calculated rate for
pid=63 and children: 1.24475260671
D0207 09:46:49.759083 12 process_collector_psutil.py:93] Calculated rate for
pid=63 and children: 1.24843625817
D0207 09:47:09.756055 12 disk.py:44] DiskCollectorThread: finished collection
of
/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/ttask_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox
in 1.3ms
D0207 09:47:09.759665 12 process_collector_psutil.py:93] Calculated rate for
pid=63 and children: 1.24596081776
D0207 09:47:29.761161 12 process_collector_psutil.py:93] Calculated rate for
pid=63 and children: 1.24940960757
D0207 09:47:49.761781 12 process_collector_psutil.py:93] Calculated rate for
pid=63 and children: 1.24846117073
D0207 09:48:09.855884 12 process_collector_psutil.py:93] Calculated rate for
pid=63 and children: 1.24563907521
D0207 09:48:09.862927 12 disk.py:44] DiskCollectorThread: finished collection
of
/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox
in 0.9ms
D0207 09:48:29.957253 12 process_collector_psutil.py:93] Calculated rate for
pid=63 and children: 1.24767839681
D0207 09:48:49.960608 12 process_collector_psutil.py:93] Calculated rate for
pid=63 and children: 1.24828815758
W0207 09:45:08.216031 12 monitor.py:92] Could not read from checkpoint
/srv/apps/mesos/var/sandbox/checkpoints/checkpoints/task_id/runner
I0207 09:45:08.216214 12 thermos_task_runner.py:278] Forking off runner with
cmdline: /usr/bin/python2.7 /srv/apps/mesos/var/sandbox/thermos_runner.pex
--process_logger_mode=rotate --task_id=task_id --log_to_disk=DEBUG
--rotate_log_size_mb=5 --hostname=host --rotate_log_backups=10
--thermos_json=/srv/apps/mesos/var/sandbox/task.json
--sandbox=/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox
--log_dir=/srv/apps/mesos/var/sandbox
--checkpoint_root=/srv/apps/mesos/var/sandbox/checkpoints
--container_sandbox=/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/ttask_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox
--process_logger_destination=file
D0207 09:45:08.218667 12 thermos_task_runner.py:291] Waiting for task to start.
D0207 09:45:08.218888 12 thermos_task_runner.py:300] - sleeping...
D0207 09:45:08.719671 12 thermos_task_runner.py:300] - sleeping...
D0207 09:45:09.232539 12 aurora_executor.py:159] Task started.
W0207 09:45:09.355899 12 health_checker.py:389] No health-checks defined, will
use a no-op health-checker.
D0207 09:45:09.482341 12 ckpt.py:348] Flipping task state from (undefined) to
ACTIVE
D0207 09:45:09.482521 12 ckpt.py:328] Initializing TaskRunner header to
RunnerHeader(launch_time_ms=1517996708593, uid=0, task_id='task_id',
hostname='host',
sandbox='/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/task_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox',
log_dir='/srv/apps/mesos/var/workdir/slaves/ce246e68-974b-4af5-9f23-39ab6f4be186-S179/frameworks/6ff6b40c-3dfe-419f-b0ba-df52eb354be5-0000/executors/ttask_id/runs/300ff840-b4f8-4b25-8283-4b13cc7fab2d/sandbox/.logs',
ports={}, user='root')
From what I can see in:
D0207 09:47:29.761161 12 process_collector_psutil.py:93] Calculated rate for
pid=63 and children: 1.24940960757
the executor is actually calculating the CPU usage, but the observer is not
showing it.
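If I understand it correctly, the collector is doing something roughly like the
sketch below (my own approximation with psutil, not the actual Thermos code),
so a value around 1.25 should mean roughly 1.25 cores busy:

# Rough approximation of what I think process_collector_psutil is doing:
# sum the CPU time of the runner pid and its children over a sampling
# window and divide by the wall-clock time. pid=63 is taken from the
# "Calculated rate for pid=63" lines above.
import time
import psutil

def cpu_rate(pid, window=20.0):
    parent = psutil.Process(pid)
    procs = [parent] + parent.children(recursive=True)

    def total_cpu():
        total = 0.0
        for p in procs:
            try:
                times = p.cpu_times()
                total += times.user + times.system
            except psutil.NoSuchProcess:
                pass
        return total

    start = total_cpu()
    time.sleep(window)
    return (total_cpu() - start) / window

print(cpu_rate(63))  # ~1.25 here would mean ~1.25 cores busy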
The AWS agents run Ubuntu 14.04.5 LTS, and the Docker containers where the jobs
run use Debian 9.
Thank you.
Miguel.