Hi -
I am looking at revving the mesos-storm framework to be dockerized (and
simpler).
I’m using mesos 0.22.0-1.0.ubuntu1404
mesos master + mesos slave are deployed in docker containers, in case it
matters.
I have the storm (nimbus) framework launching fine as a docker container, but
launching tasks for a topology runs into problems when the executor itself is
docker-based.
For example:
TaskInfo task = TaskInfo.newBuilder()
    .setName("worker " + slot.getNodeId() + ":" + slot.getPort())
    .setTaskId(taskId)
    .setSlaveId(offer.getSlaveId())
    .setExecutor(ExecutorInfo.newBuilder()
        .setExecutorId(ExecutorID.newBuilder().setValue(details.getId()))
        .setData(ByteString.copyFromUtf8(executorDataStr))
        // new: run the supervisor executor inside a docker container
        .setContainer(ContainerInfo.newBuilder()
            .setType(ContainerInfo.Type.DOCKER)
            .setDocker(ContainerInfo.DockerInfo.newBuilder()
                .setImage("mesos-storm")))
        .setCommand(CommandInfo.newBuilder()
            .setShell(true)
            .setValue("storm supervisor storm.mesos.MesosSupervisor"))
    // rest is unchanged from the existing mesos-storm framework code
The executor launches and then exits almost immediately - see the log message:
Executor for container '88ce3658-7d9c-4b5f-b69a-cb5e48125dfd' has exited
It seems like mesos loses track of the executor? I understand there is a
one-minute timeout for the executor to register, but the exit happens well
before one minute.
I tried a few alternate commands to experiment, and I can see in the task's
stdout that
"echo testing123 && echo testing456"
prints correctly, both testing123 and testing456.
However:
"echo testing123a && sleep 10 && echo testing456a"
prints only testing123a, presumably because the container is lost and destroyed
before the sleep finishes.
So it's as if the container for the executor is only allowed to run for about
0.5 seconds before it is detected as exited and the task is lost.
Thanks for any advice.
Tyson
Slave logs look like:
mesosslave_1 | I0417 19:07:27.461230 11 slave.cpp:1121] Got assigned task
mesos-slave1.service.consul-31000 for framework
20150417-190611-2801799596-5050-1-0000
mesosslave_1 | I0417 19:07:27.461479 11 slave.cpp:1231] Launching task
mesos-slave1.service.consul-31000 for framework
20150417-190611-2801799596-5050-1-0000
mesosslave_1 | I0417 19:07:27.463250 11 slave.cpp:4160] Launching executor
insights-1-1429297638 of framework 20150417-190611-2801799596-5050-1-0000 in
work directory
'/tmp/mesos/slaves/20150417-190611-2801799596-5050-1-S0/frameworks/20150417-190611-2801799596-5050-1-0000/executors/insights-1-1429297638/runs/6539127f-9dbb-425b-86a8-845b748f0cd3'
mesosslave_1 | I0417 19:07:27.463444 11 slave.cpp:1378] Queuing task
'mesos-slave1.service.consul-31000' for executor insights-1-1429297638 of
framework '20150417-190611-2801799596-5050-1-0000
mesosslave_1 | I0417 19:07:27.467200 7 docker.cpp:755] Starting container
'6539127f-9dbb-425b-86a8-845b748f0cd3' for executor 'insights-1-1429297638' and
framework '20150417-190611-2801799596-5050-1-0000'
mesosslave_1 | I0417 19:07:27.985935 7 docker.cpp:1333] Executor for
container '6539127f-9dbb-425b-86a8-845b748f0cd3' has exited
mesosslave_1 | I0417 19:07:27.986359 7 docker.cpp:1159] Destroying
container '6539127f-9dbb-425b-86a8-845b748f0cd3'
mesosslave_1 | I0417 19:07:27.986021 9 slave.cpp:3135] Monitoring executor
'insights-1-1429297638' of framework '20150417-190611-2801799596-5050-1-0000'
in container '6539127f-9dbb-425b-86a8-845b748f0cd3'
mesosslave_1 | I0417 19:07:27.986464 7 docker.cpp:1248] Running docker
stop on container '6539127f-9dbb-425b-86a8-845b748f0cd3'
mesosslave_1 | I0417 19:07:28.286761 10 slave.cpp:3186] Executor
'insights-1-1429297638' of framework 20150417-190611-2801799596-5050-1-0000 has
terminated with unknown status
mesosslave_1 | I0417 19:07:28.288784 10 slave.cpp:2508] Handling status
update TASK_LOST (UUID: 0795a58b-f487-42e2-aaa1-a26fe6834ed7) for task
mesos-slave1.service.consul-31000 of framework
20150417-190611-2801799596-5050-1-0000 from @0.0.0.0:0
mesosslave_1 | W0417 19:07:28.289227 9 docker.cpp:841] Ignoring updating
unknown container: 6539127f-9dbb-425b-86a8-845b748f0cd3
Nimbus (framework) logs look like:
2015-04-17T19:07:28.302+0000 s.m.MesosNimbus [INFO] Received status update:
task_id {
value: "mesos-slave1.service.consul-31000"
}
state: TASK_LOST
message: "Container terminated"
slave_id {
value: "20150417-190611-2801799596-5050-1-S0"
}
timestamp: 1.429297648286981E9
source: SOURCE_SLAVE
reason: REASON_EXECUTOR_TERMINATED
11: "\a\225\245\213\364\207B\342\252\241\242o\346\203N\327"