> On Nov 15, 2017, at 8:24 AM, Dan Leary <d...@touchplan.io> wrote:
> 
> Yes, as I said at the outset, the agents are on the same host, with different
> IPs, hostnames, and work_dirs.
> If having separate work_dirs is not sufficient to keep containers separated
> by agent, what else is required?

You might also need to specify other separate agent directories, like 
--runtime_dir, --docker_volume_checkpoint_dir, etc. Check the output of 
mesos-agent --flags.
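
For example, something along these lines (an untested sketch; the paths and
ports are placeholders, and the exact flag set depends on your containerizer
configuration, so check mesos-agent --flags on your build):

```
# Sketch: two agents on one host, each with fully separate on-disk and
# cgroup state. All paths and ports below are placeholders.
mesos-agent --master=127.0.0.1:5050 \
  --ip=127.1.1.1 --port=5051 --hostname=agent1 \
  --work_dir=/var/lib/mesos/agent1 \
  --runtime_dir=/var/run/mesos/agent1 \
  --fetcher_cache_dir=/tmp/mesos/fetch/agent1 \
  --docker_volume_checkpoint_dir=/var/run/mesos/isolators/docker/volume/agent1 \
  --cgroups_root=mesos_agent1

mesos-agent --master=127.0.0.1:5050 \
  --ip=127.1.1.2 --port=5052 --hostname=agent2 \
  --work_dir=/var/lib/mesos/agent2 \
  --runtime_dir=/var/run/mesos/agent2 \
  --fetcher_cache_dir=/tmp/mesos/fetch/agent2 \
  --docker_volume_checkpoint_dir=/var/run/mesos/isolators/docker/volume/agent2 \
  --cgroups_root=mesos_agent2
```

The one I would look at first is --runtime_dir: it defaults to a host-wide
location (/var/run/mesos when running as root), so with the defaults agent2's
recovery can pick up agent1's checkpointed container state and clean it up as
an "orphan", which looks like what your agent logs show. If you use the Mesos
containerizer with cgroups isolation, the agents may also be sharing the
default cgroups root ("mesos"), so giving each agent its own --cgroups_root is
likely worth doing as well.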

> 
> 
> On Wed, Nov 15, 2017 at 11:13 AM, Vinod Kone <vinodk...@apache.org> wrote:
> How is agent2 able to see agent1's containers? Are they running on the same 
> box!? Are they somehow sharing the filesystem? If yes, that's not supported.
> 
> On Wed, Nov 15, 2017 at 8:07 AM, Dan Leary <d...@touchplan.io> wrote:
> Sure, master log and agent logs are attached.
> 
> Synopsis:  In the master log, tasks t000001 and t000002 are running...
> 
> > I1114 17:08:15.972033  5443 master.cpp:6841] Status update TASK_RUNNING 
> > (UUID: 9686a6b8-b04d-4bc5-9d26-32d50c7b0f74) for task t000001 of framework 
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:19.142276  5448 master.cpp:6841] Status update TASK_RUNNING 
> > (UUID: a6c72f31-2e47-4003-b707-9e8c4fb24f05) for task t000002 of framework 
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> 
> The operator starts up agent2 at around 17:08:50. Executor1 and its tasks are
> terminated...
> 
> > I1114 17:08:54.835841  5447 master.cpp:6964] Executor 'executor1' of 
> > framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 
> > (agent1): terminated with signal Killed
> > I1114 17:08:54.835959  5447 master.cpp:9051] Removing executor 'executor1' 
> > with resources [] of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on 
> > agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 
> > (agent1)
> > I1114 17:08:54.837419  5436 master.cpp:6841] Status update TASK_FAILED 
> > (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t000001 of framework 
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:54.837497  5436 master.cpp:6903] Forwarding status update 
> > TASK_FAILED (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t000001 
> > of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
> > I1114 17:08:54.837896  5436 master.cpp:8928] Updating the state of task 
> > t000001 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest 
> > state: TASK_FAILED, status update state: TASK_FAILED)
> > I1114 17:08:54.839159  5436 master.cpp:6841] Status update TASK_FAILED 
> > (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t000002 of framework 
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:54.839221  5436 master.cpp:6903] Forwarding status update 
> > TASK_FAILED (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t000002 
> > of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
> > I1114 17:08:54.839493  5436 master.cpp:8928] Updating the state of task 
> > t000002 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest 
> > state: TASK_FAILED, status update state: TASK_FAILED)
> 
> But agent2 doesn't register until later...
> 
> > I1114 17:08:55.588762  5442 master.cpp:5714] Received register agent 
> > message from slave(1)@127.1.1.2:5052 (agent2)
> 
> Meanwhile in the agent1 log, the termination of executor1 appears to be the 
> result of the destruction of its container...
> 
> > I1114 17:08:54.810638  5468 containerizer.cpp:2612] Container 
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 has exited
> > I1114 17:08:54.810732  5468 containerizer.cpp:2166] Destroying container 
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
> > I1114 17:08:54.810761  5468 containerizer.cpp:2712] Transitioning the state 
> > of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
> 
> Apparently because agent2 decided to "recover" the very same container...
> 
> > I1114 17:08:54.775907  6041 linux_launcher.cpp:373] 
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 is a known orphaned container
> > I1114 17:08:54.779634  6037 containerizer.cpp:966] Cleaning up orphan 
> > container cbcf6992-3094-4d0f-8482-4d68f68eae84
> > I1114 17:08:54.779705  6037 containerizer.cpp:2166] Destroying container 
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
> > I1114 17:08:54.779737  6037 containerizer.cpp:2712] Transitioning the state 
> > of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
> > I1114 17:08:54.780740  6041 linux_launcher.cpp:505] Asked to destroy 
> > container cbcf6992-3094-4d0f-8482-4d68f68eae84
> 
> Seems like an issue with the containerizer?
> 
> 
> On Tue, Nov 14, 2017 at 4:46 PM, Vinod Kone <vinodk...@apache.org> wrote:
> That seems weird then. A new agent coming up on a new IP and host shouldn't
> affect other agents running on different hosts. Can you share master logs
> that surface the issue?
> 
> On Tue, Nov 14, 2017 at 12:51 PM, Dan Leary <d...@touchplan.io> wrote:
> Just one mesos-master (no zookeeper) with --ip=127.0.0.1 --hostname=localhost.
> In /etc/hosts are 
>   127.1.1.1    agent1
>   127.1.1.2    agent2
> etc., and each mesos-agent gets passed --ip=127.1.1.1 --hostname=agent1, etc.
> 
> 
> On Tue, Nov 14, 2017 at 3:41 PM, Vinod Kone <vinodk...@apache.org> wrote:
> ```Experiments thus far are with a cluster all on a single host, master on 
> 127.0.0.1, agents have their own ip's and hostnames and ports.```
> 
> What does this mean? How do all your masters and agents run on the same host
> but still get different IPs and hostnames?
> 
> 
> On Tue, Nov 14, 2017 at 12:22 PM, Dan Leary <d...@touchplan.io> wrote:
> So I have a bespoke framework that runs under 1.4.0 using the v1 HTTP API,
> with a custom executor and checkpointing disabled.
> When the framework is running happily and a new agent is added to the cluster,
> all the existing executors immediately get terminated.
> The scheduler is told of the lost executors and tasks, then receives offers
> for both the old and new agents, and carries on normally.
> 
> I would expect, however, that the existing executors would keep running and
> the scheduler would just receive offers for the new agent.
> It's as if agent recovery is being performed when the new agent is launched,
> even though no old agent has exited.
> Experiments thus far are with a cluster all on a single host, master on 
> 127.0.0.1, agents have their own ip's and hostnames and ports.
> 
> Am I missing a configuration parameter?   Or is this correct behavior?
> 
> -Dan
