> On Nov 15, 2017, at 8:24 AM, Dan Leary <d...@touchplan.io> wrote:
>
> Yes, as I said at the outset, the agents are on the same host, with
> different ip's and hostname's and work_dir's.
> If having separate work_dirs is not sufficient to keep containers separated
> by agent, what additionally is required?

You might also need to specify other separate agent directories, like
--runtime_dir, --docker_volume_checkpoint_dir, etc. Check the output of
mesos-agent --flags.
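For example, here is a minimal sketch of two fully separated agents on one
host. The paths are made up for illustration, and the exact set of directory
flags you need may vary with your containerizers and isolators, but each flag
below is a real mesos-agent flag:

```
# Hypothetical layout: give each agent a private copy of every on-disk
# location, not just --work_dir.
mesos-agent --master=127.0.0.1:5050 \
  --ip=127.1.1.1 --port=5051 --hostname=agent1 \
  --work_dir=/var/lib/mesos/agent1 \
  --runtime_dir=/var/run/mesos/agent1 \
  --fetcher_cache_dir=/tmp/mesos/fetch/agent1 \
  --cgroups_root=mesos_agent1 \
  --docker_volume_checkpoint_dir=/var/run/mesos/agent1/docker/volume

mesos-agent --master=127.0.0.1:5050 \
  --ip=127.1.1.2 --port=5052 --hostname=agent2 \
  --work_dir=/var/lib/mesos/agent2 \
  --runtime_dir=/var/run/mesos/agent2 \
  --fetcher_cache_dir=/tmp/mesos/fetch/agent2 \
  --cgroups_root=mesos_agent2 \
  --docker_volume_checkpoint_dir=/var/run/mesos/agent2/docker/volume
```

--cgroups_root matters on Linux because the launcher discovers leftover
containers by scanning the cgroup hierarchy under that root, so with the
default shared root one agent can see (and destroy) another agent's live
containers as "orphans".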
> On Wed, Nov 15, 2017 at 11:13 AM, Vinod Kone <vinodk...@apache.org> wrote:
> How is agent2 able to see agent1's containers? Are they running on the same
> box!? Are they somehow sharing the filesystem? If yes, that's not supported.
>
> On Wed, Nov 15, 2017 at 8:07 AM, Dan Leary <d...@touchplan.io> wrote:
> Sure, master log and agent logs are attached.
>
> Synopsis: In the master log, tasks t000001 and t000002 are running...
>
> > I1114 17:08:15.972033 5443 master.cpp:6841] Status update TASK_RUNNING
> > (UUID: 9686a6b8-b04d-4bc5-9d26-32d50c7b0f74) for task t000001 of framework
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:19.142276 5448 master.cpp:6841] Status update TASK_RUNNING
> > (UUID: a6c72f31-2e47-4003-b707-9e8c4fb24f05) for task t000002 of framework
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
>
> Operator starts up agent2 around 17:08:50ish. Executor1 and its tasks are
> terminated...
>
> > I1114 17:08:54.835841 5447 master.cpp:6964] Executor 'executor1' of
> > framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on agent
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051
> > (agent1): terminated with signal Killed
> > I1114 17:08:54.835959 5447 master.cpp:9051] Removing executor 'executor1'
> > with resources [] of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on
> > agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051
> > (agent1)
> > I1114 17:08:54.837419 5436 master.cpp:6841] Status update TASK_FAILED
> > (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t000001 of framework
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:54.837497 5436 master.cpp:6903] Forwarding status update
> > TASK_FAILED (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t000001
> > of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
> > I1114 17:08:54.837896 5436 master.cpp:8928] Updating the state of task
> > t000001 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest
> > state: TASK_FAILED, status update state: TASK_FAILED)
> > I1114 17:08:54.839159 5436 master.cpp:6841] Status update TASK_FAILED
> > (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t000002 of framework
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:54.839221 5436 master.cpp:6903] Forwarding status update
> > TASK_FAILED (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t000002
> > of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
> > I1114 17:08:54.839493 5436 master.cpp:8928] Updating the state of task
> > t000002 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest
> > state: TASK_FAILED, status update state: TASK_FAILED)
>
> But agent2 doesn't register until later...
>
> > I1114 17:08:55.588762 5442 master.cpp:5714] Received register agent
> > message from slave(1)@127.1.1.2:5052 (agent2)
>
> Meanwhile in the agent1 log, the termination of executor1 appears to be the
> result of the destruction of its container...
> > I1114 17:08:54.810638 5468 containerizer.cpp:2612] Container
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 has exited
> > I1114 17:08:54.810732 5468 containerizer.cpp:2166] Destroying container
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
> > I1114 17:08:54.810761 5468 containerizer.cpp:2712] Transitioning the state
> > of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
>
> Apparently because agent2 decided to "recover" the very same container...
>
> > I1114 17:08:54.775907 6041 linux_launcher.cpp:373]
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 is a known orphaned container
> > I1114 17:08:54.779634 6037 containerizer.cpp:966] Cleaning up orphan
> > container cbcf6992-3094-4d0f-8482-4d68f68eae84
> > I1114 17:08:54.779705 6037 containerizer.cpp:2166] Destroying container
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
> > I1114 17:08:54.779737 6037 containerizer.cpp:2712] Transitioning the state
> > of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
> > I1114 17:08:54.780740 6041 linux_launcher.cpp:505] Asked to destroy
> > container cbcf6992-3094-4d0f-8482-4d68f68eae84
>
> Seems like an issue with the containerizer?
>
> On Tue, Nov 14, 2017 at 4:46 PM, Vinod Kone <vinodk...@apache.org> wrote:
> That seems weird then. A new agent coming up on a new ip and host shouldn't
> affect other agents running on different hosts. Can you share master logs
> that surface the issue?
>
> On Tue, Nov 14, 2017 at 12:51 PM, Dan Leary <d...@touchplan.io> wrote:
> Just one mesos-master (no zookeeper) with --ip=127.0.0.1 --hostname=localhost.
> In /etc/hosts are
> 127.1.1.1 agent1
> 127.1.1.2 agent2
> etc., and mesos-agent gets passed --ip=127.1.1.1 --hostname=agent1 etc.
>
> On Tue, Nov 14, 2017 at 3:41 PM, Vinod Kone <vinodk...@apache.org> wrote:
> ```Experiments thus far are with a cluster all on a single host, master on
> 127.0.0.1, agents have their own ip's and hostnames and ports.```
>
> What does this mean? How are all your masters and agents on the same host
> but still get different ips and hostnames?
>
> On Tue, Nov 14, 2017 at 12:22 PM, Dan Leary <d...@touchplan.io> wrote:
> So I have a bespoke framework that runs under 1.4.0 using the v1 HTTP API,
> custom executor, checkpointing disabled.
> When the framework is running happily and a new agent is added to the
> cluster, all the existing executors immediately get terminated.
> The scheduler is told of the lost executors and tasks, and then receives
> offers about agents old and new and carries on normally.
>
> I would expect, however, that the existing executors should keep running
> and the scheduler should just receive offers about the new agent.
> It's as if agent recovery is being performed when the new agent is launched
> even though no old agent has exited.
> Experiments thus far are with a cluster all on a single host, master on
> 127.0.0.1, agents have their own ip's and hostnames and ports.
>
> Am I missing a configuration parameter? Or is this correct behavior?
>
> -Dan
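To confirm whether this kind of sharing is what is happening: with stock flag
values, every agent on a box recovers container state from the same places at
startup, so a live container belonging to agent1 looks like an orphan to a
newly started agent2, which then destroys it, matching the agent logs above.
A minimal check, assuming cgroups v1 and the upstream default flag values
(the paths below are those defaults, not taken from the logs in this thread):

```
# If two agents on one host are sharing state, the same container ids show
# up in both locations while the second agent starts:
ls /var/run/mesos/containers      # default --runtime_dir container state
ls /sys/fs/cgroup/freezer/mesos   # default --cgroups_root (Linux launcher)
```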