Re: Adding a new agent terminates existing executors?
Understood. Thanks for the help.
Re: Adding a new agent terminates existing executors?
Yes, there are a bunch of flags that need to be different. There are likely some isolators which will not work correctly when you have multiple agents on the same host even then. The garbage collector assumes it has sole access to the disk containing the work dir, etc.

In general, running multiple agents on the same host is not tested and is not recommended at all for production. For testing purposes, I would recommend putting agents on different VMs.
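For a throwaway test setup, the separation being discussed might look like the sketch below. The flags named here are real agent flags, but the values are purely illustrative and this invocation is untested; `mesos-agent --flags` lists the authoritative set for a given version.

```shell
# Illustrative only: two co-located agents, each with its own address,
# port, directories, and cgroup subtree so recovery cannot collide.
mesos-agent --master=127.0.0.1:5050 --ip=127.1.1.1 --hostname=agent1 \
    --port=5051 \
    --work_dir=/var/lib/mesos/agent1 \
    --runtime_dir=/var/run/mesos/agent1 \
    --cgroups_root=mesos_agent1 &

mesos-agent --master=127.0.0.1:5050 --ip=127.1.1.2 --hostname=agent2 \
    --port=5052 \
    --work_dir=/var/lib/mesos/agent2 \
    --runtime_dir=/var/run/mesos/agent2 \
    --cgroups_root=mesos_agent2 &
```

Even with this separation, the caveats above stand: isolators and the garbage collector still assume sole ownership of host resources.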
Re: Adding a new agent terminates existing executors?
Bingo.
It probably doesn't hurt to differentiate --runtime_dir per agent, but the real problem is that --cgroups_root needs to be different too.
As one might infer from linux_launcher.cpp:

> Future<hashset<ContainerID>> LinuxLauncherProcess::recover(
>     const list<ContainerState>& states)
> {
>   // Recover all of the "containers" we know about based on the
>   // existing cgroups.
>   Try<vector<string>> cgroups =
>     cgroups::get(freezerHierarchy, flags.cgroups_root);

Thanks much.
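The collision can be sketched abstractly. The toy model below is not the Mesos code; `recover` is a hypothetical stand-in for the cgroup scan quoted above, reduced to set arithmetic: an agent lists every cgroup under its `cgroups_root` and treats any container it doesn't recognize from its own state as an orphan to destroy.

```python
# Toy model (NOT the Mesos implementation) of why two agents sharing one
# cgroups_root destroy each other's containers during recovery.

def recover(cgroups_by_root, cgroups_root, known_containers):
    """Return the set of containers this agent would destroy as orphans:
    everything visible under its cgroups_root minus what it checkpointed."""
    visible = cgroups_by_root.get(cgroups_root, set())
    return visible - known_containers

# Shared root: freshly started agent2 has no checkpointed state, sees
# agent1's running container, and flags it as an orphan to destroy.
shared = {"mesos": {"cbcf6992"}}
assert recover(shared, "mesos", set()) == {"cbcf6992"}

# Distinct roots: agent2 only inspects its own subtree, so agent1's
# container is never visible to it.
separate = {"mesos_agent1": {"cbcf6992"}, "mesos_agent2": set()}
assert recover(separate, "mesos_agent2", set()) == set()
```

Under this model, differentiating --cgroups_root per agent removes the overlap entirely, which matches the behavior in the agent logs below.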
Re: Adding a new agent terminates existing executors?
> On Nov 15, 2017, at 8:24 AM, Dan Leary wrote:
>
> Yes, as I said at the outset, the agents are on the same host, with different ip's and hostname's and work_dir's.
> If having separate work_dirs is not sufficient to keep containers separated by agent, what additionally is required?

You might also need to specify other separate agent directories, like --runtime_dir, --docker_volume_checkpoint_dir, etc. Check the output of mesos-agent --flags.
Re: Adding a new agent terminates existing executors?
Yes, as I said at the outset, the agents are on the same host, with different ip's and hostname's and work_dir's.
If having separate work_dirs is not sufficient to keep containers separated by agent, what additionally is required?
Re: Adding a new agent terminates existing executors?
How is agent2 able to see agent1's containers? Are they running on the same box!? Are they somehow sharing the filesystem? If yes, that's not supported.

On Wed, Nov 15, 2017 at 8:07 AM, Dan Leary wrote:
> Sure, master log and agent logs are attached.
>
> Synopsis: In the master log, tasks t01 and t02 are running...
>
> > I1114 17:08:15.972033 5443 master.cpp:6841] Status update TASK_RUNNING (UUID: 9686a6b8-b04d-4bc5-9d26-32d50c7b0f74) for task t01 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:19.142276 5448 master.cpp:6841] Status update TASK_RUNNING (UUID: a6c72f31-2e47-4003-b707-9e8c4fb24f05) for task t02 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
>
> Operator starts up agent2 around 17:08:50ish. Executor1 and its tasks are terminated
>
> > I1114 17:08:54.835841 5447 master.cpp:6964] Executor 'executor1' of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1): terminated with signal Killed
> > I1114 17:08:54.835959 5447 master.cpp:9051] Removing executor 'executor1' with resources [] of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:54.837419 5436 master.cpp:6841] Status update TASK_FAILED (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t01 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:54.837497 5436 master.cpp:6903] Forwarding status update TASK_FAILED (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t01 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
> > I1114 17:08:54.837896 5436 master.cpp:8928] Updating the state of task t01 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest state: TASK_FAILED, status update state: TASK_FAILED)
> > I1114 17:08:54.839159 5436 master.cpp:6841] Status update TASK_FAILED (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t02 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:54.839221 5436 master.cpp:6903] Forwarding status update TASK_FAILED (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t02 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
> > I1114 17:08:54.839493 5436 master.cpp:8928] Updating the state of task t02 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest state: TASK_FAILED, status update state: TASK_FAILED)
>
> But agent2 doesn't register until later...
>
> > I1114 17:08:55.588762 5442 master.cpp:5714] Received register agent message from slave(1)@127.1.1.2:5052 (agent2)
>
> Meanwhile in the agent1 log, the termination of executor1 appears to be the result of the destruction of its container...
>
> > I1114 17:08:54.810638 5468 containerizer.cpp:2612] Container cbcf6992-3094-4d0f-8482-4d68f68eae84 has exited
> > I1114 17:08:54.810732 5468 containerizer.cpp:2166] Destroying container cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
> > I1114 17:08:54.810761 5468 containerizer.cpp:2712] Transitioning the state of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
>
> Apparently because agent2 decided to "recover" the very same container...
>
> > I1114 17:08:54.775907 6041 linux_launcher.cpp:373] cbcf6992-3094-4d0f-8482-4d68f68eae84 is a known orphaned container
> > I1114 17:08:54.779634 6037 containerizer.cpp:966] Cleaning up orphan container cbcf6992-3094-4d0f-8482-4d68f68eae84
> > I1114 17:08:54.779705 6037 containerizer.cpp:2166] Destroying container cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
> > I1114 17:08:54.779737 6037 containerizer.cpp:2712] Transitioning the state of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
> > I1114 17:08:54.780740 6041 linux_launcher.cpp:505] Asked to destroy container cbcf6992-3094-4d0f-8482-4d68f68eae84
>
> Seems like an issue with the containerizer?
Re: Adding a new agent terminates existing executors?
That seems weird then. A new agent coming up on a new ip and host shouldn't affect other agents running on different hosts. Can you share master logs that surface the issue?
Re: Adding a new agent terminates existing executors?
Just one mesos-master (no zookeeper) with --ip=127.0.0.1 --hostname=localhost.
In /etc/hosts are

127.1.1.1  agent1
127.1.1.2  agent2

etc., and mesos-agent gets passed --ip=127.1.1.1 --hostname=agent1 etc.
Re: Adding a new agent terminates existing executors?
```Experiments thus far are with a cluster all on a single host, master on 127.0.0.1, agents have their own ip's and hostnames and ports.```

What does this mean? How are all your masters and agents on the same host but still get different ips and hostnames?
Adding a new agent terminates existing executors?
So I have a bespoke framework that runs under 1.4.0 using the v1 HTTP API, custom executor, checkpointing disabled.
When the framework is running happily and a new agent is added to the cluster, all the existing executors immediately get terminated.
The scheduler is told of the lost executors and tasks and then receives offers about agents old and new and carries on normally.

I would expect however that the existing executors should keep running and the scheduler should just receive offers about the new agent.
It's as if agent recovery is being performed when the new agent is launched even though no old agent has exited.
Experiments thus far are with a cluster all on a single host, master on 127.0.0.1, agents have their own ip's and hostnames and ports.

Am I missing a configuration parameter? Or is this correct behavior?

-Dan