[VOTE] Release Apache Mesos 1.2.3 (rc1)

2017-11-15 Thread Adam Bordelon
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.2.3.
1.2.3 is our last scheduled bug fix release in the 1.2.x branch.

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.2.3-rc1


The candidate for Mesos 1.2.3 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.2.3-rc1/mesos-1.2.3.tar.gz

The tag to be voted on is 1.2.3-rc1:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.2.3-rc1

The MD5 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.2.3-rc1/mesos-1.2.3.tar.gz.md5

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.2.3-rc1/mesos-1.2.3.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is up in Maven in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1218

Please vote on releasing this package as Apache Mesos 1.2.3!

The vote is open until at least Mon Nov 20 22:00 PST 2017 and passes
if a majority of at least 3 +1 PMC votes are cast.

[ ] +1 Release this package as Apache Mesos 1.2.3
[ ] -1 Do not release this package because ...

Thanks,
-Adam-


Stripping Offer.AllocationInfo and Resource.AllocationInfo for non-MULTI_ROLE schedulers.

2017-11-15 Thread Benjamin Mahler
Hi folks,

When we released MULTI_ROLE support, Offers and the Resources within them
included additional information, specifically the AllocationInfo, which
indicated which role the resources were being allocated to:

https://github.com/apache/mesos/blob/1.3.0/include/mesos/v1/mesos.proto#L907-L923
https://github.com/apache/mesos/blob/1.3.0/include/mesos/v1/mesos.proto#L1453-L1459

We included this information even for non-MULTI_ROLE schedulers, because:

(1) Any schedulers with the old protos would continue to work: since they
ignore the new fields, their notion of matching resources to offers keeps
working.

(2) Any schedulers that update the protobuf, but leave their resource
matching logic as is, also continue to work since they ignore the
allocation info.

(3) Any schedulers that update the protobuf and upgrade their matching
logic would need to update their scheduler code to reflect the changes in
the matching logic. This was OK, since such schedulers are updating their
own code to line up with their own resource matching logic.

However, this change introduced some difficulty for libraries that expose
resource matching logic and support both schedulers that know about
allocation info and schedulers that do not. Such a library would need to do
something to ensure that both old and new schedulers work against it (e.g.
strip the information from incoming offers if the scheduler instantiated
the library without the MULTI_ROLE capability).
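
As a minimal, purely illustrative sketch (not from the original message),
such stripping could look like the following, assuming the
protobuf-generated C++ accessors for the allocation_info fields in
mesos/v1/mesos.proto; the helper name is hypothetical:

  #include <mesos/v1/mesos.hpp>

  // Hypothetical library-side helper: drop AllocationInfo from an offer
  // and its resources before handing the offer to a scheduler that was
  // instantiated without the MULTI_ROLE capability.
  void stripAllocationInfo(mesos::v1::Offer* offer)
  {
    offer->clear_allocation_info();
    for (int i = 0; i < offer->resources_size(); ++i) {
      offer->mutable_resources(i)->clear_allocation_info();
    }
  }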

So, we're thinking of stripping the AllocationInfo for non-MULTI_ROLE
schedulers to simplify this for libraries. Strictly speaking this is a
*breaking change* for any non-MULTI_ROLE schedulers that have already
updated their logic to depend on the AllocationInfo presence in 1.3.x or
1.4.x.

The assumption so far is that this will be an OK change since
non-MULTI_ROLE schedulers are probably ignoring this information. But
please let me know if this is not the case!

More information here: https://issues.apache.org/jira/browse/MESOS-8237

Ben


Re: Adding a new agent terminates existing executors?

2017-11-15 Thread Dan Leary
Understood.  Thanks for the help.

On Wed, Nov 15, 2017 at 3:04 PM, Vinod Kone  wrote:

> Yes, there are a bunch of flags that need to be different. There are
> likely some isolators which will not work correctly when you have multiple
> agents on the same host even then. The garbage collector assumes it has
> sole access to the disk containing work dir etc etc.
>
> In general, running multiple agents on the same host is not tested and is
> not recommended at all for production. For testing purposes, I would
> recommend putting agents on different VMs.
>
> On Wed, Nov 15, 2017 at 11:58 AM, Dan Leary  wrote:
>
>> Bingo.
>> It probably doesn't hurt to differentiate --runtime_dir per agent but the
>> real problem is that --cgroups_root needs to be different too.
>> As one might infer from linux_launcher.cpp:
>>
>> Future<hashset<ContainerID>> LinuxLauncherProcess::recover(
>>> const list<ContainerState>& states)
>>> {
>>>   // Recover all of the "containers" we know about based on the
>>>   // existing cgroups.
>>>   Try<vector<string>> cgroups =
>>>     cgroups::get(freezerHierarchy, flags.cgroups_root);
>>
>>
>> Thanks much.
>>
>> On Wed, Nov 15, 2017 at 11:37 AM, James Peach  wrote:
>>
>>>
>>> > On Nov 15, 2017, at 8:24 AM, Dan Leary  wrote:
>>> >
>>> > Yes, as I said at the outset, the agents are on the same host, with
>>> different ip's and hostname's and work_dir's.
>>> > If having separate work_dirs is not sufficient to keep containers
>>> separated by agent, what additionally is required?
>>>
>>> You might also need to specify other separate agent directories, like
>>> --runtime_dir, --docker_volume_checkpoint_dir, etc. Check the output of
>>> mesos-agent --flags.
>>>
>>> >
>>> >
>>> > On Wed, Nov 15, 2017 at 11:13 AM, Vinod Kone 
>>> wrote:
>>> > How is agent2 able to see agent1's containers? Are they running on the
>>> same box!? Are they somehow sharing the filesystem? If yes, that's not
>>> supported.
>>> >
>>>
>>>
>>
>


[RESULT][VOTE] Release Apache Mesos 1.4.1 (rc1)

2017-11-15 Thread Kapil Arya
Hi all,

The vote for Mesos 1.4.1 (rc1) has passed with the following votes.

+1 (Binding)
--
*** Vinod Kone
*** Kapil Arya
*** Anand Mazumdar

There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.4.1

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.4.1

The mesos-1.4.1.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks,
Anand and Kapil


Re: Adding a new agent terminates existing executors?

2017-11-15 Thread Vinod Kone
Yes, there are a bunch of flags that need to be different. Even then, there
are likely some isolators which will not work correctly when you have
multiple agents on the same host; the garbage collector, for example,
assumes it has sole access to the disk containing the work dir, etc.

In general, running multiple agents on the same host is not tested and is
not recommended at all for production. For testing purposes, I would
recommend putting agents on different VMs.

On Wed, Nov 15, 2017 at 11:58 AM, Dan Leary  wrote:

> Bingo.
> It probably doesn't hurt to differentiate --runtime_dir per agent but the
> real problem is that --cgroups_root needs to be different too.
> As one might infer from linux_launcher.cpp:
>
> Future<hashset<ContainerID>> LinuxLauncherProcess::recover(
>> const list<ContainerState>& states)
>> {
>>   // Recover all of the "containers" we know about based on the
>>   // existing cgroups.
>>   Try<vector<string>> cgroups =
>>     cgroups::get(freezerHierarchy, flags.cgroups_root);
>
>
> Thanks much.
>
> On Wed, Nov 15, 2017 at 11:37 AM, James Peach  wrote:
>
>>
>> > On Nov 15, 2017, at 8:24 AM, Dan Leary  wrote:
>> >
>> > Yes, as I said at the outset, the agents are on the same host, with
>> different ip's and hostname's and work_dir's.
>> > If having separate work_dirs is not sufficient to keep containers
>> separated by agent, what additionally is required?
>>
>> You might also need to specify other separate agent directories, like
>> --runtime_dir, --docker_volume_checkpoint_dir, etc. Check the output of
>> mesos-agent --flags.
>>
>> >
>> >
>> > On Wed, Nov 15, 2017 at 11:13 AM, Vinod Kone 
>> wrote:
>> > How is agent2 able to see agent1's containers? Are they running on the
>> same box!? Are they somehow sharing the filesystem? If yes, that's not
>> supported.
>> >
>>
>>
>


Re: Adding a new agent terminates existing executors?

2017-11-15 Thread Dan Leary
Bingo.
It probably doesn't hurt to differentiate --runtime_dir per agent but the
real problem is that --cgroups_root needs to be different too.
As one might infer from linux_launcher.cpp:

Future<hashset<ContainerID>> LinuxLauncherProcess::recover(
> const list<ContainerState>& states)
> {
>   // Recover all of the "containers" we know about based on the
>   // existing cgroups.
>   Try<vector<string>> cgroups =
>     cgroups::get(freezerHierarchy, flags.cgroups_root);


Thanks much.
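
As a minimal sketch of that separation (the flag names are the mesos-agent
flags discussed in this thread; the paths and values are illustrative, not
from the original messages), the two agents might be started along these
lines:

  mesos-agent --master=127.0.0.1:5050 --ip=127.1.1.1 --hostname=agent1 --port=5051 \
    --work_dir=/var/lib/mesos/agent1 --runtime_dir=/var/run/mesos/agent1 \
    --cgroups_root=mesos_agent1

  mesos-agent --master=127.0.0.1:5050 --ip=127.1.1.2 --hostname=agent2 --port=5052 \
    --work_dir=/var/lib/mesos/agent2 --runtime_dir=/var/run/mesos/agent2 \
    --cgroups_root=mesos_agent2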

On Wed, Nov 15, 2017 at 11:37 AM, James Peach  wrote:

>
> > On Nov 15, 2017, at 8:24 AM, Dan Leary  wrote:
> >
> > Yes, as I said at the outset, the agents are on the same host, with
> different ip's and hostname's and work_dir's.
> > If having separate work_dirs is not sufficient to keep containers
> separated by agent, what additionally is required?
>
> You might also need to specify other separate agent directories, like
> --runtime_dir, --docker_volume_checkpoint_dir, etc. Check the output of
> mesos-agent --flags.
>
> >
> >
> > On Wed, Nov 15, 2017 at 11:13 AM, Vinod Kone 
> wrote:
> > How is agent2 able to see agent1's containers? Are they running on the
> same box!? Are they somehow sharing the filesystem? If yes, that's not
> supported.
> >
>
>


Re: Adding a new agent terminates existing executors?

2017-11-15 Thread James Peach

> On Nov 15, 2017, at 8:24 AM, Dan Leary  wrote:
> 
> Yes, as I said at the outset, the agents are on the same host, with different 
> ip's and hostname's and work_dir's.
> If having separate work_dirs is not sufficient to keep containers separated 
> by agent, what additionally is required?

You might also need to specify other separate agent directories, like 
--runtime_dir, --docker_volume_checkpoint_dir, etc. Check the output of 
mesos-agent --flags.

> 
> 
> On Wed, Nov 15, 2017 at 11:13 AM, Vinod Kone  wrote:
> How is agent2 able to see agent1's containers? Are they running on the same 
> box!? Are they somehow sharing the filesystem? If yes, that's not supported.
> 
> On Wed, Nov 15, 2017 at 8:07 AM, Dan Leary  wrote:
> Sure, master log and agent logs are attached.
> 
> Synopsis:  In the master log, tasks t01 and t02 are running...
> 
> > I1114 17:08:15.972033  5443 master.cpp:6841] Status update TASK_RUNNING 
> > (UUID: 9686a6b8-b04d-4bc5-9d26-32d50c7b0f74) for task t01 of framework 
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:19.142276  5448 master.cpp:6841] Status update TASK_RUNNING 
> > (UUID: a6c72f31-2e47-4003-b707-9e8c4fb24f05) for task t02 of framework 
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> 
> Operator starts up agent2 around 17:08:50ish.  Executor1 and its tasks are 
> terminated
> 
> > I1114 17:08:54.835841  5447 master.cpp:6964] Executor 'executor1' of 
> > framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 
> > (agent1): terminated with signal Killed
> > I1114 17:08:54.835959  5447 master.cpp:9051] Removing executor 'executor1' 
> > with resources [] of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on 
> > agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 
> > (agent1)
> > I1114 17:08:54.837419  5436 master.cpp:6841] Status update TASK_FAILED 
> > (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t01 of framework 
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:54.837497  5436 master.cpp:6903] Forwarding status update 
> > TASK_FAILED (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t01 
> > of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
> > I1114 17:08:54.837896  5436 master.cpp:8928] Updating the state of task 
> > t01 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest 
> > state: TASK_FAILED, status update state: TASK_FAILED)
> > I1114 17:08:54.839159  5436 master.cpp:6841] Status update TASK_FAILED 
> > (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t02 of framework 
> > 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent 
> > 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051 (agent1)
> > I1114 17:08:54.839221  5436 master.cpp:6903] Forwarding status update 
> > TASK_FAILED (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t02 
> > of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
> > I1114 17:08:54.839493  5436 master.cpp:8928] Updating the state of task 
> > t02 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest 
> > state: TASK_FAILED, status update state: TASK_FAILED)
> 
> But agent2 doesn't register until later...
> 
> > I1114 17:08:55.588762  5442 master.cpp:5714] Received register agent 
> > message from slave(1)@127.1.1.2:5052 (agent2)
> 
> Meanwhile in the agent1 log, the termination of executor1 appears to be the 
> result of the destruction of its container...
> 
> > I1114 17:08:54.810638  5468 containerizer.cpp:2612] Container 
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 has exited
> > I1114 17:08:54.810732  5468 containerizer.cpp:2166] Destroying container 
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
> > I1114 17:08:54.810761  5468 containerizer.cpp:2712] Transitioning the state 
> > of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
> 
> Apparently because agent2 decided to "recover" the very same container...
> 
> > I1114 17:08:54.775907  6041 linux_launcher.cpp:373] 
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 is a known orphaned container
> > I1114 17:08:54.779634  6037 containerizer.cpp:966] Cleaning up orphan 
> > container cbcf6992-3094-4d0f-8482-4d68f68eae84
> > I1114 17:08:54.779705  6037 containerizer.cpp:2166] Destroying container 
> > cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
> > I1114 17:08:54.779737  6037 containerizer.cpp:2712] Transitioning the state 
> > of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to DESTROYING
> > I1114 17:08:54.780740  6041 linux_launcher.cpp:505] Asked to destroy 
> > 

Re: Adding a new agent terminates existing executors?

2017-11-15 Thread Dan Leary
Yes, as I said at the outset, the agents are on the same host, with
different IPs, hostnames, and work_dirs.
If having separate work_dirs is not sufficient to keep containers separated
by agent, what additionally is required?


On Wed, Nov 15, 2017 at 11:13 AM, Vinod Kone  wrote:

> How is agent2 able to see agent1's containers? Are they running on the
> same box!? Are they somehow sharing the filesystem? If yes, that's not
> supported.
>
> On Wed, Nov 15, 2017 at 8:07 AM, Dan Leary  wrote:
>
>> Sure, master log and agent logs are attached.
>>
>> Synopsis:  In the master log, tasks t01 and t02 are running...
>>
>> > I1114 17:08:15.972033  5443 master.cpp:6841] Status update TASK_RUNNING
>> (UUID: 9686a6b8-b04d-4bc5-9d26-32d50c7b0f74) for task t01 of
>> framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent
>> 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051
>> (agent1)
>> > I1114 17:08:19.142276  5448 master.cpp:6841] Status update TASK_RUNNING
>> (UUID: a6c72f31-2e47-4003-b707-9e8c4fb24f05) for task t02 of
>> framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent
>> 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051
>> (agent1)
>>
>> Operator starts up agent2 around 17:08:50ish.  Executor1 and its tasks
>> are terminated
>>
>> > I1114 17:08:54.835841  5447 master.cpp:6964] Executor 'executor1' of
>> framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on agent
>> 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051
>> (agent1): terminated with signal Killed
>> > I1114 17:08:54.835959  5447 master.cpp:9051] Removing executor
>> 'executor1' with resources [] of framework 
>> 10aa0208-4a85-466c-af89-7e73617516f5-0001
>> on agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@
>> 127.1.1.1:5051 (agent1)
>> > I1114 17:08:54.837419  5436 master.cpp:6841] Status update TASK_FAILED
>> (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t01 of
>> framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent
>> 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051
>> (agent1)
>> > I1114 17:08:54.837497  5436 master.cpp:6903] Forwarding status update
>> TASK_FAILED (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task
>> t01 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
>> > I1114 17:08:54.837896  5436 master.cpp:8928] Updating the state of task
>> t01 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest
>> state: TASK_FAILED, status update state: TASK_FAILED)
>> > I1114 17:08:54.839159  5436 master.cpp:6841] Status update TASK_FAILED
>> (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t02 of
>> framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent
>> 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051
>> (agent1)
>> > I1114 17:08:54.839221  5436 master.cpp:6903] Forwarding status update
>> TASK_FAILED (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task
>> t02 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
>> > I1114 17:08:54.839493  5436 master.cpp:8928] Updating the state of task
>> t02 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest
>> state: TASK_FAILED, status update state: TASK_FAILED)
>>
>> But agent2 doesn't register until later...
>>
>> > I1114 17:08:55.588762  5442 master.cpp:5714] Received register agent
>> message from slave(1)@127.1.1.2:5052 (agent2)
>>
>> Meanwhile in the agent1 log, the termination of executor1 appears to be
>> the result of the destruction of its container...
>>
>> > I1114 17:08:54.810638  5468 containerizer.cpp:2612] Container
>> cbcf6992-3094-4d0f-8482-4d68f68eae84 has exited
>> > I1114 17:08:54.810732  5468 containerizer.cpp:2166] Destroying
>> container cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
>> > I1114 17:08:54.810761  5468 containerizer.cpp:2712] Transitioning the
>> state of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to
>> DESTROYING
>>
>> Apparently because agent2 decided to "recover" the very same container...
>>
>> > I1114 17:08:54.775907  6041 linux_launcher.cpp:373]
>> cbcf6992-3094-4d0f-8482-4d68f68eae84 is a known orphaned container
>> > I1114 17:08:54.779634  6037 containerizer.cpp:966] Cleaning up orphan
>> container cbcf6992-3094-4d0f-8482-4d68f68eae84
>> > I1114 17:08:54.779705  6037 containerizer.cpp:2166] Destroying
>> container cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
>> > I1114 17:08:54.779737  6037 containerizer.cpp:2712] Transitioning the
>> state of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to
>> DESTROYING
>> > I1114 17:08:54.780740  6041 linux_launcher.cpp:505] Asked to destroy
>> container cbcf6992-3094-4d0f-8482-4d68f68eae84
>>
>> Seems like an issue with the containerizer?
>>
>>
>> On Tue, Nov 14, 2017 at 4:46 PM, Vinod Kone  wrote:
>>
>>> That seems weird then. A new agent coming up on a new ip and host,
>>> shouldn't affect 

Re: Adding a new agent terminates existing executors?

2017-11-15 Thread Vinod Kone
How is agent2 able to see agent1's containers? Are they running on the same
box!? Are they somehow sharing the filesystem? If yes, that's not supported.

On Wed, Nov 15, 2017 at 8:07 AM, Dan Leary  wrote:

> Sure, master log and agent logs are attached.
>
> Synopsis:  In the master log, tasks t01 and t02 are running...
>
> > I1114 17:08:15.972033  5443 master.cpp:6841] Status update TASK_RUNNING
> (UUID: 9686a6b8-b04d-4bc5-9d26-32d50c7b0f74) for task t01 of
> framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent
> 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051
> (agent1)
> > I1114 17:08:19.142276  5448 master.cpp:6841] Status update TASK_RUNNING
> (UUID: a6c72f31-2e47-4003-b707-9e8c4fb24f05) for task t02 of
> framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent
> 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051
> (agent1)
>
> Operator starts up agent2 around 17:08:50ish.  Executor1 and its tasks are
> terminated
>
> > I1114 17:08:54.835841  5447 master.cpp:6964] Executor 'executor1' of
> framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 on agent
> 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051
> (agent1): terminated with signal Killed
> > I1114 17:08:54.835959  5447 master.cpp:9051] Removing executor
> 'executor1' with resources [] of framework 
> 10aa0208-4a85-466c-af89-7e73617516f5-0001
> on agent 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@
> 127.1.1.1:5051 (agent1)
> > I1114 17:08:54.837419  5436 master.cpp:6841] Status update TASK_FAILED
> (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t01 of
> framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent
> 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051
> (agent1)
> > I1114 17:08:54.837497  5436 master.cpp:6903] Forwarding status update
> TASK_FAILED (UUID: d6697064-6639-4d50-b88e-65b3eead182d) for task t01
> of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
> > I1114 17:08:54.837896  5436 master.cpp:8928] Updating the state of task
> t01 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest
> state: TASK_FAILED, status update state: TASK_FAILED)
> > I1114 17:08:54.839159  5436 master.cpp:6841] Status update TASK_FAILED
> (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t02 of
> framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 from agent
> 10aa0208-4a85-466c-af89-7e73617516f5-S0 at slave(1)@127.1.1.1:5051
> (agent1)
> > I1114 17:08:54.839221  5436 master.cpp:6903] Forwarding status update
> TASK_FAILED (UUID: 7e7f2078-3455-468b-9529-23aa14f7a7e0) for task t02
> of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001
> > I1114 17:08:54.839493  5436 master.cpp:8928] Updating the state of task
> t02 of framework 10aa0208-4a85-466c-af89-7e73617516f5-0001 (latest
> state: TASK_FAILED, status update state: TASK_FAILED)
>
> But agent2 doesn't register until later...
>
> > I1114 17:08:55.588762  5442 master.cpp:5714] Received register agent
> message from slave(1)@127.1.1.2:5052 (agent2)
>
> Meanwhile in the agent1 log, the termination of executor1 appears to be
> the result of the destruction of its container...
>
> > I1114 17:08:54.810638  5468 containerizer.cpp:2612] Container
> cbcf6992-3094-4d0f-8482-4d68f68eae84 has exited
> > I1114 17:08:54.810732  5468 containerizer.cpp:2166] Destroying container
> cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
> > I1114 17:08:54.810761  5468 containerizer.cpp:2712] Transitioning the
> state of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to
> DESTROYING
>
> Apparently because agent2 decided to "recover" the very same container...
>
> > I1114 17:08:54.775907  6041 linux_launcher.cpp:373]
> cbcf6992-3094-4d0f-8482-4d68f68eae84 is a known orphaned container
> > I1114 17:08:54.779634  6037 containerizer.cpp:966] Cleaning up orphan
> container cbcf6992-3094-4d0f-8482-4d68f68eae84
> > I1114 17:08:54.779705  6037 containerizer.cpp:2166] Destroying container
> cbcf6992-3094-4d0f-8482-4d68f68eae84 in RUNNING state
> > I1114 17:08:54.779737  6037 containerizer.cpp:2712] Transitioning the
> state of container cbcf6992-3094-4d0f-8482-4d68f68eae84 from RUNNING to
> DESTROYING
> > I1114 17:08:54.780740  6041 linux_launcher.cpp:505] Asked to destroy
> container cbcf6992-3094-4d0f-8482-4d68f68eae84
>
> Seems like an issue with the containerizer?
>
>
> On Tue, Nov 14, 2017 at 4:46 PM, Vinod Kone  wrote:
>
>> That seems weird then. A new agent coming up on a new ip and host,
>> shouldn't affect other agents running on different hosts. Can you share
>> master logs that surface the issue?
>>
>> On Tue, Nov 14, 2017 at 12:51 PM, Dan Leary  wrote:
>>
>>> Just one mesos-master (no zookeeper) with --ip=127.0.0.1
>>> --hostname=localhost.
>>> In /etc/hosts are
>>>   127.1.1.1    agent1
>>>   127.1.1.2    agent2
>>> etc. and mesos-agent gets passed --ip=127.1.1.1 --hostname=agent1