Re: framework failover

2016-11-04 Thread Joseph Wu
A couple questions/notes:

What do you mean by:

> the system will deploy the framework on a new node within less than three
> minutes.

Are you running your frameworks via Marathon?

How are you terminating the Mesos agent?  If you send a `kill -SIGUSR1`,
the agent will immediately kill all of its tasks and unregister with the
master.
If you kill the agent with some other signal, the agent process will simply
stop, but its tasks will continue to run.
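
For illustration only, a minimal sketch of the difference (the `pgrep`
pattern and the assumption that you run this on the agent host are mine,
not from your setup):

    import os
    import signal
    import subprocess

    # Locate the agent process on this host (hypothetical pattern; adjust
    # to however mesos-slave is started in your environment).
    out = subprocess.check_output(["pgrep", "-f", "mesos-slave"], text=True)
    agent_pid = int(out.split()[0])

    # SIGUSR1: the agent shuts down cleanly -- it kills its tasks and
    # unregisters from the master, so the master knows right away.
    os.kill(agent_pid, signal.SIGUSR1)

    # Anything else (SIGKILL, or powering the node off with `halt`) just
    # stops the agent process; tasks keep running and the master has to
    # wait for its timeouts before treating the agent as lost.
    # os.kill(agent_pid, signal.SIGKILL)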

> According to the mesos GUI page cassandra holds 99-100 % of the resources
> on the terminated slave during that 14 minutes.

^ This implies that the master does not remove the agent immediately, meaning
you killed the agent process but did not kill its tasks.
During this time, the master is waiting for the agent to come back online.
If the agent doesn't come back within some (configurable) timeout, the master
will notify the frameworks about the loss of the agent.
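
As a rough way to check which timeout applies, the master's state JSON
includes the flags it was started with; the endpoint path and flag names
below are what I'd expect on the 0.28.x line, so treat this as a sketch and
verify against your build:

    import json
    import urllib.request

    # Placeholder address -- replace with your actual master host:port.
    MASTER = "http://mesos-master.example.com:5050"

    state = json.load(urllib.request.urlopen(MASTER + "/master/state"))
    flags = state.get("flags", {})

    # Flags that control how quickly a dead agent is detected and removed,
    # and hence when frameworks hear about the lost agent and its tasks.
    for name in ("slave_ping_timeout",
                 "max_slave_ping_timeouts",
                 "slave_reregister_timeout"):
        print(name, "=", flags.get(name))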

Also, it's a little odd that your frameworks disconnect when the agent
process dies.  You may want to investigate your framework dependencies.  A
framework should definitely not depend on the agent process (frameworks do
depend on the master, though).



On Fri, Nov 4, 2016 at 10:32 AM, Jaana Miettinen wrote:

> Hi, would you help me to find out how the framework failover happens in
> mesos 0.28.0?
>
> In my mesos environment I have the following frameworks:
>
> etcd-mesos
> cassandra-mesos 0.2.0-1
> Eremetic
> marathon 0.15.2
>
> If I shut down the agent (mesos-slave) on which my framework has been
> deployed, using the 'halt' command from the Linux command line, the system
> will deploy the framework on a new node within less than three minutes.
>
> But when I shut down the agent on which the cassandra framework is running,
> it takes 14 minutes before the system recovers.
>
> According to the mesos GUI page cassandra holds 99-100 % of the resources
> on the terminated slave during that 14 minutes.
>
> Seen from the mesos-log:
>
> Line 976:  I1104 08:53:29.516564 15502 master.cpp:1173] Slave c002796f-a98d-4e55-bee3-f51b8d06323b-S8 at slave(1)@10.254.69.140:5050 (mesos-slave-1) disconnected
> Line 977:  I1104 08:53:29.516644 15502 master.cpp:2586] Disconnecting slave c002796f-a98d-4e55-bee3-f51b8d06323b-S8 at slave(1)@10.254.69.140:5050 (mesos-slave-1)
> Line 1020: I1104 08:53:39.872681 15501 master.cpp:1212] Framework c002796f-a98d-4e55-bee3-f51b8d06323b-0007 (Eremetic) at scheduler(1)@10.254.69.140:31570 disconnected
> Line 1021: I1104 08:53:39.872707 15501 master.cpp:2527] Disconnecting framework c002796f-a98d-4e55-bee3-f51b8d06323b-0007 (Eremetic) at scheduler(1)@10.254.69.140:31570
> Line 1080: W1104 08:54:53.621151 15503 master.hpp:1764] Master attempted to send message to disconnected framework c002796f-a98d-4e55-bee3-f51b8d06323b-0007 (Eremetic) at scheduler(1)@10.254.69.140:31570
> Line 1083: W1104 08:54:53.621279 15503 master.hpp:1764] Master attempted to send message to disconnected framework c002796f-a98d-4e55-bee3-f51b8d06323b-0004 (Eremetic) at scheduler(1)@10.254.74.77:31956
> Line 1085: W1104 08:54:53.621354 15503 master.hpp:1764] Master attempted to send message to disconnected framework c002796f-a98d-4e55-bee3-f51b8d06323b-0002 (Eremetic) at scheduler(1)@10.254.77.2:31460
> Line 1219: I1104 09:09:09.933365 15502 master.cpp:1212] Framework c002796f-a98d-4e55-bee3-f51b8d06323b-0005 (cassandra.ava) at scheduler-6849089f-1a44-4101-b5b7-0960da81b910@10.254.69.140:36495 disconnected
> Line 1220: I1104 09:09:09.933404 15502 master.cpp:2527] Disconnecting framework c002796f-a98d-4e55-bee3-f51b8d06323b-0005 (cassandra.ava) at scheduler-6849089f-1a44-4101-b5b7-0960da81b910@10.254.69.140:36495
> Line 1222: W1104 09:09:09.933518 15502 master.hpp:1764] Master attempted to send message to disconnected framework c002796f-a98d-4e55-bee3-f51b8d06323b-0005 (cassandra.ava) at scheduler-6849089f-1a44-4101-b5b7-0960da81b910@10.254.69.140:36495
> Line 1223: W1104 09:09:09.933697 15502 master.hpp:1764] Master attempted to send message to disconnected framework c002796f-a98d-4e55-bee3-f51b8d06323b-0005 (cassandra.ava) at scheduler-6849089f-1a44-4101-b5b7-0960da81b910@10.254.69.140:36495
> Line 1224: W1104 09:09:09.933768 15502 master.hpp:1764] Master attempted to send message to disconnected framework c002796f-a98d-4e55-bee3-f51b8d06323b-0005 (cassandra.ava) at scheduler-6849089f-1a44-4101-b5b7-0960da81b910@10.254.69.140:36495
> Line 1225: W1104 09:09:09.933825 15502 master.hpp:1764] Master attempted to send message to disconnected framework c002796f-a98d-4e55-bee3-f51b

[VOTE] Release Apache Mesos 1.1.0 (rc3)

2016-11-04 Thread Till Toenshoff
Hi all,

Please vote on releasing the following candidate as Apache Mesos 1.1.0.


1.1.0 includes the following:

  * [MESOS-2449] - **Experimental** support for launching a group of tasks
via a new `LAUNCH_GROUP` Offer operation. Mesos will guarantee that either
all tasks or none of the tasks in the group are delivered to the executor.
Executors receive the task group via a new `LAUNCH_GROUP` event.

  * [MESOS-2533] - **Experimental** support for HTTP and HTTPS health checks.
Executors may now use the updated `HealthCheck` protobuf to implement
HTTP(S) health checks (a sketch in JSON form follows this list). Both default
executors (command and docker) leverage the `curl` binary for sending HTTP(S)
requests and connect to `127.0.0.1`, hence a task must listen on all
interfaces. On Linux, for BRIDGE and USER modes, the docker executor enters
the task's network namespace.

  * [MESOS-3421] - **Experimental** Support sharing of resources across
containers. Currently persistent volumes are the only resources allowed to
be shared.

  * [MESOS-3567] - **Experimental** support for TCP health checks. Executors
may now use the updated `HealthCheck` protobuf to implement TCP health checks
(also covered by the sketch after this list). Both default executors (command
and docker) connect to `127.0.0.1`, hence a task must listen on all
interfaces. On Linux, for BRIDGE and USER modes, the docker executor enters
the task's network namespace.

  * [MESOS-4324] - Allow tasks to access persistent volumes in either a
read-only or read-write manner. Using a volume in read-only mode can
simplify sharing that volume between multiple tasks on the same agent.

  * [MESOS-5275] - **Experimental** support for linux capabilities. Frameworks
or operators now have fine-grained control over the capabilities that a
container may have. This allows a container to run as root, but not have all
the privileges associated with the root user (e.g., CAP_SYS_ADMIN).

  * [MESOS-5344] - **Experimental** support for partition-aware Mesos
frameworks. In previous Mesos releases, when an agent is partitioned from
the master and then reregisters with the cluster, all tasks running on the
agent are terminated and the agent is shut down. In Mesos 1.1, partitioned
agents will no longer be shut down when they reregister with the master. By
default, tasks running on such agents will still be killed (for backward
compatibility); however, frameworks can opt in to the new PARTITION_AWARE
capability (see the `FrameworkInfo` sketch after this list). If they do this,
their tasks will not be killed when a partition is healed. This allows
frameworks to define their own policies for how to handle partitioned tasks.
Enabling the PARTITION_AWARE capability also introduces a new set of task
states: TASK_UNREACHABLE, TASK_DROPPED, TASK_GONE, TASK_GONE_BY_OPERATOR, and
TASK_UNKNOWN. These new states are intended to eventually replace the
TASK_LOST state.

  * [MESOS-5788] - **Experimental** support for a Java scheduler adapter. This
adapter allows framework developers to toggle between the old and new API
(driver/scheduler library) implementations, thereby allowing them to easily
transition their frameworks to the new v1 Scheduler API.

  * [MESOS-6014] - **Experimental** A new port-mapper CNI plugin,
`mesos-cni-port-mapper`, has been introduced. For Mesos containers, with the
CNI port-mapper plugin, users can now expose container ports through host
ports using DNAT. This is especially useful when Mesos containers are
attached to isolated CNI networks such as private bridge networks, and the
services running in the container need to be exposed outside these
isolated networks.

  * [MESOS-6077] - **Experimental** A new default executor is introduced which
frameworks can use to launch task groups as nested containers. All the
nested containers share resources like cpu, memory, network, and volumes.
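
To make a couple of the entries above more concrete, here are hedged sketches
written as Python dicts mirroring the JSON accepted by the v1 HTTP API. All
ports, paths, and names below are made-up placeholders, and the field names
should be checked against the 1.1.0 protobuf definitions.

A task carrying one of the new-style health checks (MESOS-2533 / MESOS-3567)
would attach a `HealthCheck` message shaped roughly like this:

    # Hypothetical HealthCheck payloads; a task uses one check at a time.
    http_health_check = {
        "type": "HTTP",
        "http": {"port": 8080, "path": "/health"},  # placeholder port/path
        "delay_seconds": 5,
        "interval_seconds": 10,
        "timeout_seconds": 2,
        "consecutive_failures": 3,
    }

    tcp_health_check = {
        "type": "TCP",
        "tcp": {"port": 9042},  # placeholder port
        "interval_seconds": 10,
    }

For partition-aware frameworks (MESOS-5344), opting in amounts to advertising
the new capability in `FrameworkInfo` when subscribing:

    # Hypothetical FrameworkInfo fragment for a v1 SUBSCRIBE call.
    framework_info = {
        "user": "nobody",                           # placeholder
        "name": "my-partition-aware-framework",     # placeholder
        "capabilities": [{"type": "PARTITION_AWARE"}],
    }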

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.0-rc3


The candidate for Mesos 1.1.0 release is available at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.0-rc3/mesos-1.1.0.tar.gz

The tag to be voted on is 1.1.0-rc3:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=commit;h=1.1.0-rc3

The MD5 checksum of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.0-rc3/mesos-1.1.0.tar.gz.md5

The signature of the tarball can be found at:
https://dist.apache.org/repos/dist/dev/mesos/1.1.0-rc3/mesos-1.1.0.tar.gz.asc

The PGP key used to sign the release is here:
https://dist.apache.org/repos/dist/release/mesos/KEYS

The JAR is up in Maven in a staging repository here:
https://repository.apache.org/content/repositories/orgapachemesos-1166

Please vote on releasing this package as Apache Mesos 1.1.0.