This is very cool work. I recently had a chat with another company that is
thinking about doing the exact same thing.

I think the proposal is missing several details, which makes it hard to
evaluate on paper (I also saw your presentation).


1) Failure semantics: they seem to be the same in the proposed design.


As a framework author, how do you suggest I deal with tasks on multiple
clusters? I feel there have to be richer semantics for the task, at least at
the mesos.proto level, e.g. a state like STATUS_FAILED_DC_OUTAGE or something
along those lines.

We respawn operators, and having this information would allow me, as a
framework author, to wait a little longer before declaring that task dead
(KILLED/FAILED/LOST) and respawning it in a different data center.

I would love to get details on how you are thinking of extending the failure
semantics for multiple data centers.
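
For concreteness, here is the rough Python sketch of what I have in mind as a
framework author. Everything in it is hypothetical: TASK_FAILED_DC_OUTAGE does
not exist in mesos.proto today, and dc_is_reachable / respawn_in_other_dc stand
in for whatever the federation layer would actually expose.

    # Hypothetical sketch only; none of these names are part of Mesos today.
    import time

    DC_OUTAGE_GRACE_SECONDS = 120   # how long to wait out a DC-wide failure
    TERMINAL_STATES = {"TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


    class DcAwareTaskTracker:
        def __init__(self, dc_is_reachable, respawn_in_other_dc):
            self._dc_is_reachable = dc_is_reachable          # hypothetical gossip check
            self._respawn_in_other_dc = respawn_in_other_dc  # hypothetical helper
            self._suspect = {}   # task_id -> (datacenter, first_seen_timestamp)

        def on_status_update(self, task_id, state, datacenter):
            if state == "TASK_FAILED_DC_OUTAGE":
                # The DC is gone, but the task itself may still be fine; hold
                # off on declaring it dead and give the DC a chance to return.
                self._suspect.setdefault(task_id, (datacenter, time.time()))
            elif state in TERMINAL_STATES:
                self._suspect.pop(task_id, None)
                self._respawn_in_other_dc(task_id)

        def reconcile(self):
            # Run periodically: only after the grace period, and only if the
            # DC still looks unreachable, treat the task as lost and respawn.
            now = time.time()
            for task_id, (dc, since) in list(self._suspect.items()):
                if self._dc_is_reachable(dc):
                    self._suspect.pop(task_id)               # DC came back
                elif now - since > DC_OUTAGE_GRACE_SECONDS:
                    self._suspect.pop(task_id)
                    self._respawn_in_other_dc(task_id)

The point is just that a DC-scoped failure state lets the framework park a task
in a "suspect" bucket for a grace period instead of immediately declaring it
dead.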


2) Can you share more details about the allocator modules?


After reading the proposal, I understand it as follows:


[ gossiper ] -> [ allocator module ] -> [mesos master]


Is this correct? If so, are you saying that you can tell the mesos master
to run a task that was fulfilled by a framework in a different data
center?

Is the constraint that you are forced to run a scheduler per framework in
each data center?
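
To make my mental model concrete, here is a toy Python sketch of the flow in
the diagram above. Every name in it (GossipFeed, FederatedAllocator,
LocalMaster, forward_to_dc) is made up to frame the question; none of it is
from your design.

    # Toy illustration of the gossiper -> allocator module -> master flow.
    class GossipFeed:
        """Pretend gossiper: knows which remote DCs have spare capacity."""
        def __init__(self, remote_capacity):
            self.remote_capacity = remote_capacity   # e.g. {"dc-west": {"cpus": 32}}

        def dcs_with(self, cpus):
            return [dc for dc, cap in self.remote_capacity.items()
                    if cap["cpus"] >= cpus]


    class LocalMaster:
        """Pretend local mesos master."""
        def __init__(self, free_cpus):
            self.free_cpus = free_cpus

        def has_capacity(self, cpus):
            return self.free_cpus >= cpus

        def launch(self, task):
            self.free_cpus -= task["cpus"]


    class FederatedAllocator:
        """Pretend allocator module sitting between gossiper and master."""
        def __init__(self, gossip, local_master, forward_to_dc):
            self.gossip = gossip
            self.local_master = local_master
            self.forward_to_dc = forward_to_dc       # hypothetical cross-DC hand-off

        def place(self, task):
            if self.local_master.has_capacity(task["cpus"]):
                self.local_master.launch(task)       # normal single-DC path
                return "local"
            candidates = self.gossip.dcs_with(task["cpus"])
            if candidates:
                self.forward_to_dc(candidates[0], task)
                return candidates[0]
            return "unsatisfied"

The part I would like the doc to spell out is the forward_to_dc step: who owns
the cross-DC hand-off, and does the framework need a scheduler registered in
that remote DC for it to work?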



3) High availability


High availability in a multi-DC layout means something entirely different.
Are all frameworks now on standby in every other cluster? The problem I
see with this is that the metadata stored by each framework to support HA
now has to span multiple DCs. It would be nice to extend/expose an API at
the Mesos level for setting state.

a) In the normal Mesos layout, this key=value data store would be
ZooKeeper.

b) In the multi-DC layout it could be a ZooKeeper per data center, and one
could then piggyback on the gossiper to replicate that state to the other
data centers (rough sketch below).
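
Something along these lines is what I mean by exposing a state API at the
Mesos level. The kazoo calls are the real ZooKeeper client API; the
publish_fn / apply_remote gossip hooks are purely hypothetical placeholders
for the gossiper.

    # Sketch for 3b: writes land in the local DC's ZooKeeper, and the
    # gossiper fans the update out to the other DCs.
    from kazoo.client import KazooClient


    class FederatedState:
        """Framework HA state: local ZooKeeper for durability, gossip for fan-out."""

        def __init__(self, local_zk_hosts, publish_fn, prefix="/framework_state"):
            self.zk = KazooClient(hosts=local_zk_hosts)
            self.zk.start()
            self.publish = publish_fn      # hypothetical gossip broadcast hook
            self.prefix = prefix

        def set(self, key, value):
            # value is bytes; write locally first, then let the gossiper
            # replicate the update to the other data centers.
            path = self.prefix + "/" + key
            self.zk.ensure_path(path)
            self.zk.set(path, value)
            self.publish({"key": key, "value": value})

        def apply_remote(self, update):
            # Called when the gossiper delivers a write that originated in
            # another DC; mirror it into the local ZooKeeper.
            path = self.prefix + "/" + update["key"]
            self.zk.ensure_path(path)
            self.zk.set(path, update["value"])

        def get(self, key):
            data, _stat = self.zk.get(self.prefix + "/" + key)
            return data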


4) Metrics / Monitoring: probably down the line, but it would be good to also
piggyback some of the mesos master endpoints through the gossip
architecture.
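
Even something as small as this would help: scrape the master's existing
/metrics/snapshot endpoint in each DC and ship it over the gossip channel
(gossip_send here is a made-up placeholder for whatever the federation layer
exposes).

    # Minimal sketch: poll the local master's /metrics/snapshot (a real
    # Mesos endpoint) and hand the result to the gossip layer.
    import json
    import time
    import urllib.request


    def poll_and_gossip(master_url, dc_name, gossip_send, interval=30):
        while True:
            with urllib.request.urlopen(master_url + "/metrics/snapshot") as resp:
                metrics = json.loads(resp.read().decode("utf-8"))
            gossip_send({"dc": dc_name, "metrics": metrics, "ts": time.time()})
            time.sleep(interval)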



Again, very cool work. I would love to get some more details on the actual
implementation that you built, plus your thoughts on the points above.

- Alex







On Wed, Jul 13, 2016 at 6:11 PM, DhilipKumar Sankaranarayanan <
[email protected]> wrote:

> Hi All,
>
> Please find the initial version of the Design Document
> <https://docs.google.com/document/d/1U4IY_ObAXUPhtTa-0Rw_5zQxHDRnJFe5uFNOQ0VUcLg/edit?usp=sharing>
> for Federating Mesos Clusters.
>
>
> https://docs.google.com/document/d/1U4IY_ObAXUPhtTa-0Rw_5zQxHDRnJFe5uFNOQ0VUcLg/edit?usp=sharing
>
> We at Huawei have been working on this federation project for the past few
> months. We also got an opportunity to present it at the recent MesosCon
> 2016. Based on further discussions and the feedback we have received so far,
> we have greatly simplified the design.
>
> Also, I see that no one is assigned to this JIRA right now; could I get it
> assigned to myself? It would also be great to know if anyone is willing to
> shepherd this.
>
> I would also like to bring this up in the community Sync that happens
> tomorrow.
>
> We would love to hear your thoughts, and we would be glad to collaborate
> with you on the implementation.
>
> Regards,
> Dhilip
>
>
> Reference:
> JIRA: https://issues.apache.org/jira/browse/MESOS-3548
> Slides:
> http://www.slideshare.net/mKrishnaKumar1/federated-mesos-clusters-for-global-data-center-designs
> Video :
> https://www.youtube.com/watch?v=kqyVQzwwD5E&index=17&list=PLGeM09tlguZQVL7ZsfNMffX9h1rGNVqnC
>
>


-- 
Alexander Gallego
Co-Founder & CTO