On Thu, Jul 14, 2016 at 2:40 AM, DhilipKumar Sankaranarayanan
<[email protected]> wrote:
> Hi Alex,
>
> Thanks for taking a look. We have simplified the design since the
> conference. The Allocation and Anonymous modules were only helping us
> to control the offers sent to the frameworks. Now we think that Roles
> and Quota in Mesos elegantly solve this problem, and we could take
> advantage of them.

Sounds good. Given that the design is entirely different now, can you
share some of these thoughts?

> The current design does not propose Mesos Modules; the POC we
> demonstrated at MesosCon is slightly out of date in that respect.
>
> The current design only enforces that any Policy Engine implementation
> should honour certain REST APIs. This also takes Consul out of the
> picture, but at Huawei our implementation would pretty much consider
> Consul or something similar.
>
> 1) Failure semantics
> I do agree it is not straightforward to declare that a DC is lost just
> because a framework lost the connection intermittently. By probing the
> 'Gossiper' we would know that the DC is still active but just not
> reachable from us; in that case it is worth the wait. Only if the DC
> in question is not reachable from every other DC could we come to such
> a conclusion.

How do you envision frameworks integrating w/ this? Are you saying that
frameworks should poll the HTTP endpoint of the Gossiper?

> 2) Can you share more details about the allocator modules.
> As mentioned earlier, these modules are no longer relevant; we have a
> much simpler way to achieve this.
>
> 3) High Availability
> I think you are talking about the section below?
> "Sequence Diagram for High Availability
> (In case of local datacenter failure)
> Very similar to the cloud-bursting use-case scenario."
> The sequence diagram only represents the flow of events in case the
> current datacenter fails and the framework needs to connect to a new
> one. It is not talking about the approach you mentioned. I will update
> the doc with a couple more diagrams soon to make it more
> understandable. We would certainly like to have a federated K/V
> storage layer across the DCs, which is why Consul was considered in
> the first place.

Does this mean that you have to run the actual framework code in all of
the DCs? Or have you yet to iron this out?

> 4) Metrics / Monitoring - probably down the line
> The experimental version of the gossiper already queries the master at
> a frequent interval, and the gossipers exchange this data amongst
> themselves.
>
> Ultimately DC federation is a hard problem to solve. We have plenty of
> use cases, which is why we wanted to reach out to the community, share
> our experience, and build something that is useful for all of us.
>
> Thanks!!

Excited about this work.

> Regards,
> Dhilip
>
> On Wed, Jul 13, 2016 at 7:58 PM, Alexander Gallego <[email protected]>
> wrote:
>
>> This is very cool work; I had a chat w/ another company thinking
>> about doing the exact same thing.
>>
>> I think the proposal is missing several details that make it hard to
>> evaluate on paper (I also saw your presentation).
>>
>> 1) Failure semantics: these seem to be unchanged in the proposed
>> design.
>>
>> As a framework author, how do you suggest dealing w/ tasks on
>> multiple clusters? I.e., I feel like there have to be richer
>> semantics about the task, at least at the mesos.proto level, where
>> the state is STATUS_FAILED_DC_OUTAGE or something along those lines.
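To make that concrete, here is a minimal sketch (in Python) of what a
framework scheduler could do with such a state. Everything in it is
hypothetical: the TASK_FAILED_DC_OUTAGE state, the gossiper's
/v1/datacenters endpoint, and its response shape illustrate the
semantics being discussed, not anything Mesos or the proposal defines
today.

    # Hypothetical handling of a DC-outage task state in a framework
    # scheduler. `requests` is the third-party HTTP client library.
    import time
    import requests

    GOSSIPER_URL = "http://gossiper.local:8080/v1/datacenters"  # invented
    GRACE_PERIOD_SECS = 300  # how long to wait out a suspected partition

    def dc_reachable_from_peers(dc_id):
        """Ask the gossiper whether any other DC can still reach dc_id."""
        dcs = requests.get(GOSSIPER_URL, timeout=5).json()
        return dcs.get(dc_id, {}).get("reachable_from_peers", False)

    def on_status_update(task_id, state, dc_id):
        if state != "TASK_FAILED_DC_OUTAGE":  # hypothetical new state
            return  # handle TASK_FINISHED/TASK_FAILED/etc. as usual
        # The DC may only be partitioned from us, not down. Wait out a
        # grace period; if no other DC can see it either, declare the
        # task lost and respawn it elsewhere.
        deadline = time.time() + GRACE_PERIOD_SECS
        while time.time() < deadline:
            if dc_reachable_from_peers(dc_id):
                return  # a partition, not an outage; keep waiting
            time.sleep(10)
        respawn_in_another_dc(task_id)

    def respawn_in_another_dc(task_id):
        # Framework-specific recovery hook; stubbed for illustration.
        print("re-launching %s in a healthy DC" % task_id)

The point is just that a distinct state would let a scheduler tell
"this DC is partitioned from me" apart from "this task is gone", and
react differently to each.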
>> We respawn operators, and having this information may allow me as a
>> framework author to wait a little longer before declaring that task
>> dead (KILLED/FAILED/LOST) if I spawn it on a different data center.
>>
>> Would love to get details on how you were thinking of extending the
>> failure semantics for multiple datacenters.
>>
>> 2) Can you share more details about the allocator modules.
>>
>> After reading the proposal, I understand it as follows:
>>
>> [ gossiper ] -> [ allocator module ] -> [ mesos master ]
>>
>> Is this correct? If so, you are saying that you can tell the Mesos
>> master to run a task that was fulfilled by a framework on a different
>> data center?
>>
>> Is the constraint that you are forced to run a scheduler per
>> framework on each data center?
>>
>> 3) High availability
>>
>> High availability on a multi-DC layout means something entirely
>> different. So are all frameworks now on standby on every other
>> cluster? The problem I see with this is that the metadata stored by
>> each framework to support HA now has to span multiple DCs. It would
>> be nice to perhaps extend/expose an API at the Mesos level for
>> setting state.
>>
>> a) On the normal Mesos layout, this key/value data store would be
>> ZooKeeper.
>>
>> b) On the multi-DC layout it could be a ZooKeeper per data center,
>> but then one can piggyback on the gossiper to replicate that state to
>> the other data centers.
>>
>> 4) Metrics / Monitoring - probably down the line, but it would be
>> good to also piggyback some of the Mesos master endpoints through the
>> gossip architecture.
>>
>> Again, very cool work. Would love to get some more details on the
>> actual implementation that you built, plus some of the points above.
>>
>> - Alex
>>
>> On Wed, Jul 13, 2016 at 6:11 PM, DhilipKumar Sankaranarayanan <
>> [email protected]> wrote:
>>
>>> Hi All,
>>>
>>> Please find the initial version of the Design Document
>>> <https://docs.google.com/document/d/1U4IY_ObAXUPhtTa-0Rw_5zQxHDRnJFe5uFNOQ0VUcLg/edit?usp=sharing>
>>> for federating Mesos clusters:
>>>
>>> https://docs.google.com/document/d/1U4IY_ObAXUPhtTa-0Rw_5zQxHDRnJFe5uFNOQ0VUcLg/edit?usp=sharing
>>>
>>> We at Huawei have been working on this federation project for the
>>> past few months. We also got an opportunity to present it at the
>>> recent MesosCon 2016. From the discussions and feedback we have
>>> received so far, we have greatly simplified the design.
>>>
>>> Also, I see that no one is assigned to this JIRA right now; could I
>>> get it assigned to myself? It would be great to know if anyone is
>>> willing to shepherd this too.
>>>
>>> I would also like to bring this up in the community sync that
>>> happens tomorrow.
>>>
>>> We would love to hear your thoughts, and we would be glad to
>>> collaborate with you on the implementation.
>>>
>>> Regards,
>>> Dhilip
>>>
>>> References:
>>> JIRA: https://issues.apache.org/jira/browse/MESOS-3548
>>> Slides:
>>> http://www.slideshare.net/mKrishnaKumar1/federated-mesos-clusters-for-global-data-center-designs
>>> Video:
>>> https://www.youtube.com/watch?v=kqyVQzwwD5E&index=17&list=PLGeM09tlguZQVL7ZsfNMffX9h1rGNVqnC
>>
>> --
>> Alexander Gallego
>> Co-Founder & CTO

--
Alexander Gallego
Co-Founder & CTO
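PS: On 3(b) above, a minimal sketch of what "piggyback on the gossiper
to replicate that state" could look like from a framework's side. It
assumes the kazoo ZooKeeper client; the gossiper's /v1/replicate
endpoint and the host names are invented for illustration.

    # Each DC keeps its own ZooKeeper; framework HA state written
    # locally is also handed to the gossiper for replication to peers.
    import json
    import requests                       # third-party HTTP client
    from kazoo.client import KazooClient  # third-party ZooKeeper client

    LOCAL_ZK = "zk1.dc-east.local:2181"              # this DC's ZooKeeper
    GOSSIPER = "http://gossiper.dc-east.local:8080"  # invented

    def save_framework_state(framework_id, state):
        path = "/frameworks/%s/state" % framework_id

        # 1) Durable write to the local ZooKeeper, exactly as in a
        #    normal single-DC Mesos layout.
        zk = KazooClient(hosts=LOCAL_ZK)
        zk.start()
        zk.ensure_path(path)
        zk.set(path, json.dumps(state).encode("utf-8"))
        zk.stop()

        # 2) Piggyback on the gossiper so standby schedulers in other
        #    DCs can fail over with an (eventually consistent) copy.
        requests.post("%s/v1/replicate" % GOSSIPER,
                      json={"path": path, "value": state},
                      timeout=5)

A standby scheduler in another DC would then read the replicated copy
from its own local store instead of reaching across the WAN.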
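PPS: On point 4, the master already exposes /metrics/snapshot, so the
gossiper could scrape it locally and ship the result with its gossip
payload. A sketch, with the peer list and the /v1/gossip/metrics
endpoint invented:

    # Periodically scrape the local master's metrics and gossip them
    # to peer DCs. Only /metrics/snapshot is a real Mesos endpoint.
    import time
    import requests

    LOCAL_MASTER = "http://master.dc-east.local:5050"
    PEER_GOSSIPERS = ["http://gossiper.dc-west.local:8080"]  # invented

    def gossip_metrics_forever(interval_secs=15):
        while True:
            snapshot = requests.get(
                LOCAL_MASTER + "/metrics/snapshot", timeout=5).json()
            for peer in PEER_GOSSIPERS:
                requests.post(peer + "/v1/gossip/metrics",
                              json={"dc": "dc-east", "metrics": snapshot},
                              timeout=5)
            time.sleep(interval_secs)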

