Hi all,

I work on an R&D project where I need to make two computing clusters
work together (an industrial datacenter and a cluster running in the
cloud). One of the two clusters acts as the "main" cluster, the other as
the "secondary" one. The idea is to test "bursting", i.e. when the main
cluster is full, it sends jobs to the secondary cluster so that it can
absorb the current load peak.

Now this is where it gets complex: as it is a small R&D project which
interacts with big industrial infrastructures, we face some strict network
restrictions (for security reasons). We obtained authorization to
open an outgoing SSH tunnel (from the industrial datacenter to the cloud),
but not an incoming one. And, of course, we are not authorized to
work around this restriction by using an outgoing tunnel in reverse mode.

I guess this could be overcome if there were a way to make it work with
one-way connection setup, i.e. only one of the two sides (master or slave)
would take care of initiating all the connections. I think the slave
absolutely needs to be able to open a connection to the master, since it is
the slave that initiates its own registration with the cluster. On the other
hand, I was not 100% sure whether the master also needs to initiate
connections (maybe to cancel tasks).

So I tried this configuration:

Datacenter  ==================>  Cloud
^^^^^^^^^^  ^^^^^^^^^^^^^^^^^^^  ^^^^^^
slaves      1-way SSH tunnel     master

The slave can reach the master at first, but it doesn't work for
very long: from the master's point of view, the slave keeps
switching between the connected and disconnected states. I think what
happens is that the slave successfully reaches the master for registration,
but then when the master tries to check whether the slave is still alive
(which happens periodically, I guess), it can't reach it (unable to open a
connection) and considers it disconnected. Then the slave registers again,
and so on.
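
To check this hypothesis, a minimal sketch like the following could be run
on the master host to see whether it can open a TCP connection back to a
slave at all. The host name "slave.datacenter.example" is a placeholder, and
I am assuming the slave listens on the default Mesos slave port 5051; both
would need to be adjusted to the actual setup.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened.

    Any failure (connection refused, unreachable, timeout, DNS error)
    is reported as False.
    """
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Example usage, run from the master host (placeholder host name,
# assumed default slave port):
#   print(can_connect("slave.datacenter.example", 5051))
```

If this returns False from the master while the slave can still reach the
master (port 5050 by default) through the tunnel, it would be consistent
with the connected/disconnected flapping described above.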

I looked for similar problems on the web, and I think I found
evidence that the master needs the ability to open connections back to the
slaves:

https://mail-archives.apache.org/mod_mbox/mesos-user/201412.mbox/%3cca+8rcorxmr2nk-sa9ipyk_uvuyr8k7xeh_abl69r0jnb3ul...@mail.gmail.com%3E

http://stackoverflow.com/a/32275220/3037171

http://stackoverflow.com/a/24559617/3037171


However, the most recent link dates back to ~September 2015, and I personally
use Mesos 0.22.1, which dates back to ~May 2015. So I was wondering whether
this particular network behavior could be overcome with the latest versions,
but I quickly read through the changelogs and didn't notice anything related
to that. I also dug into the code for several hours, but I found it hard to
understand precisely how the communication architecture of Mesos works.


So I'd be glad to have some insight from you on whether it is possible,
in one way or another, to make Mesos work without the master being able to
initiate connections to slaves. I just need to be 100% sure there isn't any
workaround before going back to my boss :)


Thank you very much for your attention!

Elouan
