Hi all,

I work on an R&D project where I need to make two computing clusters work together (an industrial datacenter and a cluster running in the cloud). One of the two clusters acts as the "main" cluster, the other as the "secondary" one. The idea is to test "bursting": when the main cluster is full, it sends jobs to the secondary cluster so that it can absorb the current load peak.

Now this is where it gets complex: since this is a small R&D project that interacts with big industrial infrastructures, we face some strict network restrictions (for security reasons). We were able to get authorization to open an outgoing SSH tunnel (from the industrial datacenter to the cloud), but not an incoming one. And, of course, we are not authorized to work around this restriction by using an outgoing tunnel in reverse mode.

I guess this could be overcome if there were a way to make Mesos work with one-way communication, i.e. only one of the two sides (master or slave) would initiate all the connections. I think the slave absolutely needs to be able to open a connection to the master, since it is the slave that initiates its own registration with the cluster. On the other hand, I was not 100% sure whether the master also needs to initiate connections (maybe to cancel tasks).

So I tried this configuration:

    Datacenter ===============> Cloud
     ^^^^^^^   ^^^^^^^^^^^^^^^^^^ ^^^^^^^
     slaves    1-way SSH tunnel   master

The slave can reach the master at the beginning, but it doesn't work for very long: afterwards, from the master's point of view, the slave keeps switching between the connected and disconnected states. I think what happens is that the slave successfully reaches the master for registration, but then, when the master tries to check whether the slave is still alive (which I guess happens periodically), it can't reach it (unable to open a connection) and considers it disconnected. Then the slave registers again, and so on.
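
To make the cycle I suspect more concrete, here is a toy Python sketch. It is not Mesos code and does not use any Mesos API: `ToyMaster`, `can_dial_slave` and `simulate` are hypothetical names made up for illustration, and the idea that the master checks liveness by opening a new connection to the slave is only my assumption based on the symptoms.

```python
from dataclasses import dataclass, field


@dataclass
class ToyMaster:
    """Toy stand-in for the master (hypothetical, not real Mesos code)."""
    can_dial_slave: bool          # False models the one-way SSH tunnel
    connected: bool = False       # master's current view of the slave
    history: list = field(default_factory=list)

    def on_register(self):
        # Slave-initiated connection: goes through the outgoing tunnel.
        self.connected = True
        self.history.append("connected")

    def health_check(self):
        # Master-initiated connection (assumed): fails if it cannot dial back.
        if self.connected and not self.can_dial_slave:
            self.connected = False
            self.history.append("disconnected")


def simulate(can_dial_slave, rounds=3):
    """Alternate slave (re-)registration and master health checks."""
    master = ToyMaster(can_dial_slave)
    for _ in range(rounds):
        if not master.connected:
            master.on_register()   # slave registers again after being dropped
        master.health_check()      # master probes the slave
    return master.history


if __name__ == "__main__":
    print("one-way tunnel:", simulate(can_dial_slave=False))
    print("two-way network:", simulate(can_dial_slave=True))
```

With `can_dial_slave=False` the history alternates between "connected" and "disconnected" on every round, which matches what I see on the master; with `can_dial_slave=True` the slave registers once and stays connected.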

I looked for a similar problem on the web, and I think I found evidence that the master needs to be able to open connections back to the slaves:

https://mail-archives.apache.org/mod_mbox/mesos-user/201412.mbox/%3cca+8rcorxmr2nk-sa9ipyk_uvuyr8k7xeh_abl69r0jnb3ul...@mail.gmail.com%3E
http://stackoverflow.com/a/32275220/3037171
http://stackoverflow.com/a/24559617/3037171

However, the most recent link dates back to ~September 2015, and I personally use Mesos 0.22.1, which dates back to ~May 2015. So I was wondering whether this network behavior could be overcome with the latest versions, but I skimmed the changelogs and didn't notice anything related to it. I also dug into the code for several hours, but I found it hard to understand precisely how the communication architecture of Mesos works.

So I'd be glad to have some insight from you guys on whether it is possible, one way or another, to make Mesos work without the master being able to initiate connections to the slaves. I just need to be 100% sure there isn't any workaround before going back to my boss :)

Thank you very much for your attention!

Elouan