I am currently using Mesos as a big data backend for Spark, Cassandra,
Kafka and Elasticsearch, but I cannot find a good overall design for
service discovery. Let me explain:
Generally, service discovery is handled by an HAProxy instance on each
node, which redirects traffic from service ports to the actual assigned
network ports. Currently I am not using it because the cluster is quite
small and I don't need to deploy many services, but I am thinking about a
future design that will allow me to scale.
The problem with HAProxy handling all network traffic is that I am
afraid it will break data locality, which is so important for performance
in the big data world.
For example, when Spark connects to Elasticsearch, it discovers the
Elasticsearch topology and tries to launch tasks next to the Elasticsearch
shards. If HAProxy intercepts the network flows, what would be the result?
Will HAProxy masquerade the Elasticsearch IPs/ports? Does the same apply
to Kafka and Cassandra?
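
To make the concern concrete, here is a rough sketch of what I am worried
about. It is not actual connector code: the host names, the shard layout
and the local_tasks helper are all made up for illustration, and I am
assuming the scheduler simply matches the advertised shard host against
the executor hosts.

```python
def local_tasks(shard_hosts, executor_hosts):
    """Return the shards whose advertised host also runs an executor."""
    executors = set(executor_hosts)
    return {shard: host for shard, host in shard_hosts.items()
            if host in executors}


# Hypothetical three-node cluster with one executor per node.
executors = ["node1", "node2", "node3"]

# Topology as seen when the real shard hosts are advertised.
direct = {"shard0": "node1", "shard1": "node2", "shard2": "node3"}

# Topology as seen if every shard appears behind the local proxy address.
proxied = {shard: "localhost" for shard in direct}

print(len(local_tasks(direct, executors)))   # 3
print(len(local_tasks(proxied, executors)))  # 0
```

If the connectors behave anything like this, every task would lose its
preferred location as soon as the proxy hides the real hosts.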

I assume it depends on each connector, but it is very hard to find any
information. Thanks for your help if you have any experience with this.
Regards
