So, no, I don't have Elasticsearch behind HAProxy. For each instance of Elasticsearch I run, I specify the ports to use (a range). Now, Spark can't discover Elasticsearch nodes the way Elasticsearch nodes discover each other, so that could be a challenge. That said, each ES node can be connected to directly, so perhaps you could register each node and use Mesos-DNS with static ports.

Another option is to have your Spark app do a little bit of service discovery of its own. Perhaps specify a range of ports that Elasticsearch COULD be running on, plus the nodes it could be running on, and have Spark go guess and check. Since it's your app, you can put in whatever logic you want.

I guess what I am saying is that there is nothing built into Spark or ES to do what you want, but between Mesos-DNS and your ability to customize code in Spark, there should be a clever approach that fits your environment.
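To make the guess-and-check idea a bit more concrete, here is a rough Python sketch. It is only an illustration, not something I have running: the function names are made up for this example, and the hosts and port range you pass in are whatever fits your cluster. It simply tries a TCP connection to every host/port pair and keeps the ones that answer.

```python
import socket
from typing import Callable, Iterable, List


def port_is_open(host: str, port: int, timeout: float = 0.5) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def discover_es_nodes(
    hosts: Iterable[str],
    ports: Iterable[int],
    probe: Callable[[str, int], bool] = port_is_open,
) -> List[str]:
    """Guess and check: return "host:port" for every pair the probe accepts.

    `probe` is injectable so the check can be swapped for something stricter,
    or replaced with a fake when testing.
    """
    port_list = list(ports)  # allow one-shot iterables to be reused per host
    found = []
    for host in hosts:
        for port in port_list:
            if probe(host, port):
                found.append(f"{host}:{port}")
    return found


# Example call (hypothetical host names and port range):
#   nodes = discover_es_nodes(["node1", "node2"], range(9200, 9210))
#   es_nodes = ",".join(nodes)
```

Keep in mind that an open port only tells you something is listening there, not that it is Elasticsearch, so a stricter probe (for example an HTTP request to the node) is probably worth adding. As far as I know, the comma-separated result is the sort of value the elasticsearch-hadoop connector takes in its `es.nodes` setting, but check the connector docs for your version.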
On Wed, Dec 30, 2015 at 12:10 PM, vincent gromakowski <[email protected]> wrote:

> Can you confirm what I understand? Spark will connect to Elasticsearch
> through the service port (meaning HAProxy) and then will get the direct
> IP/ports for the topology?
>
> 2015-12-30 19:06 GMT+01:00 John Omernik <[email protected]>:
>
>> I would say that service discovery is only for those services that don't
>> have a built-in method for discovery. When I run Elasticsearch, I specify
>> the port range it can start in, and let it run. If the port is taken, it
>> tries a different one (I am using the Elasticsearch for YARN package
>> running on Apache Myriad). Since I know which nodes and what port ranges
>> to use, I just add that to my Elasticsearch config, and thus HAProxy is
>> not intercepting that traffic. If I have a front end running in Flask
>> that connects to the ES back end, then I would use Mesos-DNS with HAProxy
>> to solve that problem. In addition, Spark as a framework does its own
>> service discovery, so HAProxy wouldn't be getting in between Spark nodes;
>> same with Kafka (I haven't played with Cassandra yet).
>>
>> There is some work being done on IP per container which will help with
>> this as well, but all in all, I've found that as long as I am somewhat
>> smart about my frameworks, I can manage them (my cluster isn't huge
>> either). As things grow, I am hoping to grow into IP per container.
>>
>> John
>>
>> On Wed, Dec 30, 2015 at 11:56 AM, vincent gromakowski
>> <[email protected]> wrote:
>>
>>> I am currently using Mesos as a big data backend for Spark, Cassandra,
>>> Kafka and Elasticsearch, but I cannot find a good overall design
>>> regarding service discovery. Let me explain:
>>> Generally, service discovery is managed by an HAProxy instance on each
>>> node, which redirects traffic from service ports to the real assigned
>>> network ports. Currently I am not using it because the cluster is quite
>>> small and I don't need to deploy lots of services, but I am thinking
>>> about a future design that will allow me to scale.
>>> The problem with HAProxy handling all network traffic is that I am
>>> afraid it will break data locality, which is so important for
>>> performance in the big data world.
>>> For example, when Spark tries to connect to Elasticsearch, it will
>>> discover the Elasticsearch topology and try to launch tasks next to
>>> the Elasticsearch shards. If HAProxy intercepts network flows, what
>>> would be the result? Will HAProxy masquerade the Elasticsearch
>>> IP/ports? Same thing for Kafka and Cassandra?
>>>
>>> I assume it depends on each connector, but it's very hard to find any
>>> information. Thanks for your help if you have any experience with it.
>>> Regards

