Re: RDD operation examples with data?
I would check out the source examples on Spark's GitHub:
https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples

And Zhen He put together a great web page with summaries and examples of each function:
http://apache-spark-user-list.1001560.n3.nabble.com/A-new-resource-for-getting-examples-of-Spark-RDD-API-calls-td5529.html

Hope this helps!
Jacob

On Thu, Jul 31, 2014 at 3:00 PM, Chris Curtin curtin.ch...@gmail.com wrote:

Hi,

I'm learning Spark, and I am confused about when to use the many different operations on RDDs. Does anyone have any examples which show example inputs and resulting outputs for the various RDD operations and, if the operation takes a Function, a simple example of the code?

For example, something like this for flatMap:

One row: the quick brown fox

Passed to:

JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
  @Override
  public Iterable<String> call(String s) {
    return Arrays.asList(SPACE.split(s));
  }
});

When completed: words would contain the, quick, brown, fox

(Yes, this one is pretty obvious, but some of the others aren't.) If such examples don't exist, is there a shared wiki or someplace we could start building one?

Thanks,
Chris
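Since the thread asks for small input/output examples: here is a plain-Python sketch of the map vs. flatMap semantics. This is not the Spark API, just an illustration of what the two operations do to a partition's elements, with hypothetical helper names.

```python
# Plain-Python sketch of RDD map vs. flatMap semantics (not Spark itself).
def rdd_map(f, partition):
    return [f(x) for x in partition]   # exactly one output element per input

def rdd_flat_map(f, partition):
    out = []
    for x in partition:
        out.extend(f(x))               # f returns a sequence; results are concatenated
    return out

lines = ["the quick brown fox"]
print(rdd_map(str.split, lines))       # a nested list: one list per input row
print(rdd_flat_map(str.split, lines))  # a flat list of words
```

So for the one-row input above, flatMap yields the four words directly, while map would yield a single element that is itself a four-word list.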
Re: spark with docker: errors with akka, NAT?
Long story [1] short: akka opens up dynamic, random ports for each job [2]. So, simple NAT fails. You might try some trickery with a DNS server and docker's --net=host.

[1] http://apache-spark-user-list.1001560.n3.nabble.com/Comprehensive-Port-Configuration-reference-tt5384.html#none
[2] http://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075

From: Mohit Jaggi mohitja...@gmail.com
To: user@spark.apache.org
Date: 06/16/2014 05:36 PM
Subject: spark with docker: errors with akka, NAT?

Hi Folks,

I am having trouble getting the spark driver running in docker. If I run a pyspark example on my mac it works, but the same example on a docker image (via boot2docker) fails with the following logs. I am pointing the spark driver (which is running the example) to a spark cluster (the driver is not part of the cluster). I guess this has something to do with docker's networking stack (it may be getting NAT'd), but I am not sure why (if at all) the spark-worker or spark-master is trying to create a new TCP connection to the driver, instead of responding on the connection initiated by the driver.

I would appreciate any help in figuring this out.

Thanks,
Mohit.

logs

Spark Executor Command: java -cp ::/home/ayasdi/spark/conf:/home//spark/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar -Xms2g -Xmx2g -Xms512M -Xmx512M org.apache.spark.executor.CoarseGrainedExecutorBackend akka.tcp://spark@fc31887475e3:43921/user/CoarseGrainedScheduler 1 cobalt 24 akka.tcp://sparkWorker@:33952/user/Worker app-20140616152201-0021

log4j:WARN No appenders could be found for logger (org.apache.hadoop.conf.Configuration).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
14/06/16 15:22:05 INFO SparkHadoopUtil: Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
14/06/16 15:22:05 INFO SecurityManager: Changing view acls to: ayasdi,root
14/06/16 15:22:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(xxx, xxx)
14/06/16 15:22:05 INFO Slf4jLogger: Slf4jLogger started
14/06/16 15:22:05 INFO Remoting: Starting remoting
14/06/16 15:22:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@:33536]
14/06/16 15:22:06 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@:33536]
14/06/16 15:22:06 INFO CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@fc31887475e3:43921/user/CoarseGrainedScheduler
14/06/16 15:22:06 INFO WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@:33952/user/Worker
14/06/16 15:22:06 WARN Remoting: Tried to associate with unreachable remote address [akka.tcp://spark@fc31887475e3:43921]. Address is now gated for 6 ms, all messages to this address will be delivered to dead letters.
14/06/16 15:22:06 ERROR CoarseGrainedExecutorBackend: Driver Disassociated [akka.tcp://sparkExecutor@:33536] - [akka.tcp://spark@fc31887475e3:43921] disassociated! Shutting down.
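The failure mode in those logs: the executor does not reply on the driver's existing connection; it opens a new connection back to whatever address is embedded in the driver's akka URL. Inside a Docker container that host is the container hostname (fc31887475e3 here), which is unreachable from outside the NAT. A small plain-Python sketch (hypothetical helper, not part of Spark) that pulls the advertised address out of such a URL:

```python
import re

# The executor dials *back* to the host:port embedded in the driver's akka URL.
# If that host is a Docker container hostname, nothing outside the container
# can resolve or reach it -- hence "Driver Disassociated".
def parse_akka_url(url):
    m = re.match(r"akka\.tcp://(\w+)@([^:]*):(\d+)(/.*)?", url)
    if not m:
        raise ValueError("not an akka.tcp URL: %r" % url)
    system, host, port, path = m.groups()
    return {"system": system, "host": host, "port": int(port), "path": path}

addr = parse_akka_url("akka.tcp://spark@fc31887475e3:43921/user/CoarseGrainedScheduler")
print(addr["host"], addr["port"])  # the container hostname and a dynamic port
```

This is why --net=host (or making the advertised hostname resolvable via DNS) works around the problem: the advertised address becomes one the cluster can actually reach.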
Re: Comprehensive Port Configuration reference?
Howdy Andrew,

This is a standalone cluster. And, yes, if my understanding of Spark terminology is correct, you are correct about the port ownerships.

Jacob

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075

From: Andrew Ash and...@andrewash.com
To: user@spark.apache.org
Date: 05/28/2014 05:18 PM
Subject: Re: Comprehensive Port Configuration reference?

Hmm, those do look like 4 listening ports to me. PID 3404 is an executor and PID 4762 is a worker? This is a standalone cluster?

On Wed, May 28, 2014 at 8:22 AM, Jacob Eisinger jeis...@us.ibm.com wrote:

Howdy Andrew,

Here is what I ran before an application context was created (other services have been deleted):

# netstat -l -t tcp -p --numeric-ports
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address       Foreign Address  State   PID/Program name
tcp6  0      0      10.90.17.100:       :::*             LISTEN  4762/java
tcp6  0      0      :::8081             :::*             LISTEN  4762/java

And then, while the application context is up:

# netstat -l -t tcp -p --numeric-ports
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address       Foreign Address  State   PID/Program name
tcp6  0      0      10.90.17.100:       :::*             LISTEN  4762/java
tcp6  0      0      :::57286            :::*             LISTEN  3404/java
tcp6  0      0      10.90.17.100:38118  :::*             LISTEN  3404/java
tcp6  0      0      10.90.17.100:35530  :::*             LISTEN  3404/java
tcp6  0      0      :::60235            :::*             LISTEN  3404/java
tcp6  0      0      :::8081             :::*             LISTEN  4762/java

My understanding is that this says four ports are open. Are 57286 and 60235 not being used?

Jacob

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075

From: Andrew Ash and...@andrewash.com
To: user@spark.apache.org
Date: 05/25/2014 06:25 PM
Subject: Re: Comprehensive Port Configuration reference?
Hi Jacob,

The config option spark.history.ui.port is new for 1.0. The problem that the History Server solves is that in non-standalone cluster deployment modes (Mesos and YARN) there is no long-lived Spark Master that can store logs and statistics about an application after it finishes. The History Server is the UI that renders logged data from applications after they complete.

Read more here: https://issues.apache.org/jira/browse/SPARK-1276 and https://github.com/apache/spark/pull/204

As far as the two vs. four dynamic ports, are those all listening ports? I did observe 4 ports in use, but only two of them were listening. The other two were the random ports used for responses on outbound connections: the source port of the (srcIP, srcPort, dstIP, dstPort) tuple that uniquely identifies a TCP socket. http://unix.stackexchange.com/questions/75011/how-does-the-server-find-out-what-client-port-to-send-to

Thanks for taking a look through! I also realized that I had a couple of mistakes in the 0.9 to 1.0 transition, so I documented those appropriately in the updated PR as well.

Cheers!
Andrew

On Fri, May 23, 2014 at 2:43 PM, Jacob Eisinger jeis...@us.ibm.com wrote:

Howdy Andrew,

I noticed you have a configuration item that we were not aware of: spark.history.ui.port. Is that new for 1.0?

Also, we noticed that the Workers and the Drivers were opening up four dynamic ports per application context. It looks like you were seeing two. Everything else looks like it aligns!

Jacob

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075

From: Andrew Ash and...@andrewash.com
To: user@spark.apache.org
Date: 05/23/2014 10:30 AM
Subject: Re: Comprehensive Port Configuration reference?
Hi everyone, I've also been interested in better understanding what ports are used where and the direction the network connections go. I've observed a running cluster and read through code, and came up with the below documentation addition. https
Re: Comprehensive Port Configuration reference?
Howdy Andrew,

Here is what I ran before an application context was created (other services have been deleted):

# netstat -l -t tcp -p --numeric-ports
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address       Foreign Address  State   PID/Program name
tcp6  0      0      10.90.17.100:       :::*             LISTEN  4762/java
tcp6  0      0      :::8081             :::*             LISTEN  4762/java

And then, while the application context is up:

# netstat -l -t tcp -p --numeric-ports
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address       Foreign Address  State   PID/Program name
tcp6  0      0      10.90.17.100:       :::*             LISTEN  4762/java
tcp6  0      0      :::57286            :::*             LISTEN  3404/java
tcp6  0      0      10.90.17.100:38118  :::*             LISTEN  3404/java
tcp6  0      0      10.90.17.100:35530  :::*             LISTEN  3404/java
tcp6  0      0      :::60235            :::*             LISTEN  3404/java
tcp6  0      0      :::8081             :::*             LISTEN  4762/java

My understanding is that this says four ports are open. Are 57286 and 60235 not being used?

Jacob

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075

From: Andrew Ash and...@andrewash.com
To: user@spark.apache.org
Date: 05/25/2014 06:25 PM
Subject: Re: Comprehensive Port Configuration reference?

Hi Jacob,

The config option spark.history.ui.port is new for 1.0. The problem that the History Server solves is that in non-standalone cluster deployment modes (Mesos and YARN) there is no long-lived Spark Master that can store logs and statistics about an application after it finishes. The History Server is the UI that renders logged data from applications after they complete.

Read more here: https://issues.apache.org/jira/browse/SPARK-1276 and https://github.com/apache/spark/pull/204

As far as the two vs. four dynamic ports, are those all listening ports? I did observe 4 ports in use, but only two of them were listening. The other two were the random ports used for responses on outbound connections: the source port of the (srcIP, srcPort, dstIP, dstPort) tuple that uniquely identifies a TCP socket.
http://unix.stackexchange.com/questions/75011/how-does-the-server-find-out-what-client-port-to-send-to

Thanks for taking a look through! I also realized that I had a couple of mistakes in the 0.9 to 1.0 transition, so I documented those appropriately in the updated PR as well.

Cheers!
Andrew

On Fri, May 23, 2014 at 2:43 PM, Jacob Eisinger jeis...@us.ibm.com wrote:

Howdy Andrew,

I noticed you have a configuration item that we were not aware of: spark.history.ui.port. Is that new for 1.0?

Also, we noticed that the Workers and the Drivers were opening up four dynamic ports per application context. It looks like you were seeing two. Everything else looks like it aligns!

Jacob

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075

From: Andrew Ash and...@andrewash.com
To: user@spark.apache.org
Date: 05/23/2014 10:30 AM
Subject: Re: Comprehensive Port Configuration reference?

Hi everyone,

I've also been interested in better understanding what ports are used where and the direction the network connections go. I've observed a running cluster and read through code, and came up with the below documentation addition. https://github.com/apache/spark/pull/856

Scott and Jacob -- it sounds like you two have pulled together some of this yourselves for writing firewall rules. Would you mind taking a look at this pull request and confirming that it matches your observations? Wrong documentation is worse than no documentation, so I'd like to make sure this is right.

Cheers,
Andrew

On Wed, May 7, 2014 at 10:19 AM, Mark Baker dist...@acm.org wrote:

On Tue, May 6, 2014 at 9:09 AM, Jacob Eisinger jeis...@us.ibm.com wrote:

In a nutshell, Spark opens up a couple of well-known ports.
And then the workers and the shell open up dynamic ports for each job. These dynamic ports make securing the Spark network difficult.

Indeed.

Judging by the frequency with which this topic arises, this is a concern for many (myself included). I couldn't find anything in JIRA about it, but I'm curious to know whether the Spark team considers this a problem in need of a fix?

Mark.
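A minimal illustration (plain Python, hypothetical helper name) of why these dynamic ports are so hard to pin down for firewall rules: binding to port 0 asks the kernel for an arbitrary ephemeral port, which is essentially what the workers and shell do per job.

```python
import socket

# Binding to port 0 lets the kernel pick an arbitrary ephemeral port,
# typically from net.ipv4.ip_local_port_range -- essentially what the
# workers do for each job's dynamic services.
def grab_ephemeral_port():
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind(("127.0.0.1", 0))
    port = s.getsockname()[1]
    s.close()
    return port

ports = sorted(grab_ephemeral_port() for _ in range(5))
print(ports)  # a handful of unpredictable high-numbered ports
```

Each run yields different ports, so a port-level firewall rule written today is stale tomorrow, which is exactly the concern raised in this thread.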
Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.
Howdy Gerard,

Yeah, the docker link feature seems to work well for client-server interaction. But peer-to-peer architectures need more for service discovery.

As for your addressing requirements, I don't completely understand what you are asking for... you may also want to check out xip.io. Their wildcard domains sometimes make for an easy, neat hack.

Finally, for the ports, Docker's new host networking [1] feature helps out with making a Spark Docker container. (Security is still an issue.)

Jacob

[1] http://blog.docker.io/2014/05/docker-0-11-release-candidate-for-1-0/

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075

From: Gerard Maas gerard.m...@gmail.com
To: user@spark.apache.org
Date: 05/16/2014 10:26 AM
Subject: Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

Hi Jacob,

Thanks for the helpful answer on the docker question. Have you already experimented with the new link feature in Docker? That does not help the HDFS issue, as the DataNode needs the NameNode and vice versa, but it does facilitate simpler client-server interactions.

My issue described at the beginning is related to networking between the host and the docker images, but I was losing too much time tracking down the exact problem, so I moved my Spark job driver into the mesos node and it started working. Sadly, my Mesos UI is partially crippled, as workers are not addressable (and therefore spark job logs are hard to gather).

Your discussion about dynamic port allocation is very relevant to understanding why some components cannot talk with each other. I'll need to have a more in-depth read of that discussion to find a better solution for my local development environment.

regards, Gerard.

On Tue, May 6, 2014 at 3:30 PM, Jacob Eisinger jeis...@us.ibm.com wrote:

Howdy,

You might find the discussion Andrew and I have been having about Docker and network security [1] applicable. Also, I posted an answer [2] to your stackoverflow question.
[1] http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-driver-interacting-with-Workers-in-YARN-mode-firewall-blocking-communication-tp5237p5441.html
[2] http://stackoverflow.com/questions/23410505/how-to-run-hdfs-cluster-without-dns/23495100#23495100

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075

From: Gerard Maas gerard.m...@gmail.com
To: user@spark.apache.org
Date: 05/05/2014 04:18 PM
Subject: Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

Hi Benjamin,

Yes, we initially used a modified version of the AmpLab docker scripts [1]. The AmpLab docker images are a good starting point. One of the biggest hurdles has been HDFS, which requires reverse DNS, and I didn't want to go the dnsmasq route to keep the containers relatively simple to use without the need of external scripts. I ended up running a 1-node setup nnode+dnode. I'm still looking for a better solution for HDFS [2].

Our use case for docker is to easily create local dev environments both for development and for automated functional testing (using cucumber). My aim is to strongly reduce the time of the develop-deploy-test cycle. That also means that we run the minimum number of instances required to have a functionally working setup. E.g. 1 Zookeeper, 1 Kafka broker, ...

For the actual cluster deployment, we have a Chef-based devops toolchain that puts things in place on public cloud providers. Personally, I think Docker rocks and I would like to replace those complex cookbooks with Dockerfiles once the technology is mature enough.

-greetz, Gerard.
[1] https://github.com/amplab/docker-scripts
[2] http://stackoverflow.com/questions/23410505/how-to-run-hdfs-cluster-without-dns

On Mon, May 5, 2014 at 11:00 PM, Benjamin bboui...@gmail.com wrote:

Hi,

Before considering running on Mesos, did you try to submit the application on Spark deployed without Mesos on Docker containers? Currently investigating this idea to deploy quickly a complete set of clusters with Docker, I'm interested by your findings on sharing the settings of Kafka and Zookeeper across nodes. How many brokers and zookeepers do you use?

Regards,

On Mon, May 5, 2014 at 10:11 PM, Gerard Maas gerard.m...@gmail.com wrote:

Hi all,

I'm currently working on creating a set of docker images to facilitate local development with Spark/streaming on Mesos (+zk, hdfs, kafka). After solving the initial hurdles to get things
Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.
Howdy,

You might find the discussion Andrew and I have been having about Docker and network security [1] applicable. Also, I posted an answer [2] to your stackoverflow question.

[1] http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-driver-interacting-with-Workers-in-YARN-mode-firewall-blocking-communication-tp5237p5441.html
[2] http://stackoverflow.com/questions/23410505/how-to-run-hdfs-cluster-without-dns/23495100#23495100

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075

From: Gerard Maas gerard.m...@gmail.com
To: user@spark.apache.org
Date: 05/05/2014 04:18 PM
Subject: Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

Hi Benjamin,

Yes, we initially used a modified version of the AmpLab docker scripts [1]. The AmpLab docker images are a good starting point. One of the biggest hurdles has been HDFS, which requires reverse DNS, and I didn't want to go the dnsmasq route to keep the containers relatively simple to use without the need of external scripts. I ended up running a 1-node setup nnode+dnode. I'm still looking for a better solution for HDFS [2].

Our use case for docker is to easily create local dev environments both for development and for automated functional testing (using cucumber). My aim is to strongly reduce the time of the develop-deploy-test cycle. That also means that we run the minimum number of instances required to have a functionally working setup. E.g. 1 Zookeeper, 1 Kafka broker, ...

For the actual cluster deployment, we have a Chef-based devops toolchain that puts things in place on public cloud providers. Personally, I think Docker rocks and I would like to replace those complex cookbooks with Dockerfiles once the technology is mature enough.

-greetz, Gerard.
[1] https://github.com/amplab/docker-scripts
[2] http://stackoverflow.com/questions/23410505/how-to-run-hdfs-cluster-without-dns

On Mon, May 5, 2014 at 11:00 PM, Benjamin bboui...@gmail.com wrote:

Hi,

Before considering running on Mesos, did you try to submit the application on Spark deployed without Mesos on Docker containers? Currently investigating this idea to deploy quickly a complete set of clusters with Docker, I'm interested by your findings on sharing the settings of Kafka and Zookeeper across nodes. How many brokers and zookeepers do you use?

Regards,

On Mon, May 5, 2014 at 10:11 PM, Gerard Maas gerard.m...@gmail.com wrote:

Hi all,

I'm currently working on creating a set of docker images to facilitate local development with Spark/streaming on Mesos (+zk, hdfs, kafka). After solving the initial hurdles to get things working together in docker containers, now everything seems to start up correctly, and the mesos UI shows the slaves as they are started.

I'm trying to submit a job from IntelliJ, and the job submissions seem to get lost in Mesos translation. The logs are not helping me figure out what's wrong, so I'm posting them here in the hope that they ring a bell and somebody could provide me a hint on what's wrong/missing with my setup.
DRIVER (IntelliJ running a Job.scala main)

14/05/05 21:52:31 INFO MetadataCleaner: Ran metadata cleaner for SHUFFLE_BLOCK_MANAGER
14/05/05 21:52:31 INFO BlockManager: Dropping broadcast blocks older than 1399319251962
14/05/05 21:52:31 INFO BlockManager: Dropping non broadcast blocks older than 1399319251962
14/05/05 21:52:31 INFO MetadataCleaner: Ran metadata cleaner for BROADCAST_VARS
14/05/05 21:52:31 INFO MetadataCleaner: Ran metadata cleaner for BLOCK_MANAGER
14/05/05 21:52:32 INFO MetadataCleaner: Ran metadata cleaner for HTTP_BROADCAST
14/05/05 21:52:32 INFO MetadataCleaner: Ran metadata cleaner for MAP_OUTPUT_TRACKER
14/05/05 21:52:32 INFO MetadataCleaner: Ran metadata cleaner for SPARK_CONTEXT

MESOS MASTER

I0505 19:52:39.718080 388 master.cpp:690] Registering framework 201405051517-67113388-5050-383-6995 at scheduler(1)@127.0.1.1:58115
I0505 19:52:39.718261 388 master.cpp:493] Framework 201405051517-67113388-5050-383-6995 disconnected
I0505 19:52:39.718277 389 hierarchical_allocator_process.hpp:332] Added framework 201405051517-67113388-5050-383-6995
I0505 19:52:39.718312 388 master.cpp:520] Giving framework 201405051517-67113388-5050-383-6995 0ns to failover
I0505 19:52:39.718431 389 hierarchical_allocator_process.hpp:408] Deactivated framework 201405051517-67113388-5050-383-6995
W0505 19:52:39.718459 388 master.cpp:1388] Master returning resources offered to framework 201405051517-67113388-5050-383-6995 because the framework has terminated or is inactive
I0505 19:52:39.718567 388 master.cpp:1376] Framework failover timeout, removing framework 201405051517-67113388-5050-383-6995

MESOS SLAVE

I0505 19:49:27.662019 20 slave.cpp:1191] Asked to shut down
Re: Comprehensive Port Configuration reference?
Howdy Scott,

Please see the discussions about securing the Spark network [1] [2]. In a nutshell, Spark opens up a couple of well-known ports. And then the workers and the shell open up dynamic ports for each job. These dynamic ports make securing the Spark network difficult.

Jacob

[1] http://apache-spark-user-list.1001560.n3.nabble.com/Securing-Spark-s-Network-td4832.html
[2] http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-driver-interacting-with-Workers-in-YARN-mode-firewall-blocking-communication-td5237.html

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075

From: Scott Clasen scott.cla...@gmail.com
To: u...@spark.incubator.apache.org
Date: 05/05/2014 11:39 AM
Subject: Comprehensive Port Configuration reference?

Is there somewhere documented how one would go about configuring every open port a spark application needs? This seems like one of the main things that make running spark hard in places like EC2 where you aren't using the canned spark scripts.

Starting an app, it looks like you'll see ports open for:

BlockManager
OutputTracker
FileServer
WebUI
Local port to get callbacks from the mesos master

What else? How do I configure all of these?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Comprehensive-Port-Configuration-reference-tp5384.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication
Howdy Andrew,

I think I am running into the same issue [1] as you. It appears that Spark opens up dynamic / ephemeral [2] ports for each job on the shell and the workers. As you are finding out, this makes securing and managing the network for Spark very difficult. The port range can be found by running:

$ sysctl net.ipv4.ip_local_port_range
net.ipv4.ip_local_port_range = 32768 61000

With that being said, a couple of avenues you may try:

1. Limit the dynamic ports [3] to a more reasonable number and open all of these ports on your firewall; obviously, this might have unintended consequences like port exhaustion.
2. Secure the network another way, like through a private VPN; this may reduce Spark's performance.

If you have other workarounds, I am all ears --- please let me know!

Jacob

[1] http://apache-spark-user-list.1001560.n3.nabble.com/Securing-Spark-s-Network-tp4832p4984.html
[2] http://en.wikipedia.org/wiki/Ephemeral_port
[3] http://www.cyberciti.biz/tips/linux-increase-outgoing-network-sockets-range.html

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075

From: Andrew Lee alee...@hotmail.com
To: user@spark.apache.org
Date: 05/02/2014 03:15 PM
Subject: RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

Hi Yana,

I did. I configured the port in spark-env.sh; the problem is not the driver port, which is fixed, it's the Workers' ports that are dynamic every time they are launched in the YARN container. :-(

Any idea how to restrict the 'Workers' port range?

Date: Fri, 2 May 2014 14:49:23 -0400
Subject: Re: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication
From: yana.kadiy...@gmail.com
To: user@spark.apache.org

I think what you want to do is set spark.driver.port to a fixed port.
On Fri, May 2, 2014 at 1:52 PM, Andrew Lee alee...@hotmail.com wrote:

Hi All,

I encountered this problem when the firewall is enabled between the spark-shell and the Workers. When I launch spark-shell in yarn-client mode, I notice that the Workers on the YARN containers are trying to talk to the driver (spark-shell); however, the firewall is not opened, and this caused a timeout.

For the Workers, it tries to open listening ports on 54xxx for each Worker? Is the port random in such a case? What would be the better way to predict the ports so I can configure the firewall correctly between the driver (spark-shell) and the Workers? Is there a range of ports we can specify in the firewall/iptables?

Any ideas?
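To get a feel for what "open all of these ports on your firewall" would mean, here is a plain-Python sketch (hypothetical helper) that parses the sysctl output quoted earlier in this thread and counts the dynamic ports in the default range:

```python
# Sketch: parse sysctl's ip_local_port_range output and count how many
# ports a firewall would have to open to allow the whole ephemeral range.
def parse_port_range(sysctl_line):
    # e.g. "net.ipv4.ip_local_port_range = 32768 61000"
    value = sysctl_line.split("=", 1)[1]
    lo, hi = (int(x) for x in value.split())
    return lo, hi

lo, hi = parse_port_range("net.ipv4.ip_local_port_range = 32768 61000")
print("dynamic ports:", hi - lo + 1)  # dynamic ports: 28233
```

Narrowing the range (avenue 1 above) trades that 28k-port hole for a smaller one, at the cost of fewer ports available for all outbound connections on the box, hence the port-exhaustion caveat.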
Re: Securing Spark's Network
Howdy Akhil,

Thanks - that did help! And it made me think about how the EC2 scripts work [1] to set up security. From my understanding of EC2 security groups [2], this just sets up external access, right? (This has no effect on internal communication between the instances, right?)

I am still confused as to why I am seeing the workers open up new ports for each job.

Jacob

[1] https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L230
[2] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#default-security-group

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075

From: Akhil Das ak...@sigmoidanalytics.com
To: user@spark.apache.org
Date: 04/25/2014 12:51 PM
Subject: Re: Securing Spark's Network
Sent by: ak...@mobipulse.in

Hi Jacob,

This post might give you a brief idea about the ports being used: https://groups.google.com/forum/#!topic/spark-users/PN0WoJiB0TA

On Fri, Apr 25, 2014 at 8:53 PM, Jacob Eisinger jeis...@us.ibm.com wrote:

Howdy,

We tried running Spark 0.9.1 stand-alone inside docker containers distributed over multiple hosts. This is complicated due to Spark opening up ephemeral / dynamic ports for the workers and the CLI. To ensure our docker solution doesn't break Spark in unexpected ways and maintains a secure cluster, I am interested in understanding more about Spark's network architecture. I'd appreciate it if you could point us to any documentation!

A couple of specific questions:

1. What are these ports being used for? Checking out the code / experiments, it looks like asynchronous communication for shuffling around results. Anything else?

2. How do you secure the network? Network administrators tend to secure and monitor the network at the port level. If these ports are dynamic and open randomly, firewalls are not easily configured and security alarms are raised. Is there a way to limit the range easily?
(We did investigate setting the kernel parameter ip_local_reserved_ports, but this is broken [1] on some versions of Linux cgroups.)

Thanks,
Jacob

[1] https://github.com/lxc/lxc/issues/97

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075