Re: RDD operation examples with data?

2014-07-31 Thread Jacob Eisinger
I would check out the source examples on Spark's Github:
https://github.com/apache/spark/tree/master/examples/src/main/scala/org/apache/spark/examples

And, Zhen He put together a great web page with summaries and examples of
each function:
http://apache-spark-user-list.1001560.n3.nabble.com/A-new-resource-for-getting-examples-of-Spark-RDD-API-calls-td5529.html
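If a second sketch helps, here is the same input/output style for
mapToPair / reduceByKey, assuming the Spark 1.x Java API (imports:
JavaPairRDD, PairFunction, Function2, scala.Tuple2) and a words RDD like
the one in your flatMap snippet below; the input is tweaked to repeat a
word so the aggregation is visible:

// Input (words): the, quick, brown, the
JavaPairRDD<String, Integer> counts = words
    .mapToPair(new PairFunction<String, String, Integer>() {
      @Override
      public Tuple2<String, Integer> call(String s) {
        return new Tuple2<String, Integer>(s, 1);  // ("the", 1), ("quick", 1), ...
      }
    })
    .reduceByKey(new Function2<Integer, Integer, Integer>() {
      @Override
      public Integer call(Integer a, Integer b) {
        return a + b;  // sum the 1s per distinct word
      }
    });
// Output (counts): ("the", 2), ("quick", 1), ("brown", 1)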

Hope this helps!

Jacob


On Thu, Jul 31, 2014 at 3:00 PM, Chris Curtin curtin.ch...@gmail.com
wrote:

 Hi,

 I'm learning Spark and I am confused about when to use the many different
 operations on RDDs. Does anyone have examples which show sample inputs
 and resulting outputs for the various RDD operations, and, if the operation
 takes a Function, a simple example of the code?

 For example, something like this for flatMap:

 One row: "the quick brown fox"

 Passed to:

 JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
   @Override
   public Iterable<String> call(String s) {
     return Arrays.asList(SPACE.split(s));
   }
 });

 When completed, words would contain:
 the
 quick
 brown
 fox

 (Yes this one is pretty obvious but some of the others aren't).

 If such examples don't exist, is there a shared wiki or someplace we could 
 start building one?

 Thanks,

 Chris




Re: spark with docker: errors with akka, NAT?

2014-06-17 Thread Jacob Eisinger

Long story [1] short, akka opens up dynamic, random ports for each job [2],
so simple NAT fails.  You might try some trickery with a DNS server and
docker's --net=host.
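For example (a sketch only; the image and script names are made up, and
--net=host needs Docker 0.11+):

   # Share the host's network stack so akka's randomly chosen ports are
   # reachable without NAT or explicit port mappings:
   $ docker run --net=host my-spark-driver-image \
       bin/spark-submit --master spark://spark-master:7077 my_job.py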


[1]
http://apache-spark-user-list.1001560.n3.nabble.com/Comprehensive-Port-Configuration-reference-tt5384.html#none
[2]
http://spark.apache.org/docs/latest/spark-standalone.html#configuring-ports-for-network-security

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075



From:   Mohit Jaggi mohitja...@gmail.com
To: user@spark.apache.org
Date:   06/16/2014 05:36 PM
Subject:spark with docker: errors with akka, NAT?



Hi Folks,


I am having trouble getting spark driver running in docker. If I run a
pyspark example on my mac it works but the same example on a docker image
(Via boot2docker) fails with following logs. I am pointing the spark driver
(which is running the example) to a spark cluster (driver is not part of
the cluster). I guess this has something to do with docker's networking
stack (it may be getting NAT'd) but I am not sure why (if at all) the
spark-worker or spark-master is trying to create a new TCP connection to
the driver, instead of responding on the connection initiated by the
driver.


I would appreciate any help in figuring this out.


Thanks,


Mohit.


logs:

Spark Executor Command: java -cp
::/home/ayasdi/spark/conf:/home//spark/assembly/target/scala-2.10/spark-assembly-1.0.0-hadoop1.0.4.jar
-Xms2g -Xmx2g -Xms512M -Xmx512M
org.apache.spark.executor.CoarseGrainedExecutorBackend
akka.tcp://spark@fc31887475e3:43921/user/CoarseGrainedScheduler 1 cobalt 24
akka.tcp://sparkWorker@:33952/user/Worker app-20140616152201-0021

log4j:WARN No appenders could be found for logger
(org.apache.hadoop.conf.Configuration).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.
14/06/16 15:22:05 INFO SparkHadoopUtil: Using Spark's default log4j
profile: org/apache/spark/log4j-defaults.properties
14/06/16 15:22:05 INFO SecurityManager: Changing view acls to: ayasdi,root
14/06/16 15:22:05 INFO SecurityManager: SecurityManager: authentication
disabled; ui acls disabled; users with view permissions: Set(xxx, xxx)
14/06/16 15:22:05 INFO Slf4jLogger: Slf4jLogger started
14/06/16 15:22:05 INFO Remoting: Starting remoting
14/06/16 15:22:06 INFO Remoting: Remoting started; listening on
addresses :[akka.tcp://sparkExecutor@:33536]
14/06/16 15:22:06 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sparkExecutor@:33536]
14/06/16 15:22:06 INFO CoarseGrainedExecutorBackend: Connecting to driver:
akka.tcp://spark@fc31887475e3:43921/user/CoarseGrainedScheduler
14/06/16 15:22:06 INFO WorkerWatcher: Connecting to worker
akka.tcp://sparkWorker@:33952/user/Worker
14/06/16 15:22:06 WARN Remoting: Tried to associate with unreachable remote
address [akka.tcp://spark@fc31887475e3:43921]. Address is now gated for
6 ms, all messages to this address will be delivered to dead letters.
14/06/16 15:22:06 ERROR CoarseGrainedExecutorBackend: Driver Disassociated
[akka.tcp://sparkExecutor@:33536] -
[akka.tcp://spark@fc31887475e3:43921] disassociated! Shutting down.


Re: Comprehensive Port Configuration reference?

2014-05-29 Thread Jacob Eisinger

Howdy Andrew,

This is a standalone cluster.  And, yes, if my understanding of Spark
terminology is correct, you are correct about the port ownerships.

Jacob

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075



From:   Andrew Ash and...@andrewash.com
To: user@spark.apache.org
Date:   05/28/2014 05:18 PM
Subject:Re: Comprehensive Port Configuration reference?



Hmm, those do look like 4 listening ports to me.  PID 3404 is an executor
and PID 4762 is a worker?  This is a standalone cluster?


On Wed, May 28, 2014 at 8:22 AM, Jacob Eisinger jeis...@us.ibm.com wrote:
  Howdy Andrew,

  Here is what I ran before an application context was created (other
  services have been deleted):


# netstat -l -t tcp -p  --numeric-ports
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 10.90.17.100:           :::*                    LISTEN      4762/java
tcp6       0      0 :::8081                 :::*                    LISTEN      4762/java

  And, then while the application context is up:

# netstat -l -t tcp -p  --numeric-ports
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp6       0      0 10.90.17.100:           :::*                    LISTEN      4762/java
tcp6       0      0 :::57286                :::*                    LISTEN      3404/java
tcp6       0      0 10.90.17.100:38118      :::*                    LISTEN      3404/java
tcp6       0      0 10.90.17.100:35530      :::*                    LISTEN      3404/java
tcp6       0      0 :::60235                :::*                    LISTEN      3404/java
tcp6       0      0 :::8081                 :::*                    LISTEN      4762/java

  My understanding is that this says four ports are open.  Are 57286 and
  60235 not being used?


  Jacob

  Jacob D. Eisinger
  IBM Emerging Technologies
  jeis...@us.ibm.com - (512) 286-6075



  From: Andrew Ash and...@andrewash.com
  To: user@spark.apache.org
  Date: 05/25/2014 06:25 PM

  Subject: Re: Comprehensive Port Configuration reference?



  Hi Jacob,

  The config option spark.history.ui.port is new for 1.0.  The problem that
  History server solves is that in non-Standalone cluster deployment modes
  (Mesos and YARN) there is no long-lived Spark Master that can store logs
  and statistics about an application after it finishes.  History server is
  the UI that renders logged data from applications after they complete.

  Read more here: https://issues.apache.org/jira/browse/SPARK-1276 and
  https://github.com/apache/spark/pull/204

  As far as the two vs four dynamic ports, are those all listening ports?
  I did observe 4 ports in use, but only two of them were listening.  The
  other two were the random ports used for responses on outbound
  connections, the source port of the (srcIP, srcPort, dstIP, dstPort)
  tuple that uniquely identifies a TCP socket.

  
http://unix.stackexchange.com/questions/75011/how-does-the-server-find-out-what-client-port-to-send-to


  Thanks for taking a look through!

  I also realized that I had a couple of mistakes in the 0.9 to 1.0
  transition, so I've now documented those as well in the updated PR.

  Cheers!
  Andrew



  On Fri, May 23, 2014 at 2:43 PM, Jacob Eisinger jeis...@us.ibm.com
  wrote:
Howdy Andrew,

I noticed you have a configuration item that we were not aware of:
spark.history.ui.port .  Is that new for 1.0?

Also, we noticed that the Workers and the Drivers were opening up
four dynamic ports per application context.  It looks like you were
seeing two.

Everything else looks like it aligns!
Jacob




Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075


From: Andrew Ash and...@andrewash.com
To: user@spark.apache.org
Date: 05/23/2014 10:30 AM
Subject: Re: Comprehensive Port Configuration reference?





Hi everyone,

I've also been interested in better understanding what ports are
used where and the direction the network connections go.  I've
observed a running cluster and read through code, and came up with
the below documentation addition.

https://github.com/apache/spark/pull/856

Re: Comprehensive Port Configuration reference?

2014-05-28 Thread Jacob Eisinger

Howdy Andrew,

Here is what I ran before an application context was created (other
services have been deleted):
   # netstat -l -t tcp -p  --numeric-ports
   Active Internet connections (only servers)
   Proto Recv-Q Send-Q Local Address   Foreign Address     State   PID/Program name
   tcp6   0  0 10.90.17.100:       :::*                LISTEN  4762/java
   tcp6   0  0 :::8081             :::*                LISTEN  4762/java

And, then while the application context is up:

   # netstat -l -t tcp -p  --numeric-ports
   Active Internet connections (only servers)
   Proto Recv-Q Send-Q Local Address   Foreign Address     State   PID/Program name
   tcp6   0  0 10.90.17.100:       :::*                LISTEN  4762/java
   tcp6   0  0 :::57286            :::*                LISTEN  3404/java
   tcp6   0  0 10.90.17.100:38118  :::*                LISTEN  3404/java
   tcp6   0  0 10.90.17.100:35530  :::*                LISTEN  3404/java
   tcp6   0  0 :::60235            :::*                LISTEN  3404/java
   tcp6   0  0 :::8081             :::*                LISTEN  4762/java

My understanding is that this says four ports are open.  Are 57286 and 60235
not being used?

Jacob

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075



From:   Andrew Ash and...@andrewash.com
To: user@spark.apache.org
Date:   05/25/2014 06:25 PM
Subject:Re: Comprehensive Port Configuration reference?



Hi Jacob,

The config option spark.history.ui.port is new for 1.0.  The problem that
History server solves is that in non-Standalone cluster deployment modes
(Mesos and YARN) there is no long-lived Spark Master that can store logs
and statistics about an application after it finishes.  History server is
the UI that renders logged data from applications after they complete.

Read more here: https://issues.apache.org/jira/browse/SPARK-1276 and
https://github.com/apache/spark/pull/204

As far as the two vs four dynamic ports, are those all listening ports?  I
did observe 4 ports in use, but only two of them were listening.  The other
two were the random ports used for responses on outbound connections, the
source port of the (srcIP, srcPort, dstIP, dstPort) tuple that uniquely
identifies a TCP socket.

http://unix.stackexchange.com/questions/75011/how-does-the-server-find-out-what-client-port-to-send-to
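For what it's worth, one way to tell the two apart (a sketch, reusing the
flags from Jacob's runs above):

   # listening server sockets only:
   $ netstat -l -t -p --numeric-ports
   # all TCP sockets, including the ESTABLISHED outbound connections whose
   # random source ports account for the other two:
   $ netstat -a -t -p --numeric-ports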

Thanks for taking a look through!

I also realized that I had a couple of mistakes in the 0.9 to 1.0
transition, so I've now documented those as well in the updated PR.

Cheers!
Andrew



On Fri, May 23, 2014 at 2:43 PM, Jacob Eisinger jeis...@us.ibm.com wrote:
  Howdy Andrew,

  I noticed you have a configuration item that we were not aware of:
  spark.history.ui.port .  Is that new for 1.0?

  Also, we noticed that the Workers and the Drivers were opening up four
  dynamic ports per application context.  It looks like you were seeing
  two.

  Everything else looks like it aligns!
  Jacob




  Jacob D. Eisinger
  IBM Emerging Technologies
  jeis...@us.ibm.com - (512) 286-6075


  From: Andrew Ash and...@andrewash.com
  To: user@spark.apache.org
  Date: 05/23/2014 10:30 AM
  Subject: Re: Comprehensive Port Configuration reference?



  Hi everyone,

  I've also been interested in better understanding what ports are used
  where and the direction the network connections go.  I've observed a
  running cluster and read through code, and came up with the below
  documentation addition.

  https://github.com/apache/spark/pull/856

  Scott and Jacob -- it sounds like you two have pulled together some of
  this yourselves for writing firewall rules.  Would you mind taking a look
  at this pull request and confirming that it matches your observations?
  Wrong documentation is worse than no documentation, so I'd like to make
  sure this is right.

  Cheers,
  Andrew


  On Wed, May 7, 2014 at 10:19 AM, Mark Baker dist...@acm.org wrote:
On Tue, May 6, 2014 at 9:09 AM, Jacob Eisinger jeis...@us.ibm.com
wrote:
  In a nutshell, Spark opens up a couple of well-known ports.
 And then the workers and the shell open up dynamic ports for each
 job.  These dynamic ports make securing the Spark network
 difficult.

Indeed.

Judging by the frequency with which this topic arises, this is a
concern for many (myself included).

I couldn't find anything in JIRA about it, but I'm curious to know
whether the Spark team considers this a problem in need of a fix?

Mark.







Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-20 Thread Jacob Eisinger

Howdy Gerard,

Yeah, the docker link feature seems to work well for client-server
interaction.  But, peer-to-peer architectures need more for service
discovery.

As for your addressing requirements, I don't completely understand what you
are asking for... you may also want to check out xip.io.  Its wildcard
domains sometimes make for an easy, neat hack.
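For example (a sketch; the hostname and IP are made up):

   # anything under <ip>.xip.io resolves to that IP, no DNS setup needed:
   $ nslookup worker-1.10.0.7.2.xip.io
   # -> 10.0.7.2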

Finally, for the ports, Docker's new host networking [1] feature helps out
with making a Spark Docker container.  (Security is still an issue.)

Jacob

[1] http://blog.docker.io/2014/05/docker-0-11-release-candidate-for-1-0/


Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075



From:   Gerard Maas gerard.m...@gmail.com
To: user@spark.apache.org
Date:   05/16/2014 10:26 AM
Subject:Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't
submit jobs.



Hi Jacob,

Thanks for the help and the answer on the docker question. Have you already
experimented with the new link feature in Docker? That does not help the
HDFS issue, as the DataNode needs the namenode and vice versa, but it does
facilitate simpler client-server interactions.

My issue described at the beginning is related to networking between the
host and the docker images, but I was losing too much time tracking down
the exact problem, so I moved my Spark job driver onto the mesos node and
it started working.  Sadly, my Mesos UI is partially crippled, as workers
are not addressable (and therefore spark job logs are hard to gather).

Your discussion about dynamic port allocation is very relevant to
understanding why some components cannot talk with each other.  I'll need
to have a more in-depth read of that discussion to find a better solution
for my local development environment.

regards,  Gerard.



On Tue, May 6, 2014 at 3:30 PM, Jacob Eisinger jeis...@us.ibm.com wrote:
  Howdy,

  You might find the discussion Andrew and I have been having about Docker
  and network security [1] applicable.

  Also, I posted an answer [2] to your stackoverflow question.

  [1]
  
http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-driver-interacting-with-Workers-in-YARN-mode-firewall-blocking-communication-tp5237p5441.html

  [2]
  
http://stackoverflow.com/questions/23410505/how-to-run-hdfs-cluster-without-dns/23495100#23495100


  Jacob D. Eisinger
  IBM Emerging Technologies
  jeis...@us.ibm.com - (512) 286-6075


  From: Gerard Maas gerard.m...@gmail.com
  To: user@spark.apache.org
  Date: 05/05/2014 04:18 PM
  Subject: Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't
  submit jobs.




  Hi Benjamin,

  Yes, we initially used a modified version of the AmpLabs docker scripts
  [1]. The amplab docker images are a good starting point.
  One of the biggest hurdles has been HDFS, which requires reverse DNS, and
  I didn't want to go the dnsmasq route, to keep the containers relatively
  simple to use without the need of external scripts. I ended up running a
  1-node setup (namenode + datanode). I'm still looking for a better
  solution for HDFS [2].

  Our use case for docker is to easily create local dev environments, both
  for development and for automated functional testing (using cucumber). My
  aim is to strongly reduce the time of the develop-deploy-test cycle. That
  also means that we run the minimum number of instances required to have a
  functionally working setup, e.g. 1 Zookeeper, 1 Kafka broker, ...

  For the actual cluster deployment we have a Chef-based devops toolchain
  that puts things in place on public cloud providers.
  Personally, I think Docker rocks and would like to replace those complex
  cookbooks with Dockerfiles once the technology is mature enough.

  -greetz, Gerard.

  [1] https://github.com/amplab/docker-scripts
  [2]
  
http://stackoverflow.com/questions/23410505/how-to-run-hdfs-cluster-without-dns



  On Mon, May 5, 2014 at 11:00 PM, Benjamin bboui...@gmail.com wrote:
Hi,

Before considering running on Mesos, did you try to submit the
application on Spark deployed without Mesos, on Docker containers?

Currently investigating this idea to deploy quickly a complete set
of clusters with Docker, I'm interested in your findings on sharing
the settings of Kafka and Zookeeper across nodes. How many brokers
and ZooKeeper nodes do you use?

Regards,



On Mon, May 5, 2014 at 10:11 PM, Gerard Maas gerard.m...@gmail.com
 wrote:
  Hi all,

  I'm currently working on creating a set of docker images to
  facilitate local development with Spark/streaming on Mesos
  (+zk, hdfs, kafka)

  After solving the initial hurdles to get things

Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't submit jobs.

2014-05-06 Thread Jacob Eisinger

Howdy,

You might find the discussion Andrew and I have been having about Docker
and network security [1] applicable.

Also, I posted an answer [2] to your stackoverflow question.

[1]
http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-driver-interacting-with-Workers-in-YARN-mode-firewall-blocking-communication-tp5237p5441.html
[2]
http://stackoverflow.com/questions/23410505/how-to-run-hdfs-cluster-without-dns/23495100#23495100

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075



From:   Gerard Maas gerard.m...@gmail.com
To: user@spark.apache.org
Date:   05/05/2014 04:18 PM
Subject:Re: Local Dev Env with Mesos + Spark Streaming on Docker: Can't
submit jobs.



Hi Benjamin,

Yes, we initially used a modified version of the AmpLabs docker scripts
[1]. The amplab docker images are a good starting point.
One of the biggest hurdles has been HDFS, which requires reverse DNS, and I
didn't want to go the dnsmasq route, to keep the containers relatively
simple to use without the need of external scripts. I ended up running a
1-node setup (namenode + datanode). I'm still looking for a better solution
for HDFS [2].

Our use case for docker is to easily create local dev environments, both
for development and for automated functional testing (using cucumber). My
aim is to strongly reduce the time of the develop-deploy-test cycle.
That also means that we run the minimum number of instances required to
have a functionally working setup, e.g. 1 Zookeeper, 1 Kafka broker, ...

For the actual cluster deployment we have a Chef-based devops toolchain
that puts things in place on public cloud providers.
Personally, I think Docker rocks and would like to replace those complex
cookbooks with Dockerfiles once the technology is mature enough.

-greetz, Gerard.

[1] https://github.com/amplab/docker-scripts
[2]
http://stackoverflow.com/questions/23410505/how-to-run-hdfs-cluster-without-dns


On Mon, May 5, 2014 at 11:00 PM, Benjamin bboui...@gmail.com wrote:
  Hi,

  Before considering running on Mesos, did you try to submit the
  application on Spark deployed without Mesos, on Docker containers?

  Currently investigating this idea to deploy quickly a complete set of
  clusters with Docker, I'm interested in your findings on sharing the
  settings of Kafka and Zookeeper across nodes. How many brokers and
  ZooKeeper nodes do you use?

  Regards,



  On Mon, May 5, 2014 at 10:11 PM, Gerard Maas gerard.m...@gmail.com
  wrote:
   Hi all,

   I'm currently working on creating a set of docker images to facilitate
   local development with Spark/streaming on Mesos (+zk, hdfs, kafka)

   After solving the initial hurdles to get things working together in
   docker containers, everything now seems to start up correctly, and the
   mesos UI shows slaves as they are started.

   I'm trying to submit a job from IntelliJ and the job submissions seem
   to get lost in Mesos translation. The logs are not helping me figure
   out what's wrong, so I'm posting them here in the hope that they ring a
   bell and somebody could give me a hint on what's wrong/missing with my
   setup.


    DRIVER (IntelliJ running a Job.scala main) 
   14/05/05 21:52:31 INFO MetadataCleaner: Ran metadata cleaner for
   SHUFFLE_BLOCK_MANAGER
   14/05/05 21:52:31 INFO BlockManager: Dropping broadcast blocks older
   than 1399319251962
   14/05/05 21:52:31 INFO BlockManager: Dropping non broadcast blocks older
   than 1399319251962
   14/05/05 21:52:31 INFO MetadataCleaner: Ran metadata cleaner for
   BROADCAST_VARS
   14/05/05 21:52:31 INFO MetadataCleaner: Ran metadata cleaner for
   BLOCK_MANAGER
   14/05/05 21:52:32 INFO MetadataCleaner: Ran metadata cleaner for
   HTTP_BROADCAST
   14/05/05 21:52:32 INFO MetadataCleaner: Ran metadata cleaner for
   MAP_OUTPUT_TRACKER
   14/05/05 21:52:32 INFO MetadataCleaner: Ran metadata cleaner for
   SPARK_CONTEXT


    MESOS MASTER 
   I0505 19:52:39.718080   388 master.cpp:690] Registering framework
   201405051517-67113388-5050-383-6995 at scheduler(1)@127.0.1.1:58115
   I0505 19:52:39.718261   388 master.cpp:493] Framework
   201405051517-67113388-5050-383-6995 disconnected
   I0505 19:52:39.718277   389 hierarchical_allocator_process.hpp:332]
   Added framework 201405051517-67113388-5050-383-6995
   I0505 19:52:39.718312   388 master.cpp:520] Giving framework
   201405051517-67113388-5050-383-6995 0ns to failover
   I0505 19:52:39.718431   389 hierarchical_allocator_process.hpp:408]
   Deactivated framework 201405051517-67113388-5050-383-6995
   W0505 19:52:39.718459   388 master.cpp:1388] Master returning resources
   offered to framework 201405051517-67113388-5050-383-6995 because the
   framework has terminated or is inactive
   I0505 19:52:39.718567   388 master.cpp:1376] Framework failover timeout,
   removing framework 201405051517-67113388-5050-383-6995



    MESOS SLAVE 
   I0505 19:49:27.662019    20 slave.cpp:1191] Asked to shut down 

Re: Comprehensive Port Configuration reference?

2014-05-06 Thread Jacob Eisinger

Howdy Scott,

Please see the discussions about securing the Spark network [1] [2].

In a nutshell, Spark opens up a couple of well-known ports.  And then the
workers and the shell open up dynamic ports for each job.  These dynamic
ports make securing the Spark network difficult.
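For the well-known side, a sketch of pinning what can be pinned (assuming
the usual SparkConf/JavaSparkContext imports; property names are from the
1.0-era configuration docs, and the port values are arbitrary):

   SparkConf conf = new SparkConf()
       .setAppName("fixed-ports")
       .set("spark.driver.port", "51000")  // driver's akka port (normally random)
       .set("spark.ui.port", "4040");      // application web UI
   JavaSparkContext sc = new JavaSparkContext(conf);
   // The per-job ports the workers and shell open stay dynamic regardless.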

Jacob

[1]
http://apache-spark-user-list.1001560.n3.nabble.com/Securing-Spark-s-Network-td4832.html
[2]
http://apache-spark-user-list.1001560.n3.nabble.com/spark-shell-driver-interacting-with-Workers-in-YARN-mode-firewall-blocking-communication-td5237.html

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075



From:   Scott Clasen scott.cla...@gmail.com
To: u...@spark.incubator.apache.org
Date:   05/05/2014 11:39 AM
Subject:Comprehensive Port Configuration reference?



Is there somewhere documented how one would go about configuring every open
port a spark application needs?

This seems like one of the main things that make running spark hard in
places like EC2 where you aren't using the canned spark scripts.

Starting an app, it looks like you'll see ports open for:

BlockManager
MapOutputTracker
FileServer
WebUI
Local port to get callbacks from mesos master..

What else?

How do I configure all of these?



--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Comprehensive-Port-Configuration-reference-tp5384.html

Sent from the Apache Spark User List mailing list archive at Nabble.com.



RE: spark-shell driver interacting with Workers in YARN mode - firewall blocking communication

2014-05-02 Thread Jacob Eisinger

Howdy Andrew,

I think I am running into the same issue [1] as you.  It appears that Spark
opens up dynamic / ephemeral [2] ports for each job on the shell and the
workers.  As you are finding out, this makes securing and managing the
network for Spark very difficult.

> Any idea how to restrict the 'Workers' port range?
The port range can be found by running:
   $ sysctl net.ipv4.ip_local_port_range
   net.ipv4.ip_local_port_range = 32768 61000

With that being said, a couple avenues you may try:
  - Limit the dynamic ports [3] to a more reasonable number and open all
    of these ports on your firewall (sketched below); obviously, this
    might have unintended consequences like port exhaustion.
  - Secure the network another way, like through a private VPN; this may
    reduce Spark's performance.
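For the first avenue, something like this (a sketch; the range and the
firewall rule are illustrative):

   $ sysctl -w net.ipv4.ip_local_port_range="32768 32868"
   $ iptables -A INPUT -p tcp --dport 32768:32868 -j ACCEPT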

If you have other workarounds, I am all ears --- please let me know!
Jacob

[1]
http://apache-spark-user-list.1001560.n3.nabble.com/Securing-Spark-s-Network-tp4832p4984.html
[2] http://en.wikipedia.org/wiki/Ephemeral_port
[3]
http://www.cyberciti.biz/tips/linux-increase-outgoing-network-sockets-range.html

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075



From:   Andrew Lee alee...@hotmail.com
To: user@spark.apache.org user@spark.apache.org
Date:   05/02/2014 03:15 PM
Subject:RE: spark-shell driver interacting with Workers in YARN mode -
firewall blocking communication



Hi Yana,

I did. I configured the port in spark-env.sh; the problem is not the
driver port, which is fixed.
It's the Workers' ports that are dynamic every time they are launched in
the YARN container. :-(

Any idea how to restrict the 'Workers' port range?

Date: Fri, 2 May 2014 14:49:23 -0400
Subject: Re: spark-shell driver interacting with Workers in YARN mode -
firewall blocking communication
From: yana.kadiy...@gmail.com
To: user@spark.apache.org

I think what you want to do is set spark.driver.port to a fixed port.
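For example, the 0.9-era way to pass it through spark-env.sh (a sketch;
the port value is arbitrary):

   export SPARK_JAVA_OPTS="-Dspark.driver.port=51000"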


On Fri, May 2, 2014 at 1:52 PM, Andrew Lee alee...@hotmail.com wrote:
  Hi All,

  I encountered this problem when the firewall is enabled between the
  spark-shell and the Workers.

  When I launch spark-shell in yarn-client mode, I notice that the Workers
  on the YARN containers are trying to talk to the driver (spark-shell);
  however, the firewall is not open, which causes a timeout.

  For the Workers, it seems to open listening ports on 54xxx for each
  Worker. Is the port random in such a case?
  What will be the better way to predict the ports so I can configure
  the firewall correctly between the driver (spark-shell) and the
  Workers? Is there a range of ports we can specify in the
  firewall/iptables?

  Any ideas?


Re: Securing Spark's Network

2014-04-25 Thread Jacob Eisinger

Howdy Akhil,

Thanks - that did help!  And, it made me think about how the EC2 scripts
work [1] to set up security.  From my understanding of EC2 security groups
[2], this just sets up external access, right?  (This has no effect on
internal communication between the instances, right?)

I am still confused as to why I am seeing the workers open up new ports for
each job.

Jacob

[1] https://github.com/apache/spark/blob/master/ec2/spark_ec2.py#L230
[2]
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-network-security.html#default-security-group

Jacob D. Eisinger
IBM Emerging Technologies
jeis...@us.ibm.com - (512) 286-6075



From:   Akhil Das ak...@sigmoidanalytics.com
To: user@spark.apache.org
Date:   04/25/2014 12:51 PM
Subject:Re: Securing Spark's Network
Sent by:ak...@mobipulse.in



Hi Jacob,

This post might give you a brief idea about the ports being used

https://groups.google.com/forum/#!topic/spark-users/PN0WoJiB0TA





On Fri, Apr 25, 2014 at 8:53 PM, Jacob Eisinger jeis...@us.ibm.com wrote:
  Howdy,

  We tried running Spark 0.9.1 stand-alone inside docker containers
  distributed over multiple hosts. This is complicated due to Spark opening
  up ephemeral / dynamic ports for the workers and the CLI.  To ensure our
  docker solution doesn't break Spark in unexpected ways and maintains a
  secure cluster, I am interested in understanding more about Spark's
  network architecture. I'd appreciate it if you could point us to any
  documentation!

  A couple specific questions:
  1. What are these ports being used for?  Checking out the code /
     experiments, it looks like asynchronous communication for shuffling
     around results.  Anything else?
  2. How do you secure the network?  Network administrators tend to secure
     and monitor the network at the port level.  If these ports are dynamic
     and open randomly, firewalls are not easily configured and security
     alarms are raised.  Is there a way to limit the range easily?  (We did
     investigate setting the kernel parameter ip_local_reserved_ports, but
     this is broken [1] on some versions of Linux's cgroups; a sketch of the
     setting follows below.)
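  For reference, the kind of setting we experimented with (the range is
  illustrative; reserved ports are skipped when the kernel picks an
  ephemeral port):

     $ sysctl -w net.ipv4.ip_local_reserved_ports=57000-57100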

  Thanks,
  Jacob

  [1] https://github.com/lxc/lxc/issues/97

  Jacob D. Eisinger
  IBM Emerging Technologies
  jeis...@us.ibm.com - (512) 286-6075