Hi, I am new to Spark and have started writing some simple test code to figure out how things work. I am very interested in Spark Streaming and Python, but it appears that streaming is not supported in Python yet. The workaround I found by googling is to write the streaming code in either Scala or Java and use RDD pipe() to fetch the data into the Python app.

The problem is that I do not think I am getting parallel execution. In my current experiment I am using a MacBook Pro with 8 cores. I wrote a Java job that processes a small data file on disk, applies a couple of transformations, and writes the data to standard out.
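Just to spell out the pipe() mechanics as I understand them: each element of the RDD is written to the external command's stdin, one element per line, and every line the command writes to stdout comes back as a string in the result. A throw-away example of what I mean, using plain "cat" as a stand-in for the real job (the app name below is made up):

    from pyspark import SparkContext

    sc = SparkContext(appName="pipeMechanicsTest")

    rdd = sc.parallelize(["a", "b", "c"])
    # each element goes to cat's stdin as one line; each line cat writes
    # to stdout comes back as one string element
    print rdd.pipe("cat").collect()   # ['a', 'b', 'c']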
I have a simple python script that calls pipe():

    import sys
    from operator import add
    from pyspark import SparkContext

    sc = SparkContext(appName="pySparkPipeJava")

    data = [1, 2, 3, 4, 5]
    distData = sc.parallelize(data)

    # pipe() returns strings
    output = distData.pipe("../bin/runJavaJob.sh").collect()
    for num in output:
        print "pySpark: value from java job %s" % (num)

I submit the python job as follows:

    $ $SPARK_HOME/bin/spark-submit --master local[1] pySparkPipeJava.py

Everything works as expected as long as I only use a single core. If I use 4 cores I get back 4 copies of the data. My understanding is that pipe() executes the shell script once on every worker (i.e. once per partition), which would explain why I get a copy of the output from each of them. In general I want all the transforms and actions to run in parallel; however, I only want to process a single set of data in my python script.

Here is the code for runJavaJob.sh:

    $SPARK_HOME/bin/spark-submit \
      --class "myJavaSrc" \
      --master local[4] \
      $jarPath

What is really strange is that if I replace runJavaJob.sh with a simple shell script that basically just echoes its input, I can run with 4 cores and only get back one set of data. Any idea what the difference is? (My Java child does not read anything from the python app; the echo script does.)

I tried changing the number of cores in runJavaJob.sh, but that does not seem to change how much data I get back. Being limited to a single core seems like it would be severely limiting.

Here is the code for my "echo" script:

    #!/bin/sh
    #
    # Use this shell script to figure out how spark RDD pipe() works
    #
    #set -x # turns shell debugging on
    #set +x # turns shell debugging off

    while read x ; do
        echo RDDPipe.sh $x ;
    done

Thanks in advance,

Andy
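P.S. For what it is worth, this is the kind of minimal test I have in mind to confirm the once-per-partition behaviour; "wc -l" is just a stand-in for the real job, and the exact whitespace in its output may vary:

    from pyspark import SparkContext

    sc = SparkContext(appName="pipePartitionTest")

    rdd = sc.parallelize([1, 2, 3, 4, 5], 4)   # force 4 partitions
    # pipe() launches wc -l once per partition, so I get back one
    # line count per partition instead of a single total
    counts = rdd.pipe("wc -l").collect()
    print "number of launches: %d" % len(counts)   # 4
    print counts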